Pandas Tutorial: How To Analyze Data With Python

1 Afghanistan Asia 1957 30.332 9240934 820.853030

2 Afghanistan Asia 1962 31.997 10267083 853.100710

3 Afghanistan Asia 1967 34.020 11537966 836.197138

4 Afghanistan Asia 1972 36.088 13079460 739.981106

DataFrame objects have a shape attribute, which reports the number of rows and columns in the DataFrame:

print(df.shape)

(1704, 6) # rows, cols

To list the names of the columns themselves, use .columns:

print(df.columns)

Index(['country', 'continent', 'year', 'lifeExp',

'pop', 'gdpPercap'], dtype='object')

DataFrames in Pandas work similarly to those in other languages, such as Julia and R. Each column, or Series, must hold a single type, while rows can contain mixed types. In our example, the country column is always a string and the year column is always an integer. We can check this by using .dtypes to list the data type of each column:

print(df.dtypes)

country object

continent object

year int64

lifeExp float64

pop int64

gdpPercap float64

dtype: object
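The same three attributes can be tried on a tiny hand-built DataFrame. This is only a sketch: the name sample is hypothetical, and the three rows are copied from the Afghanistan rows shown in this article rather than loaded from the full data set.

```python
import pandas as pd

# Hypothetical three-row sample that mimics some of the Gapminder columns
sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

print(sample.shape)          # (3, 4): three rows, four columns
print(list(sample.columns))  # column names as a plain Python list
print(sample.dtypes)         # one dtype per column
```

How the string columns are reported can depend on the Pandas version, so the dtype shown for country and continent may differ from the output above.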

For an even more detailed breakdown of the types within your DataFrame, you can use .info():

df.info() # information is written to console, so no print required

RangeIndex: 1704 entries, 0 to 1703

Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 country 1704 non-null object

1 continent 1704 non-null object

2 year 1704 non-null int64

3 lifeExp 1704 non-null float64

4 pop 1704 non-null int64

5 gdpPercap 1704 non-null float64

dtypes: float64(2), int64(2), object(2)

memory usage: 80.0+ KB

Each Pandas data type is mapped to a native Python counterpart:

object is treated like a Python str type.
int64 is handled like a Python int. Note that not every Python int can be converted to int64: anything greater than (2**63)-1 will not be converted.
float64 is handled like a Python float (natively 64-bit).
datetime64 is handled like a Python datetime.datetime object. Pandas does not automatically attempt to convert values that look like dates into dates. You have to request this explicitly for specific columns.
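A short sketch of the last two points, using a made-up events DataFrame (the names events, when, and count are assumptions for illustration only). It uses pd.to_datetime as one way to request the date conversion explicitly, and shows that an integer beyond the 64-bit range does not end up as int64.

```python
import pandas as pd

# Date-like strings stay as text until they are converted explicitly
events = pd.DataFrame({"when": ["2007-01-01", "2007-06-15"], "count": [1, 2]})
before = events["when"].dtype
events["when"] = pd.to_datetime(events["when"])
print(before, "->", events["when"].dtype)

# A Python int far above (2**63)-1 cannot be stored as int64,
# so Pandas falls back to a different dtype for this Series
too_big = pd.Series([2**64])
print(too_big.dtype)
```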

Columns and Rows in Pandas

Now that you can load a simple dataset, you’ll also want to inspect its contents. You could do this by printing the whole DataFrame; however, most DataFrames are too large for this. A better approach is to look at only a subset of the data, similar to what we already did with df.head(), but with more control. Pandas lets you create DataFrame excerpts using Python’s existing syntax for indexing and slicing.

Extract columns

To examine individual columns in a Pandas DataFrame, you can extract them by their names, positions, or ranges. For example, if you need a specific column from your data set, you can query it by name using square brackets:

# extract the column "country" as its own Series

country_df = df["country"]

# show the first five rows

print(country_df.head())

| 0 Afghanistan

| 1 Afghanistan

| 2 Afghanistan

| 3 Afghanistan

| 4 Afghanistan

Name: country, dtype: object

# show the last five rows

print(country_df.tail())

| 1699 Zimbabwe

| 1700 Zimbabwe

| 1701 Zimbabwe

| 1702 Zimbabwe

| 1703 Zimbabwe

| Name: country, dtype: object

If you want to extract multiple columns, pass a list of the relevant column names:

# Looking at country, continent, and year

subset = df[['country', 'continent', 'year']]

print(subset.head())

country continent year

| 0 Afghanistan Asia 1952

| 1 Afghanistan Asia 1957

| 2 Afghanistan Asia 1962

| 3 Afghanistan Asia 1967

| 4 Afghanistan Asia 1972

print(subset.tail())

country continent year

| 1699 Zimbabwe Africa 1987

| 1700 Zimbabwe Africa 1992

| 1701 Zimbabwe Africa 1997

| 1702 Zimbabwe Africa 2002

| 1703 Zimbabwe Africa 2007
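The difference between the two bracket forms is easy to miss: a single column name returns a Series, while a list of names returns a DataFrame. A minimal sketch, assuming a hypothetical two-row sample built from rows shown in this article:

```python
import pandas as pd

# Hypothetical two-row sample with a few of the Gapminder columns
sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "continent": ["Asia", "Asia"],
    "year": [1952, 1957],
})

one_column = sample["country"]              # single name -> Series
two_columns = sample[["country", "year"]]   # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```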

Extract rows

If you want to extract rows from a DataFrame, Pandas offers two methods:

.iloc[] is the simplest method. It extracts rows based on their position, starting at 0. To fetch the first row in the DataFrame example above, you would use df.iloc[0]. If you want a range of rows, you can use .iloc[] in combination with Python’s slicing syntax. For example, for the first ten rows you would use df.iloc[0:10]. If you want to extract specific rows, you can also pass a list of row positions, like this: df.iloc[[0,1,2,5,7,10,12]]. Note the double brackets: they mean that you are passing a list as the first argument.
The other way to extract rows is .loc[]. This extracts a subset based on the labels of the rows. By default, rows are labeled with an increasing integer value (starting with 0). The data can also be labeled manually by setting the DataFrame’s .index property. For example, if we wanted to re-index the DataFrame above so that each row has an index in multiples of 100, we could use df.index = range(0, len(df)*100, 100). If we then used df.loc[100], we would get the second row.
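The re-indexing example can be tried on a small scale. The following sketch assumes a hypothetical four-row sample; after the labels are changed to multiples of 100, position 1 and label 100 point at the same row.

```python
import pandas as pd

# Hypothetical four-row sample; only the year column is needed here
sample = pd.DataFrame({"year": [1952, 1957, 1962, 1967]})
sample.index = range(0, len(sample) * 100, 100)  # labels 0, 100, 200, 300

by_position = sample.iloc[1]     # second row, selected by position
by_label = sample.loc[100]       # row labeled 100, which is also the second row
first_two = sample.iloc[0:2]     # positions 0 and 1
picked = sample.iloc[[0, 3]]     # a list of positions (note the double brackets)

print(by_position["year"], by_label["year"])
print(picked)
```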

Extract columns

In case you only want to retrieve a specific subset of columns along with your row slices, pass an appropriate list of columns as the second argument:

df.loc[[rows], [columns]]

For example, if we want to retrieve only the “Country” and “Year” columns for all rows from the example data set above, we do the following:

df.loc[:, ["country","year"]]

The : means “all rows” (that’s Python’s slicing syntax). The list of columns follows the comma. You can also specify columns by position using .iloc:

df.iloc[:, [0,2]]

This also works if you only need the first three columns:

df.iloc[:, 0:3]

All of these approaches can be combined, as long as you use loc for labels and column names and iloc for numeric positions. Below, we instruct Pandas to extract the rows labeled 0 through 100 (label-based slices with loc include the end label) and then, from that result, the first three columns (based on their positions):

df.loc[0:100].iloc[:, 0:3]

To avoid confusion, it’s a good idea to use actual column names when subsetting data. This makes the code easier to read, and you don’t have to go back to the data set to figure out which column corresponds to which index. It also protects you from errors when the columns are reordered.
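One detail worth checking when mixing the two accessors is how slices behave: a label slice with loc includes its end label, while a position slice with iloc stops before its end position. A small sketch, assuming a hypothetical five-row sample built from the Afghanistan rows in this article:

```python
import pandas as pd

# Hypothetical five-row sample with the default labels 0 to 4
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "continent": ["Asia"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

label_slice = sample.loc[0:2]       # labels 0, 1 and 2 -> three rows
position_slice = sample.iloc[0:2]   # positions 0 and 1 -> two rows

# Rows by label and columns by name in a single call
named = sample.loc[0:2, ["country", "year"]]

print(len(label_slice), len(position_slice))
print(named)
```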

Calculations with Pandas

Spreadsheets and libraries that work with numbers have methods for generating statistics about data. At this point let’s look at the Gapminder data again:

print(df.head(n=10))

| country continent year lifeExp pop gdpPercap

| 0 Afghanistan Asia 1952 28.801 8425333 779.445314

| 1 Afghanistan Asia 1957 30.332 9240934 820.853030

| 2 Afghanistan Asia 1962 31.997 10267083 853.100710

| 3 Afghanistan Asia 1967 34.020 11537966 836.197138

| 4 Afghanistan Asia 1972 36.088 13079460 739.981106

| 5 Afghanistan Asia 1977 38.438 14880372 786.113360

| 6 Afghanistan Asia 1982 39.854 12881816 978.011439

| 7 Afghanistan Asia 1987 40.822 13867957 852.395945

| 8 Afghanistan Asia 1992 41.674 16317921 649.341395

| 9 Afghanistan Asia 1997 41.763 22227415 635.341351

For example, we could ask the following questions about this data set:

What is the average life expectancy for each year in the data set?
How do we proceed if we want to calculate averages for years and continents?
How can we count how many countries in the dataset belong to each continent?

In order to be able to answer this with Pandas, a “grouped” or “aggregated” calculation is required. We can split the data along specific lines, apply a calculation to each split segment, and then merge the results into a new DataFrame.

Calculate grouped averages

The first method we use for this is Pandas’ df.groupby() operation. To do this, we specify a column according to which we want to divide the data:

df.groupby("year")

This allows us to treat all rows with the same year value as a separate object from the DataFrame. Given these assumptions, we can use the “Life Expectancy” (lifeExp) column and calculate its average for each included year:

print(df.groupby('year')['lifeExp'].mean())

year

1952 49.057620

1957 51.507401

1962 53.609249

1967 55.678290

1972 57.647386

1977 59.570157

1982 61.533197

1987 63.212613

1992 64.160338

1997 65.014676

2002 65.694923

2007 67.007423

This gives us the average life expectancy across all countries, broken down by year. We could run a similar calculation for population (pop) and per-capita GDP (gdpPercap):

print(df.groupby('year')['pop'].mean())

print(df.groupby('year')['gdpPercap'].mean())
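To see what the split, apply, and combine steps do, it can help to run them on numbers that are easy to check by hand. The values below are made up for illustration and are not taken from the Gapminder data.

```python
import pandas as pd

# Made-up values, chosen so the averages are easy to verify by hand
sample = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 60.0, 50.0, 70.0],
})

# Split by year, take the lifeExp column, average each group
yearly_mean = sample.groupby("year")["lifeExp"].mean()
print(yearly_mean)
```

The result is a Series indexed by year, with 50.0 for 1952 and 60.0 for 1957.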

However, if we want to group our data by more than one column, we pass the columns as a list:

print(df.groupby(['year', 'continent'])

[['lifeExp', 'gdpPercap']].mean())

lifeExp gdpPercap

year continent

1952 Africa 39.135500 1252.572466

Americas 53.279840 4079.062552

Asia 46.314394 5195.484004

Europe 64.408500 5661.057435

Oceania 69.255000 10298.085650

1957 Africa 41.266346 1385.236062

Americas 55.960280 4616.043733

Asia 49.318544 5787.732940

Europe 66.703067 6963.012816

Oceania 70.295000 11598.522455

1962 Africa 43.319442 1598.078825

Americas 58.398760 4901.541870

Asia 51.563223 5729.369625

Europe 68.539233 8365.486814

Oceania 71.085000 12696.452430

This .groupby() operation groups our data first by year and then by continent. It then computes averages for the life expectancy and GDP columns. This way you can group your data and specify how it should be presented and in what order it should be calculated.

If you want to “flatten” the results into a single, incrementally indexed frame, you can apply the .reset_index() method to the results:

gb = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()

flat = gb.reset_index()

print(flat.head())

| year continent lifeExp gdpPercap

| 0 1952 Africa 39.135500 1252.572466

| 1 1952 Americas 53.279840 4079.062552

| 2 1952 Asia 46.314394 5195.484004

| 3 1952 Europe 64.408500 5661.057435

| 4 1952 Oceania 69.255000 10298.085650
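The effect of .reset_index() is easiest to see by comparing the index before and after. The sketch below uses made-up values in a hypothetical sample: the grouped result carries a two-level index (year, continent), and the flattened result turns those levels back into ordinary columns.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957],
    "continent": ["Asia", "Asia", "Africa", "Asia"],
    "lifeExp": [40.0, 60.0, 30.0, 50.0],
})

grouped = sample.groupby(["year", "continent"])[["lifeExp"]].mean()
flat = grouped.reset_index()

print(grouped.index.nlevels)   # 2: year and continent form the index
print(list(flat.columns))      # year and continent are regular columns again
print(flat)
```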

Grouped frequency counts

Another common use case for data is frequency calculations. The nunique and value_counts methods can be used to determine the unique values in a Series and their frequencies. For example, you can find out how many countries belong to each continent (see the initial questions):

print(df.groupby('continent')['country'].nunique())

continent

Africa 52

Americas 25

Asia 33

Europe 30

Oceania 2
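The two methods answer slightly different questions: nunique counts distinct values, while value_counts counts rows per value. A minimal sketch on a made-up four-row sample (the rows are illustrative, not a real excerpt of the data set):

```python
import pandas as pd

# Made-up rows: Afghanistan appears twice, as it would for two different years
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Africa"],
    "country": ["Afghanistan", "Afghanistan", "Japan", "Zimbabwe"],
})

# Distinct countries per continent
countries_per_continent = sample.groupby("continent")["country"].nunique()

# Number of rows per continent, repeated countries included
rows_per_continent = sample["continent"].value_counts()

print(countries_per_continent)
print(rows_per_continent)
```

Here Asia has two distinct countries but three rows, which is why nunique is the right choice for the question about countries per continent.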

Visualization basics with Pandas and Matplotlib

When it comes to visualizing data, in most cases another library is used – for example Matplotlib. You can also use this library to create data visualizations directly from Pandas. To use the simple Matplotlib extension for Pandas, first make sure you have Matplotlib installed:

pip install matplotlib

Let’s take another look at the example data, specifically at the average life expectancy per year across all countries:

global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()

print(global_yearly_life_expectancy)

| year

| 1952 49.057620

| 1957 51.507401

| 1962 53.609249

| 1967 55.678290

| 1972 57.647386

| 1977 59.570157

| 1982 61.533197

| 1987 63.212613

| 1992 64.160338

| 1997 65.014676

| 2002 65.694923

| 2007 67.007423

| Name: lifeExp, dtype: float64

To create a simple visualization from this data set:

import matplotlib.pyplot as plt

global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()

c = global_yearly_life_expectancy.plot().get_figure()

plt.savefig("output.png")

The resulting plot is saved to a file named output.png in the current working directory. The axes and other labels in the plot can be set manually. This method is also good for quick exports. (fm)
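As a rough sketch of setting labels manually: .plot() returns a Matplotlib Axes object, which can be labeled before saving. The file name labeled_output.png, the label texts, and the Agg backend are assumptions for this example, and the Series holds only the first three yearly averages printed earlier in this article.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# First three yearly averages, copied from the output shown earlier
life_exp = pd.Series(
    [49.057620, 51.507401, 53.609249],
    index=pd.Index([1952, 1957, 1962], name="year"),
    name="lifeExp",
)

ax = life_exp.plot()                      # returns a Matplotlib Axes
ax.set_xlabel("Year")
ax.set_ylabel("Average life expectancy")
ax.set_title("Life expectancy over time")
ax.get_figure().savefig("labeled_output.png")
plt.close(ax.get_figure())                # release the figure once it is saved
```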

This article originally appeared at our sister publication Infoworld.com.