1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
DataFrame objects have a shape attribute, which reports the number of rows and columns in the DataFrame:
print(df.shape)
(1704, 6) # rows, cols
To list the names of the columns themselves, use .columns:
print(df.columns)
Index(['country', 'continent', 'year', 'lifeExp',
'pop', 'gdpPercap'], dtype='object')
DataFrames in Pandas work similarly to those in other languages, such as Julia and R. Each column (a Series) must hold values of a single type, while rows can contain mixed types. In our example, the country column is always a string and the year column is always an integer. We can check this with .dtypes, which lists the data type of each column:
print(df.dtypes)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
For an even more detailed breakdown of the types within your DataFrame, you can use .info():
df.info() # information is written to console, so no print required
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 continent 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
4 pop 1704 non-null int64
5 gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
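As a minimal sketch of the inspection steps above, the following builds a small stand-in DataFrame by hand from three Afghanistan rows listed in this article (the full Gapminder file is not loaded here) and reads back its shape, column names, and dtypes. The variable names n_rows, n_cols, column_names, and dtypes are illustrative only.

```python
import pandas as pd

# Stand-in for the Gapminder data: three rows copied from this article.
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
    "pop": [8425333, 9240934, 10267083],
    "gdpPercap": [779.445314, 820.853030, 853.100710],
})

n_rows, n_cols = df.shape                  # shape is a (rows, columns) tuple
column_names = list(df.columns)            # column labels as a plain list
dtypes = df.dtypes.astype(str).to_dict()   # column name -> dtype name

print(n_rows, n_cols)
print(column_names)
print(dtypes)
```

Note that the dtype reported for text columns can differ between Pandas versions, so it is safer not to rely on the exact name shown for country and continent.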
Each Pandas data type is mapped to a native Python counterpart:
- object is treated like a Python str.
- int64 is treated like a Python int. Note that not every Python int can be converted to int64: anything greater than (2**63)-1 will not be converted.
- float64 is treated like a Python float (which is natively 64-bit).
- datetime64 is treated like a Python datetime.datetime object. Pandas does not automatically attempt to convert values that look like dates into dates. You have to request this explicitly for specific columns.
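To illustrate the last two points, here is a small sketch under stated assumptions: the events DataFrame and its when column are hypothetical and not part of the Gapminder data. It shows that date-like strings stay as text until pd.to_datetime is applied to the column, and it checks the int64 upper bound using NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical column of date-like strings (not from the Gapminder data).
events = pd.DataFrame({"when": ["2007-01-15", "2007-02-20"]})

# Pandas leaves the strings alone on construction.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(events["when"])

# Explicit conversion for this one column.
events["when"] = pd.to_datetime(events["when"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(events["when"])

# Largest value an int64 column can hold.
int64_max = np.iinfo(np.int64).max

print(is_datetime_before, is_datetime_after)
print(int64_max == 2**63 - 1)
```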
Columns and Rows in Pandas
Now that you can load a simple dataset, you’ll also want to inspect its contents. You could do this with the print statement, but most DataFrames are too large for that. A better approach is to look at only a subset of the data, similar to what we already did with df.head(), but with more control. Pandas lets you create excerpts of a DataFrame using Python’s existing syntax for indexing and slicing.
Extract columns
To examine individual columns in a Pandas DataFrame, you can extract them by their names, positions, or ranges. For example, if you need a specific column from your data set, you can query it by name using square brackets:
# extract the column "country" (a single column comes back as a Series)
country_df = df["country"]
# show the first five rows
print(country_df.head())
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Name: country, dtype: object
# show the last five rows
print(country_df.tail())
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, dtype: object
If you want to extract multiple columns, pass a list of the relevant column names:
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]
print(subset.head())
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
print(subset.tail())
country continent year
1699 Zimbabwe Africa 1987
1700 Zimbabwe Africa 1992
1701 Zimbabwe Africa 1997
1702 Zimbabwe Africa 2002
1703 Zimbabwe Africa 2007
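One detail worth seeing in code is the difference between single and double square brackets. The sketch below assumes a small hand-built DataFrame with three rows copied from this article; the names as_series, as_frame, and subset are illustrative.

```python
import pandas as pd

# Stand-in data: three rows copied from this article.
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1957, 1962],
})

as_series = df["country"]          # single brackets: one column as a Series
as_frame = df[["country"]]         # list of one name: a one-column DataFrame
subset = df[["country", "year"]]   # list of names: a DataFrame with those columns

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(subset.columns))
```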
Extract rows
If you want to extract rows from a DataFrame, Pandas offers two methods:
- .iloc[] is the simplest method. It extracts rows based on their position, starting at 0. To fetch the first row of the example DataFrame above, you would use df.iloc[0]. To get a range of rows, combine .iloc[] with Python’s slicing syntax; for the first ten rows, you would use df.iloc[0:10]. To extract specific rows, you can also pass a list of row positions, for example df.iloc[[0,1,2,5,7,10,12]]. Note the double brackets: they mean that you are passing a list as the first argument.
- The other way to extract rows is .loc[]. This extracts a subset based on the labels of the rows. By default, the rows are labeled with an increasing integer value (starting with 0). The data can also be labeled manually by setting the DataFrame’s .index property. For example, to re-index the DataFrame above so that each row has an index that is a multiple of 100, we could use df.index = range(0, len(df)*100, 100). If we then used df.loc[100], we would get the second row.
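The re-indexing example above can be sketched as follows, assuming a four-row stand-in DataFrame built from rows listed in this article. After the index is replaced, label 100 and position 1 refer to the same row.

```python
import pandas as pd

# Stand-in data: four rows copied from this article.
df = pd.DataFrame({
    "country": ["Afghanistan"] * 4,
    "year": [1952, 1957, 1962, 1967],
    "lifeExp": [28.801, 30.332, 31.997, 34.020],
})

# Relabel the rows in multiples of 100: 0, 100, 200, 300.
df.index = range(0, len(df) * 100, 100)

second_by_label = df.loc[100]     # label-based lookup
second_by_position = df.iloc[1]   # position-based lookup
first_two = df.iloc[0:2]          # positions 0 and 1
picked = df.iloc[[0, 2]]          # a list of positions (note the double brackets)

print(second_by_label["year"])
print(list(picked.index))
```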
Extract subsets of columns
In case you only want to retrieve a specific subset of columns along with your row slices, pass an appropriate list of columns as the second argument:
df.loc[[rows], [columns]]
For example, if we want to retrieve only the “Country” and “Year” columns for all rows from the example data set above, we do the following:
df.loc[:, ["country","year"]]
The : means “all rows” (that’s Python’s slicing syntax). The list of columns follows the comma. You can also specify columns by position using .iloc:
df.iloc[:, [0,2]]
This also works if you only need the first three columns:
df.iloc[:, 0:3]
All of these approaches can be combined, as long as you use loc for labels and column names and iloc for numeric positions. Below we instruct Pandas to extract the rows labeled 0 through 100 (loc slices include the end label, so with the default index this returns 101 rows) and then keep the first three columns, selected by position:
df.loc[0:100].iloc[:, 0:3]
To avoid confusion, it’s a good idea to use actual column names when subsetting data. This makes the code easier to read, and you don’t have to go back to the data set to figure out which column corresponds to which index. It also protects you from errors when the columns are reordered.
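A point that is easy to miss when combining the two accessors is that slices behave differently: loc includes the end label, while iloc follows Python’s usual end-exclusive slicing. The sketch below assumes a five-row stand-in DataFrame built from rows listed in this article.

```python
import pandas as pd

# Stand-in data: five rows copied from this article.
df = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "continent": ["Asia"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

by_label = df.loc[0:2, ["country", "year"]]   # labels 0, 1 and 2: three rows
by_position = df.iloc[0:2, [0, 2]]            # positions 0 and 1: two rows
first_three_cols = df.iloc[:, 0:3]            # all rows, first three columns

print(len(by_label), len(by_position))
print(list(first_three_cols.columns))
```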
Calculations with Pandas
Spreadsheets and libraries that work with numbers have methods for generating statistics about data. At this point let’s look at the Gapminder data again:
print(df.head(n=10))
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
9 Afghanistan Asia 1997 41.763 22227415 635.341351
For example, we could ask the following questions about this data set:
- What is the average life expectancy for each year in the data set?
- How do we proceed if we want to calculate averages for years and continents?
- How can we count how many countries in the dataset belong to each continent?
In order to be able to answer this with Pandas, a “grouped” or “aggregated” calculation is required. We can split the data along specific lines, apply a calculation to each split segment, and then merge the results into a new DataFrame.
Calculate grouped averages
The first method we use for this is Pandas’ df.groupby() operation. To do this, we specify a column according to which we want to divide the data:
df.groupby("year")
This allows us to treat all rows with the same year value as a separate object from the DataFrame. From there, we can take the “Life Expectancy” (lifeExp) column and calculate its average for each year in the data:
print(df.groupby('year')['lifeExp'].mean())
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
This gives the average life expectancy across all countries, broken down by year. We could run a similar calculation for population (pop) and GDP per capita (gdpPercap):
print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())
However, if we want to group our data by more than one column, we pass the columns as lists:
print(df.groupby(['year', 'continent'])
      [['lifeExp', 'gdpPercap']].mean())
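To make the split, apply, and combine steps concrete, here is a minimal sketch on a toy DataFrame. The countries A, B, and C and all of their values are made up for illustration and are not Gapminder figures. It computes the per-year mean once with groupby and once by looping over the groups by hand.

```python
import pandas as pd

# Toy data with made-up values (not Gapminder figures).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "lifeExp": [30.0, 40.0, 60.0, 32.0, 44.0, 62.0],
})

# One step: group by year, select a column, take the mean.
by_year = toy.groupby("year")["lifeExp"].mean()

# The same idea spelled out: split into groups, apply mean, combine into a dict.
manual = {year: group["lifeExp"].mean() for year, group in toy.groupby("year")}

print(by_year)
print(manual)
```

Both approaches give the same numbers; the groupby form returns a Series indexed by year, which is usually the more convenient result.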
lifeExp gdpPercap
year continent
1952 Africa 39.135500 1252.572466
Americas 53.279840 4079.062552
Asia 46.314394 5195.484004
Europe 64.408500 5661.057435
Oceania 69.255000 10298.085650
1957 Africa 41.266346 1385.236062
Americas 55.960280 4616.043733
Asia 49.318544 5787.732940
Europe 66.703067 6963.012816
Oceania 70.295000 11598.522455
1962 Africa 43.319442 1598.078825
Americas 58.398760 4901.541870
Asia 51.563223 5729.369625
Europe 68.539233 8365.486814
Oceania 71.085000 12696.452430
This .groupby() operation groups our data first by year and then by continent. It then computes mean values for the life expectancy and GDP columns. This lets you group your data and control how the groups are presented and the order in which they are calculated.
If you want to “flatten” the results into a single, incrementally indexed frame, you can apply the .reset_index() method to the results:
gb = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index()
print(flat.head())
year continent lifeExp gdpPercap
0 1952 Africa 39.135500 1252.572466
1 1952 Americas 53.279840 4079.062552
2 1952 Asia 46.314394 5195.484004
3 1952 Europe 64.408500 5661.057435
4 1952 Oceania 69.255000 10298.085650
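The effect of flattening is easiest to see by comparing the index before and after. The sketch below again assumes a toy DataFrame with made-up values (not Gapminder figures). It also shows as_index=False, a groupby parameter that keeps the grouping keys as ordinary columns from the start.

```python
import pandas as pd

# Toy data with made-up values (not Gapminder figures).
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "lifeExp": [30.0, 40.0, 60.0, 32.0, 44.0, 62.0],
})

# Grouping by two columns yields a two-level (year, continent) index.
grouped = toy.groupby(["year", "continent"])[["lifeExp"]].mean()

# reset_index() moves both levels back into regular columns.
flat = grouped.reset_index()

# Alternative: ask groupby not to build the index in the first place.
flat_direct = toy.groupby(["year", "continent"], as_index=False)[["lifeExp"]].mean()

print(grouped.index.names)
print(flat)
```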
Calculate grouped frequencies
Another common task is computing frequencies. The nunique and value_counts methods determine the number of unique values in a Series and how often each value occurs, respectively. For example, you can find out how many countries belong to each continent (one of the questions posed earlier):
print(df.groupby('continent')['country'].nunique())
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
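The two methods answer different questions, which a small sketch can show. The toy data here is made up (countries A, B, and C are placeholders, not Gapminder entries), with each country appearing in two rows, as it would across two survey years.

```python
import pandas as pd

# Toy data with made-up placeholder countries; each country has two rows.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
})

# nunique: how many distinct countries per continent.
countries_per_continent = toy.groupby("continent")["country"].nunique()

# value_counts: how many rows carry each continent value.
rows_per_continent = toy["continent"].value_counts()

print(countries_per_continent)
print(rows_per_continent)
```

Because every country appears once per survey year, value_counts on the continent column counts rows rather than countries, so nunique is the right tool for the question posed here.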
Visualization basics with Pandas and Matplotlib
When it comes to visualizing data, in most cases another library is used – for example Matplotlib. You can also use this library to create data visualizations directly from Pandas. To use the simple Matplotlib extension for Pandas, first make sure you have Matplotlib installed:
pip install matplotlib
Let’s now take another look at the example data – and here at the average annual life expectancy of the world population:
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
To create a simple visualization from this data set:
import matplotlib.pyplot as plt
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
c = global_yearly_life_expectancy.plot().get_figure()
plt.savefig("output.png")
The plot is saved to a file named output.png in the current working directory. The axes and other labels in the plot can be set manually. This method also works well for quick exports. (fm)
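Setting labels manually might look like the following sketch. It assumes a Series built from the first three yearly means printed earlier in this article, uses Matplotlib’s non-interactive Agg backend so it runs without a display, and writes to a temporary directory. The title and axis label texts are illustrative choices, not part of the original example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# First three yearly means, copied from the output shown in this article.
life = pd.Series(
    [49.057620, 51.507401, 53.609249],
    index=pd.Index([1952, 1957, 1962], name="year"),
    name="lifeExp",
)

# Series.plot() returns a Matplotlib Axes, which exposes the label setters.
ax = life.plot(title="Average life expectancy by year")
ax.set_xlabel("Year")
ax.set_ylabel("Life expectancy (years)")

fig = ax.get_figure()
out_path = os.path.join(tempfile.mkdtemp(), "output.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```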
This article originally appeared at our sister publication Infoworld.com.
