If you want to analyze data in Python, you’ll want to become familiar with pandas, as it makes data analysis so much easier. The DataFrame is the primary data format you’ll interact with. Here’s how to make use of it.
What is pandas?
pandas is a Python module that’s popular in data science and data analysis. It offers a way to organize data into DataFrames and provides a large set of operations you can perform on that data. It was originally developed by Wes McKinney at AQR Capital Management and was open-sourced in the late 2000s.
To install pandas using PyPI:
pip install pandas
It’s best to work with pandas using a Jupyter notebook or other interactive Python session. IPython is great for casual explorations of data in the terminal, but Jupyter will save a record of your calculations, which is helpful when you return to a dataset days or weeks later and struggle to remember what you did. I’ve created my own notebook of code examples you can examine on my GitHub page. That’s where the screenshots you’ll see came from.
What is a DataFrame?
A DataFrame is the primary data structure you work with in pandas. Like a spreadsheet or a relational database table, it organizes data into rows and columns, with each column grouped under a header name. The concept is similar to data frames in R, another programming language popular in statistics and data science. DataFrame columns can hold both text and numeric data, including integers and floating-point numbers, and they can also contain time series data.
How to Create a DataFrame
Assuming you already have pandas installed, you can create a small DataFrame from other elements.
I’ll create columns representing a linear function that could be used for regression analysis later. First, I’ll create the x-axis, or the independent variable, from a NumPy array:
import numpy as np
x = np.linspace(-10,10)
Next, I’ll create the y column or dependent variable as a simple linear function:
y = 2*x + 5
Now I’ll import pandas and create the DataFrame.
import pandas as pd
As with NumPy, shortening the name of pandas will make it easier to type.
pandas’ DataFrame constructor takes a dictionary mapping column names to lists (or arrays) of the actual data. I’ll create a DataFrame named “df” with columns labeled “x” and “y.” The data will be the NumPy arrays I created earlier.
df = pd.DataFrame({'x':x,'y':y})
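If you’re working in a Jupyter notebook, evaluating df on its own line will render the table. In a plain script, a quick sanity check might look like this (a minimal sketch, not part of the original notebook):

print(df.shape)  # (50, 2) -- np.linspace produces 50 points by default
print(df)        # the full table is small enough to print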
Importing a DataFrame
While it’s possible to create DataFrames from scratch, it’s more common to import the data from another source. Because DataFrame content is tabular, spreadsheets are a popular source. The values in the top row of the spreadsheet become the column names.
To read in an Excel spreadsheet, use the read_excel method:
df = pd.read_excel('/path/to/spreadsheet.xls')
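If your workbook has more than one sheet, read_excel’s sheet_name parameter lets you pick the one you want (the file path and sheet name here are placeholders). Depending on the file format, you may also need an engine package such as openpyxl installed:

df = pd.read_excel('/path/to/spreadsheet.xls', sheet_name='Sheet1')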
Being an open-source fan, I tend to gravitate toward LibreOffice Calc rather than Excel, but I can also import other file types. The .csv format is widely used, and I can export my data in that format.
df = pd.read_csv('/path/to/data.csv')
A handy feature is the ability to copy from the clipboard. This is great for smaller datasets when I want more advanced calculations than I can get in a spreadsheet:
df = pd.read_clipboard()
Examining a DataFrame
Now that you’ve created a DataFrame, the next step is to examine the data in it.
One way to do that is to get the first five rows of the DataFrame with the head method:
df.head()
If you’ve ever used the head command on Linux or other Unix-like systems, this is similar. If you know about the tail command, there’s a similar method in pandas that gets the last five rows of a DataFrame:
df.tail()
You can use array slicing to view a precise subset of rows. As with Python lists, the slice is zero-indexed and excludes its endpoint, so this shows the rows at positions 1 and 2:
df[1:3]
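The same positional slice can be written explicitly with iloc, which makes the intent a little clearer and returns the same rows:

df.iloc[1:3]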
With the head command in Linux, you can view an exact number of lines by passing a numerical argument. You can do the same thing in pandas. To see the first 10 rows:
df.head(10)
The tail method works in a similar fashion.
df.tail(10)
More interesting is to examine existing datasets. A popular way to demonstrate this is with the dataset of passengers on the Titanic, which is available on Kaggle. Many other statistical libraries, such as Seaborn and Pingouin, let you load example datasets so you don’t have to download them. In practice, pandas DataFrames are mostly used to feed data into these libraries, such as to make a plot or calculate a linear regression.
With the data downloaded, you’ll have to import it:
titanic = pd.read_csv('data/Titanic-Dataset.csv')
Let’s look at the head again:
titanic.head()
We can also see all the columns with the columns attribute:
titanic.columns
pandas offers a lot of methods for getting info about the dataset. The describe method offers some descriptive statistics of all the numerical columns in the DataFrame.
titanic.describe()
First is the count of non-missing values in each column, followed by the mean, or average. Next is the standard deviation, a measure of how tightly the values are spread around the mean. Then come the minimum value, the lower quartile or 25th percentile, the median or 50th percentile, the upper quartile or 75th percentile, and the maximum value. Those last five values make up legendary statistician John Tukey’s “five-number summary.” You can quickly see how your data is distributed using these numbers.
To access a column by itself, call the name of the DataFrame with the name of the column in square brackets ([]).
For example, to view the column with the name of the passengers:
titanic['Name']
Because the list is so long, it will be truncated by default. To see the entire list of names, use the to_string method.
titanic['Name'].to_string()
You can also turn truncation off entirely. To display every row of a column, no matter how long it is:
pd.set_option('display.max_rows', None)
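This setting applies globally, so once you’re done it’s worth resetting it to the default:

pd.reset_option('display.max_rows')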
You can also call other methods on a selected column. To see the descriptive statistics for just one column:
titanic['Age'].describe()
You can also compute individual statistics:
titanic['Age'].mean()
titanic['Age'].median()
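The quartiles that describe reports can be reproduced the same way with the quantile method, which accepts a list of percentiles:

titanic['Age'].quantile([0.25, 0.5, 0.75])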
Adding and Deleting Columns
Not only can you examine columns, you can add new ones as well. You can add a column and populate it with values, as you would with a Python list, but you can also transform existing data and store the result in a new column.
Let’s go back to the original DataFrame we created, df. We can perform operations on every element in a column. For example, to square the x column:
df['x']**2
We can create a new column with these values:
df['x2'] = df['x']**2
To delete a column, you can use the drop method. Note that drop returns a new DataFrame rather than modifying df in place:
df.drop('x2',axis=1)
The axis argument tells pandas to operate by columns instead of rows.
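If you want the deletion to stick, reassign the result. Passing columns= is an equivalent alternative to axis=1 that reads a little more clearly:

df = df.drop(columns=['x2'])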
Performing Operations on Columns
As alluded to earlier, you can perform operations on columns, including mathematical and statistical ones.
We can add our x and y columns together:
df['x'] + df['y']
You can select multiple columns with double brackets.
To see the names and ages of the Titanic passengers:
titanic[['Name','Age']]
The column names inside the inner brackets must be separated by commas.
You can also search pandas DataFrames, similar to SQL searches. To see the rows of passengers who were older than 30 when they boarded the ill-fated liner, you can use a Boolean selection inside the brackets:
titanic[titanic['Age'] > 30]
This is like the SQL statement:
SELECT * FROM titanic WHERE Age > 30
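Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, to find female passengers over 30 (this assumes the Kaggle dataset’s 'Sex' column, which uses the values 'male' and 'female'):

titanic[(titanic['Age'] > 30) & (titanic['Sex'] == 'female')]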
You can also make these row selections by putting .loc before the brackets:
titanic.loc[titanic['Age'] > 30]
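One advantage of .loc is that it can select rows and columns in a single step; the second argument names the columns to return:

titanic.loc[titanic['Age'] > 30, ['Name', 'Age']]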
Let’s make a bar plot of where the Titanic passengers embarked. We can make our own subset of the DataFrame with the three points of embarkation, Southampton, England; Cherbourg, France; and Queenstown, Ireland (now Cobh).
embarked = titanic['Embarked'].value_counts()
This will create a new Series with the number of people who embarked at each port. But we have a problem. The labels are simply letters standing for the names of the ports. Let’s replace them with the full names of the ports. The rename method will take a dictionary of the old names and the new ones.
embarked = embarked.rename({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
With the columns renamed, we can make our bar chart. This is easy with pandas:
embarked.plot(kind='bar')
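pandas hands the plotting off to matplotlib, so that library needs to be installed. If you’re running this as a plain script rather than in a notebook, you’ll also need to display the figure explicitly (a minimal sketch):

import matplotlib.pyplot as plt

embarked.plot(kind='bar')
plt.show()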
This should help you get started exploring pandas datasets. pandas is one reason that Python has become so popular with statisticians, data scientists, and anyone who needs to explore data.