Pandas Library Notes

20 minute read

This is a summary of the most useful pandas features and is based on some of the techniques discussed in Ted Petrou’s Minimally Sufficient Pandas article.

Selecting a column of data

For this example, imagine that we have a DataFrame df with columns name, age, gender, home address and count.

To select the age column as a Series, use df['age'].

The bracket notation works even in these scenarios, where dot notation (for example, df.age) fails:

  • spaces are present in the column name, for example home address: use df['home address']
  • the column name matches a DataFrame method name, say count: use df['count']
  • the column name is stored in a variable, for example col = 'age': use df[col]
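
As a quick illustration, here is a sketch assuming the hypothetical df described above:

In : df['age']             # always works
In : df['home address']    # df.home address is a syntax error
In : df['count']           # df.count refers to the count method, not the column
In : col = 'age'; df[col]  # attribute access cannot take a variable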

Selecting rows and columns using labels (loc) or integer location (iloc)

For the following examples we are going to use a CSV file of 500 made-up US addresses downloaded from https://www.briandunning.com/sample-data/

In : df = pd.read_csv('us-500.csv')

In : df.columns
Out:
Index(['first_name', 'last_name', 'company_name', 'address', 'city', 'county',
       'state', 'zip', 'phone1', 'phone2', 'email', 'web'],
      dtype='object')

Selecting using labels

Let us first set each row's label to the corresponding value in the last_name column.

In : df.set_index('last_name', inplace=True)

To select using labels use the DataFrame loc method. For example, to select the row with label Venere use df.loc['Venere']. This will return a Series as follows.

In : df.loc['Venere']
Out:
first_name                                 Art
company_name               Chemel, James L Cpa
address                   8 W Cerritos Ave #54
city                                Bridgeport
county                              Gloucester
state                                       NJ
zip                                       8014
phone1                            856-636-8749
phone2                            856-264-4130
email                           art@venere.org
web             http://www.chemeljameslcpa.com
Name: Venere, dtype: object

To return a DataFrame instead, use the list syntax: df.loc[['Venere']].

In : df.loc[['Venere']]
Out:
          first_name         company_name               address        city  \
last_name
Venere           Art  Chemel, James L Cpa  8 W Cerritos Ave #54  Bridgeport

               county state   zip        phone1        phone2           email  \
last_name
Venere     Gloucester    NJ  8014  856-636-8749  856-264-4130  art@venere.org

                                      web
last_name
Venere     http://www.chemeljameslcpa.com

To select rows with labels Venere and Foller and return only first_name, company_name, city and state, use df.loc[['Venere','Foller'],['first_name','company_name','city','state']].

In : df.loc[['Venere','Foller'],['first_name','company_name','city','state']]
Out:
          first_name         company_name        city state
last_name
Venere           Art  Chemel, James L Cpa  Bridgeport    NJ
Foller       Donette  Printing Dimensions    Hamilton    OH

We can also use the slicing syntax, a:b, to retrieve the rows and columns that fall between the specified labels. Note that, unlike regular Python slices, label slices with loc include both endpoints. For example, to retrieve all records between Venere and Foller inclusive, and only their contact information, use df.loc['Venere':'Foller','phone1':'email'].

In : df.loc['Venere':'Foller','phone1':'email']
Out:
                 phone1        phone2                   email
last_name
Venere     856-636-8749  856-264-4130          art@venere.org
Paprocki   907-385-4412  907-921-2010   lpaprocki@hotmail.com
Foller     513-570-1893  513-549-4561  donette.foller@cox.net

Selecting using conditions

To return all records that match some criteria, say all people from Washington state (WA), and show only specific columns, for example their first name and phone number, use a boolean Series with loc, as follows.

In : df.loc[df['state']=='WA', ['first_name','phone1']]
Out:
              first_name        phone1
last_name
Bartolet            Glen  206-697-5796
Loader              Alex  253-660-7821
Hellickson        Dottie  206-540-6076
Engelberg         Johnna  425-986-7573
Heintzman         Samira  206-311-4137
Pagliuca          Laurel  509-695-5199
Hoopengardner      Ettie  509-755-5393
Biddy               Jani  206-711-6498

Selecting using row/column numbers

The DataFrame iloc method should be used to select rows and columns based on their current numeric position within the table. For instance, to select the first row use df.iloc[[0]]. To select the last row, use df.iloc[[-1]].

In : df.iloc[[0]]
Out:
          first_name       company_name             address         city  \
last_name
Butt           James  Benton, John B Jr  6649 N Blue Gum St  New Orleans

            county state    zip        phone1        phone2            email  \
last_name
Butt       Orleans    LA  70116  504-621-8927  504-845-1427  jbutt@gmail.com

                                    web
last_name
Butt       http://www.bentonjohnbjr.com

In : df.iloc[[-1]]
Out:
          first_name                company_name         address     city  \
last_name
Motley      Chauncey  Affiliated With Travelodge  63 E Aurora Dr  Orlando

           county state    zip        phone1        phone2  \
last_name
Motley     Orange    FL  32804  407-413-4842  407-557-8857

                             email                                      web
last_name
Motley     chauncey_motley@aol.com  http://www.affiliatedwithtravelodge.com

You can also use the list syntax to choose particular rows and columns. For example, to select the rows at positions 1, 3 and 5 together with the columns at positions 2, 4 and 6, use df.iloc[[1,3,5],[2,4,6]].

In : df.iloc[[1,3,5],[2,4,6]]
Out:
                       address      county    zip
last_name
Darakjy    4 B Blue Ridge Blvd  Livingston  48116
Paprocki           639 Main St   Anchorage  99501
Morasca           3 Mcauley Dr     Ashland  44805

Slicing is also possible with the iloc method. For example, the following retrieves the first five records and their contact information, found in the 8th to 10th columns. Keep in mind that positions are zero-based and the end of a slice is exclusive, so we specify 7:10 to select the columns at positions 7, 8 and 9.

In : df.iloc[:5,7:10]
Out:
                 phone1        phone2                          email
last_name
Butt       504-621-8927  504-845-1427                jbutt@gmail.com
Darakjy    810-292-9388  810-374-9840  josephine_darakjy@darakjy.org
Venere     856-636-8749  856-264-4130                 art@venere.org
Paprocki   907-385-4412  907-921-2010          lpaprocki@hotmail.com
Foller     513-570-1893  513-549-4561         donette.foller@cox.net

Loading delimited text files

Whenever you are loading delimited text files, always use the pandas read_csv function. Just take care to specify the delimiter if it is not a comma.

So for a comma delimited file, df = pd.read_csv('filename.csv') and df = pd.read_csv('filename.csv', delimiter=',') are equivalent.
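
For a file that uses a different delimiter, pass it explicitly. For instance, a tab-separated file (the filename here is hypothetical) would be loaded like this:

In : df = pd.read_csv('filename.tsv', delimiter='\t')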

Checking for NA entries

To determine which entries in a DataFrame are NaN, None or NaT, use the isna DataFrame method.

In the following example we first make a new DataFrame from the first five records of the original 500-entry DataFrame (note that df here uses the default integer index, not the last_name index set earlier). Then we set a couple of entries in the zip field to NaT. Finally, we use the loc DataFrame method to filter records conditionally with the isna method.

In : mini_df = df.loc[:4,:].copy()

In : mini_df.loc[:,'zip']
Out:
0    70116
1    48116
2     8014
3    99501
4    45011
Name: zip, dtype: int64

In : mini_df.loc[[0,2], 'zip'] = pd.NaT

In : mini_df.loc[:,'zip']
Out:
0      NaT
1    48116
2      NaT
3    99501
4    45011

In : mini_df.loc[mini_df.loc[:,'zip'].isna(), ['first_name', 'last_name', 'zip']]
Out:
  first_name last_name  zip
0      James      Butt  NaT
2        Art    Venere  NaT

If we instead want to view the entries which are not NA, we can either use the notna method or invert the isna mask with the ~ operator.

In : mini_df.loc[~mini_df.loc[:,'zip'].isna(), ['first_name', 'last_name', 'zip']]
Out:
  first_name last_name    zip
1  Josephine   Darakjy  48116
3      Lenna  Paprocki  99501
4    Donette    Foller  45011

In : mini_df.loc[mini_df.loc[:,'zip'].notna(), ['first_name', 'last_name', 'zip']]
Out:
  first_name last_name    zip
1  Josephine   Darakjy  48116
3      Lenna  Paprocki  99501
4    Donette    Foller  45011

Performing arithmetic or comparison operations

Always use the Python arithmetic and comparison operators, unless you need control over something like alignment. The following is an example where the pandas DataFrame arithmetic methods have to be used.

In : product_cost_split = pd.DataFrame(np.array([[0.4, 0.3, 0.3],
...:                                             [0.3, 0.5, 0.2],
...:                                             [0.3, 0.3, 0.4]]),
...:                                   columns=['design','material','production'],
...:                                   index=['product_a','product_b','product_c'])
...:

In : product_cost_split
Out:
           design  material  production
product_a     0.4       0.3         0.3
product_b     0.3       0.5         0.2
product_c     0.3       0.3         0.4

In : product_cost = pd.Series({'product_a': 320, 'product_b': 474, 'product_c': 281})

In : product_cost
Out:
product_a    320
product_b    474
product_c    281
dtype: int64

If we want to compute the exact cost of design, material and production for each product, we need to multiply each product row by the corresponding total cost in the Series. Using the Python multiplication operator gives a wrong and rather unexpected result. This happens because pandas aligns the Series index with the DataFrame's columns rather than with its rows.

In : product_cost_split * product_cost
Out:
           design  material  product_a  product_b  product_c  production
product_a     NaN       NaN        NaN        NaN        NaN         NaN
product_b     NaN       NaN        NaN        NaN        NaN         NaN
product_c     NaN       NaN        NaN        NaN        NaN         NaN

To get the correct results, we need to use the pandas DataFrame mul method and specify the axis, so that pandas aligns the index of the Series with the index of the DataFrame.

In : product_cost_split.mul(product_cost, axis='index')
Out:
           design  material  production
product_a   128.0      96.0        96.0
product_b   142.2     237.0        94.8
product_c    84.3      84.3       112.4
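
The comparison methods take the same axis argument. As a hedged sketch, suppose we want to flag cost components that exceed a per-product threshold (the thresholds Series below is made up for illustration):

In : thresholds = pd.Series({'product_a': 0.35, 'product_b': 0.45, 'product_c': 0.35})

In : product_cost_split.gt(thresholds, axis='index')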

Computing sum, min, max, and abs

The pandas DataFrame methods sum, min, max and abs should always be used to compute these values.

Although Python has built-in functions of the same name, their performance is much slower than the DataFrame methods. The reason is that the pandas methods run vectorised C code, while the built-in functions loop over the elements in Python.

The following example compares the performance of min and sum on a 100,000 element Series. The pandas methods are much faster, min being over 7 times faster than the Python built-in method, while sum is 5 times faster than the Python built-in method.

In : s = pd.Series(np.random.rand(10**5))

In : %timeit s.min()
866 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In : %timeit min(s)
6.74 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In : %timeit s.sum()
937 µs ± 55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In : %timeit sum(s)
4.71 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Aggregating using groupby

To summarise a DataFrame, for example, to count how many items of each type are present in a data set or to compute the minimum, maximum and median price of products, we can use the groupby and agg functions as follows.

df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})

If you need to aggregate multiple columns and for each compute various statistics, extend the dictionary in the agg function as follows.

df.groupby('grouping column').agg({'aggregating column 1': ['aggregating function 1', 'aggregating function 2'],
                                   'aggregating column 2': ['aggregating function 1', 'aggregating function 2'],
                                   'aggregating column 3': 'aggregating function'})

For the following example we will again use the us-500.csv file of 500 made-up US addresses. Let us say we want to determine the top five states by the number of people in the data set. To do this we group by state, aggregate with the count function, sort the result in descending order and keep the top five entries.

In : ppl = pd.read_csv('us-500.csv')

In : ppl.groupby('state').agg({'state':'count'}).sort_values('state', ascending=False).head(5)
Out:
       state
state
CA        72
NJ        52
NY        46
TX        32
PA        29
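
Extending this to several columns and statistics at once, a sketch (the particular columns and functions are chosen just for illustration) might look like this:

In : ppl.groupby('state').agg({'city': 'nunique', 'zip': ['min', 'max']})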

Working with a MultiIndex

Certain groupby and aggregation operations result in what is called a MultiIndex. For example, using the 500 made-up US addresses data set, we might want to group by state and city so as to count the number of individuals residing in each city. This can be done as follows.

In : ppl_city_counts = ppl.groupby(['state','city']).agg({'city':'count'})

In : ppl_city_counts.head(10)
Out:
                   city
state city
AK    Anchorage       4
      Fairbanks       2
AR    Little Rock     1
AZ    Mesa            1
      Peoria          1
      Phoenix         5
      Scottsdale      2
CA    Anaheim         1
      Bellflower      1
      Berkeley        1

As you can see, the index is now a MultiIndex composed of state and city. While this is easy to read, it requires different syntax for subset selections. For instance, to view the number of residents in the cities of the AK and AZ states only, we can use IndexSlice, as follows.

In : ppl_city_counts.loc[pd.IndexSlice[['AK','AZ'], :], :]
Out:
                  city
state city
AK    Anchorage      4
      Fairbanks      2
AZ    Mesa           1
      Peoria         1
      Phoenix        5
      Scottsdale     2

Another example could be selecting all Philadelphia residents, as follows.

In : ppl_city_counts.loc[pd.IndexSlice[:, ['Philadelphia']], :]
Out:
                    city
state city
PA    Philadelphia     8
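
As an aside, when selecting a single label from one index level, the cross-section method xs offers a terser alternative; a sketch of the same Philadelphia query:

In : ppl_city_counts.xs('Philadelphia', level='city')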

If you find this cumbersome to work with, you can always flatten the index and rename the columns to get a normal table layout with a single-level index.

In : ppl_city_counts.columns = ['residents count']

In : ppl_city_counts.reset_index(inplace=True)

In : ppl_city_counts.head(5)
Out:
  state         city  residents count
0    AK    Anchorage                4
1    AK    Fairbanks                2
2    AR  Little Rock                1
3    AZ         Mesa                1
4    AZ       Peoria                1

Filtering the single-level index DataFrame to reproduce the results of the previous two examples is then done like so.

In : ppl_city_counts.loc[ppl_city_counts.state.isin(['AK','AZ'])]
Out:
  state        city  residents count
0    AK   Anchorage                4
1    AK   Fairbanks                2
3    AZ        Mesa                1
4    AZ      Peoria                1
5    AZ     Phoenix                5
6    AZ  Scottsdale                2

In : ppl_city_counts.loc[ppl_city_counts.city.isin(['Philadelphia'])]
Out:
    state          city  residents count
294    PA  Philadelphia                8

Pivot tables and cross tabulations (contingency tables)

While there is some equivalence between groupby, pivot_table and pandas.crosstab for certain queries, it is best to stick to groupby, except in the following two scenarios, sketched after the list.

  • If you want to generate a table to compare values across columns, use pivot_table because the output is easier to compare. On the other hand, if you want to further process the data, use groupby.
  • If you want to show the frequency between two variables, say gender counts in a university across races, use the pandas.crosstab function. Furthermore, you can easily output relative frequencies by setting the parameter normalize='columns'.
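
The following are hedged sketches of both scenarios; the pivot table reuses the us-500 data loaded earlier as ppl, while the cross tabulation assumes a hypothetical uni DataFrame with gender and race columns:

In : ppl.pivot_table(index='state', columns='city', values='first_name', aggfunc='count')

In : pd.crosstab(uni['race'], uni['gender'], normalize='columns')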