python pandas Data Science

Pandas: Filtering and segmenting

This article is part of a series of practical guides for using the Python data processing library pandas. To see view all the available parts, click here.

One of the most common ways you will interact with a pandas DataFrame is by selecting different combinations of columns and rows. This can be done using the numerical positions of columns and rows in the DataFrame, column names and row indices, or by filtering the rows by applying some criteria to the data in the DataFrame. All of these options (and combinations of them) are available, so let’s dig in!

Reading in a dataset

If you don’t have a dataset you want to play around with, University of California Irvine has an excellent online repository of datasets that you can play with. For this explainer we are going to be using the Wine Quality dataset. If you want to follow along, you can import the dataset as follows:

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

Selecting columns

Selecting columns in pandas is about as straight forward as it gets, but there are a few options worth covering. First, let’s keep it simple:

df['quality']

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64

By putting the name of the column we want to select inside square brackets and quotes (‘ and ” both work), we return a pandas Series. There is an alternative which returns the same results:

df.quality

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64

However, I recommend you do not use dot notation for several reasons:

  1. It will only work if the column name does not have any special characters. If, for example, your column name has a space in it (e.g. 'fixed acidity'), you won’t be able to use dot notation.
  2. If your column name is the same as a DataFrame method name (e.g. you have a column called ‘sum’), using the dot notation will call the method instead of selecting the column.
  3. In more complicated scenarios you will often need to select columns dynamically. That is you will have column names that you store to a variable, then use that variable to access the column. This only works with the bracket notation.

What if we want to select multiple columns? Instead of passing one column name, we are going to pass a list of column names:

df[['fixed acidity', 'pH', 'quality']]

      fixed acidity    pH  quality
0               7.4  3.51        5
1               7.8  3.20        5
2               7.8  3.26        5
3              11.2  3.16        6
4               7.4  3.51        5
...             ...   ...      ...
1594            6.2  3.45        5
1595            5.9  3.52        6
1596            6.3  3.42        6
1597            5.9  3.57        5
1598            6.0  3.39        6

[1599 rows x 3 columns]

You might notice that we now have double square brackets. To understand why we need them, let’s reorganize our code a little:

columns_to_extract = ['fixed acidity', 'pH', 'quality']
df[columns_to_extract]

As this hopefully makes clearer, one set of square brackets is the syntax for selecting columns from a DataFrame in pandas, the other set is needed to create the list of column names.

What if we don’t know the column names or just want to select columns based on their position in the DataFrame rather than their name? Here we can use iloc or “integer-location”:

df.iloc[:, 0]

0        7.4
1        7.8
2        7.8
3       11.2
4        7.4
        ... 
1594     6.2
1595     5.9
1596     6.3
1597     5.9
1598     6.0
Name: fixed acidity, Length: 1599, dtype: float64

To explain this syntax, first let’s understand what iloc does. iloc is a method for selecting rows and columns in a DataFrame, based on their zero-indexed integer location in the DataFrame. That is, the first column (counting from left to right) will be column 0, the second column will be 1, and so on. The same applies to the rows (counting from top to bottom), the first row is row 0, second is row 1 and so on. The full syntax for iloc is:

df.iloc[<row numbers>, <column numbers>]

When we pass ":" to iloc before the "," as we did in the example above, we tell iloc to return all rows. If we passed ":" after the "," we would return all columns.

Now let’s combine a couple of these techniques. We can also pass lists of numbers for both the rows and columns to iloc. Before scrolling down, see if you can guess what the following will return:

df.iloc[[0, 1], [0, 1, 2]]

If you guessed the first two rows for the first three columns, well done!

   fixed acidity  volatile acidity  citric acid
0            7.4              0.70          0.0
1            7.8              0.88          0.0

Selecting rows

When it comes to selecting or filtering rows in a DataFrame, there are typically two scenarios:

  1. We have a list of rows we want to select based on the index; or
  2. We want to filter based on the values in one or more columns.

By name

Let’s start with the less common use case as it is the simpler one to understand. Every DataFrame by default has an index. This index works like column names for rows: we can use it to select a row or a selection of rows. But first let’s look at the index for the DataFrame from earlier:

df.index

RangeIndex(start=0, stop=1599, step=1)

This output tells us that our index is just a list of sequential numbers from 0 to 1598. This aligns exactly with the zero-indexes numbered rows we were using with iloc earlier. However, to show they are different things, let’s create a new column called id that will be the current index plus 10, and then set that column as the index:

df['id'] = df.index + 10
df.set_index('id', inplace=True)

df.index

Int64Index([  10,   11,   12,   13,   14,   15,   16,   17,   18,   19,
            ...
            1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608], dtype='int64', name='id', length=1599)

Now we can use this new index to select the first row of the DataFrame, which now has the index 10:

df.loc[10]

fixed acidity            7.4000
volatile acidity         0.7000
citric acid              0.0000
residual sugar           1.9000
chlorides                0.0760
free sulfur dioxide     11.0000
total sulfur dioxide    34.0000
density                  0.9978
pH                       3.5100
sulphates                0.5600
alcohol                  9.4000
quality                  5.0000
Name: 10, dtype: float64

We are now selecting the first row by index. You can also test by trying to select row 0 to confirm that there is no longer a row with that index.

Now let’s talk a little about loc. loc is the named equivalent to iloc, meaning that instead of passing it lists of zero-indexed row and column numbers, we pass it the names of columns and the indices of rows:

df.loc[<row indices>, <column names>]

In practice, this ends up being much more useful than iloc. In fact, most experienced pandas users wouldn’t even have to take their socks off to count the number of times they’ve used iloc.

Something that is common to loc and iloc is that if we just want to select some rows for all columns, we don’t actually need to pass the columns at all. In other words df.loc[[10, 11, 12]] is the same as df.loc[[10, 11, 12], :].

By value

A very common use case is that we need to select a subset of rows based on the values in one or more columns. The way we do that is syntactically simple, but also very powerful once you understand what it is actually doing. Let’s start by looking at the syntax:

df[df['fixed acidity'] > 12]

This line of code will return all rows in the DataFrame where the value in the “fixed acidity” column is greater than 12. But why do we need to repeat df? Let’s reorganize our code again for a bit of clarity:

filter = df['fixed acidity'] > 12
df[filter]

The key to understanding what the code is doing is understanding what filter in the above code is. When we run the first line of code above, filter becomes a Series (basically a list with some metadata) that has the same length as the DataFrame (i.e. one value for every row), and each value in that Series is either True or False. For our example, the value will be True where fixed acidity is greater than 12, and False otherwise. When we pass that list of True and False values to the DataFrame (or to loc), it will return the rows with a True value.

Why is this important to understand? Because it means you aren’t limited to generating a list of True and False values using the columns of the DataFrame you are working with. For example, I can generate a list of True and False values based on arbitrary things like whether the row number (not the index) is divisible by 3 (or in other words, selecting every third row):

every_3rd_row = [i % 3 == 0 for i in range(len(df))]
df[every_3rd_row]

Obviously this is not something that you are likely to see used in practice, but the point is simply to show that once you understand that filters are just lists of True and False values, you are free to generate that list any way you want/need to.

Filtering a DataFrame for multiple conditions

What if we want to filter the DataFrame for multiple conditions? To do that, we are going going to need the following characters:

CharacterMeaning
&AND condition
|OR condition
~negation

Let’s look at an example:

filter = ((df["fixed acidity"] > 12) & (df["volatile acidity"] < 0.3)) | (df["quality"] == 3)
df[filter]

Using the same structure as before, separating out the filter from the line where we apply the filter to the DataFrame, we are now filtering for three conditions – “fixed acidity” has to be greater than 12 and “volatile acidity” has to be less than 0.3; or “quality” has to be equal to 3. Note that you can keep adding more and more conditions in the same way, you just need ensure to wrap each individual condition in parentheses “()”, and also use parentheses to specify how you want to group the conditions when you use an or condition.

There are lots of more advanced methods for creating these True/False series for filtering, but that is a subject for a separate explainer.

Selecting rows and columns

The above describes all the essential building blocks for how to filter and segment a DataFrame. We can select specific columns and we can filter the rows using multiple criteria. Now we just have to put it all together. This is where loc really shows its value; loc doesn’t just work with row indices, it also works with those lists of True and False values. So now we can combine those filters and a selection of column names:

filter = ((df["fixed acidity"] > 12) & (df["volatile acidity"] < 0.3)) | (df["quality"] == 3)
df.loc[filter, ["density", "pH"]]

We also aren’t limited to just selecting these values, we can also use these selections to overwrite values. For example, let’s imagine I really like sweet red wines. Before I show these ratings to someone who is going to buy wines for me, I want to assign a quality of 9 to the wines with residual sugar in the the top 1%. Let’s see how this could be done:

cutoff = df["residual sugar"].quantile(0.99)
df.loc[df["residual sugar"] >= cutoff, "quality"] = 9
df.sort_values(["residual sugar", "quality"], ascending=False).head(20)

In the first line, we work out the cutoff point in terms of residual sugar and assign that value to a variable called cutoff. We use the loc method to filter the rows and select the column we want to update, then assign the new higher rating to everything that meets the criteria. Finally, we use sort_values to confirm it worked.

Wrapping up

In this explainer we have looked at a range of ways to filter the rows and columns of a DataFrame. This includes the loc and iloc methods, and how a list of True and False values, however it is created, can be used to filter the rows in a DataFrame. We also looked at how we can filter a DataFrame based on very complex criteria using combinations of simple building blocks and and (&), or (|) and negation (~) operators. With these tools, you will be able to filter and segment a DataFrame in practically anyway you are likely to need.

brettromero

Leave a Reply

Your email address will not be published.