443

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc.), but the only way I've found to filter rows is via normal bracket indexing:

df_filtered = df[df['column'] == value]

This is unappealing, as it requires that I assign df to a variable before being able to filter on its values. Is there something more like the following?

df_filtered = df.mask(lambda x: x['column'] == value)
  • df.query and pd.eval seem like good fits for this use case. For information on the pd.eval() family of functions, their features and use cases, please visit Dynamic Expression Evaluation in pandas using pd.eval(). Commented Dec 16, 2018 at 4:54
  • dynamic expressions disallow any interpreter context help and are often a lower level of productivity/reliability. Commented Sep 13, 2021 at 22:50

15 Answers

479

I'm not entirely sure what you want, and your last line of code does not help either, but anyway:

"Chained" filtering is done by "chaining" the criteria in the boolean index.

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

If you want to chain methods, you can add your own mask method and use that one.

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.loc['d', 'A'] = df.loc['a', 'A']

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6
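Note that pandas nowadays ships its own DataFrame.mask (which replaces values where a condition holds), so the assignment above shadows that built-in. A minimal sketch using a non-clashing name (mask_eq is an illustrative name, not a pandas API) and modern imports:

import numpy as np
import pandas as pd

def mask_eq(df, key, value):
    # Same idea as above, under a name that does not shadow pd.DataFrame.mask.
    return df[df[key] == value]

pd.DataFrame.mask_eq = mask_eq

df = pd.DataFrame(np.random.randint(0, 10, (4, 4)),
                  index=list('abcd'), columns=list('ABCD'))
df.mask_eq('A', 1).mask_eq('D', 6)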

9 Comments

Great answer! So in (df.A == 1) & (df.D == 6), is the "&" an overloaded operator in Pandas?
That is a really nice solution - I wasn't even aware that you could jury-rig methods like that in python. A function like this would be really nice to have in Pandas itself.
Indeed, import pandas as pd is common practice now. I doubt it was when I answered the question.
The answer did teach me something new, but I would prefer query() for now, as it is easier to understand later.
173

Filters can be chained using a Pandas query:

df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

Filters can also be combined in a single query:

df_filtered = df.query('a > 0 and 0 < b < 2')
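query can also refer to local Python variables by prefixing them with @; a small sketch (threshold is an illustrative name):

threshold = 0.5          # local Python variable, referenced below with '@'
df_filtered = df.query('a > @threshold and 0 < b < 2')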

5 Comments

If you need to refer to python variables in your query, the documentation says, "You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b". Note that the following are valid: df.query('a in list([1,2])'), s = set([1,2]); df.query('a in @s').
On the other hand, it looks like the query evaluation will fail if your column name has certain special characters: e.g. "Place.Name".
Chaining is what query is designed for.
@teichert you can use backticks as described in this post (stackoverflow.com/questions/59167183/…)
@KHKim Nice! It looks like that support for dotted names in backticks was added in v1.0.0.
79

The answer from @lodagro is great. I would extend it by generalizing the mask function as:

def mask(df, f):
    return df[f(df)]

Then you can do stuff like:

df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
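For completeness, this relies on the same monkey-patching step as the accepted answer; a minimal end-to-end sketch (the random DataFrame is purely illustrative):

import numpy as np
import pandas as pd

def mask(df, f):
    return df[f(df)]

pd.DataFrame.mask = mask   # note: shadows the built-in pd.DataFrame.mask

df = pd.DataFrame(np.random.randn(10, 2))
df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)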

1 Comment

A useful generalization! I wish it were integrated directly into DataFrames already!
43

Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

If all you're doing is filtering, you can also omit the .loc.
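For example, a callable works directly inside plain brackets too:

df[lambda df: df.A > 80]   # same rows as df.loc[lambda df: df.A > 80]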

Comments

40

pandas provides two alternatives to Wouter Overmeire's answer which do not require any overriding. One is .loc[.] with a callable, as in

df_filtered = df.loc[lambda x: x['column'] == value]

the other is .pipe(), as in

df_filtered = df.pipe(lambda x: x.loc[x['column'] == value])
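The two compose cleanly in a longer chain; a sketch (the column names 'column' and 'other' are illustrative):

df_filtered = (df
               .loc[lambda x: x['column'] == value]
               .pipe(lambda x: x.loc[x['other'] > 0]))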

4 Comments

This is the best answer I've found so far. This allows for easy chaining and it is completely independent of the dataframe name, while maintaining a minimal syntax check (unlike "query"). Really neat approach, thanks.
+1 This should really be the accepted answer. It's built-in to pandas and requires no monkey-patching, and is the most flexible. I would also add that you can have your callable return an iterable of indexes as well, not just a boolean series.
Great answer; if anyone needs it with two columns: pandasDF.loc[lambda n: (n['col1'] == 'value') | (n['col2'] == 'value')]
Thank you for a simple answer that works with method chaining and the additional comments on how to include multiple conditions to filter on!
18

I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

I'll add other edits to make this post more useful.

pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)

df

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

Let's use query to filter all rows where D > B

df.query('D > B')

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Which we can chain:

df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

1 Comment

Isn't this basically the same answer as stackoverflow.com/a/28159296? Is there something missing from that answer that you think should be clarified?
13

My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.

df.pipe(lambda d: d[d['column'] == value])

1 Comment

THIS is what you want if you want to chain commands such as a.join(b).pipe(lambda df: df[df.column_to_filter == 'VALUE'])
10

I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

But I found that, if you wrap each condition in (... == True) and join the criteria with a pipe, the criteria are combined in an OR condition, satisfied whenever either of them is true:

df[((df.A==1) == True) | ((df.D==6) == True)]

2 Comments

Wouldn't df[(df.A==1) | (df.D==6)] be sufficient for what you're trying to accomplish?
No, it wouldn't, because it gives boolean results (True vs. False) instead of filtering all the data that satisfy the condition, as above. Hope that makes it clear.
7

Just want to add a demonstration using loc to filter not only by rows but also by columns, along with some merits of the chained operation.

The code below can filter the rows by value.

df_filtered = df.loc[df['column'] == value]

By modifying it a bit you can filter the columns as well.

df_filtered = df.loc[df['column'] == value, ['year', 'column']]

So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example,

res =  df\
    .loc[df['station']=='USA', ['TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)

Comments

5

So the way I see it, you do two things when subsetting your data ready for analysis:

  • get rows
  • get columns

Pandas has a number of ways of doing each of these and some techniques that help get rows and columns. For new Pandas users it can be confusing as there is so much choice.

Do you use iloc, loc, brackets, query, isin, np.where, mask, etc.?

Method chaining

Now method chaining is a great way to work when data wrangling. In R, they have a simple way of doing it: you select() columns and you filter() rows.

So if we want to keep things simple in Pandas, why not use filter() for columns and query() for rows? These both return dataframes, so there is no need to mess around with boolean indexing, and no need to add df[ ] around the return value.

So what does that look like?

df.filter(['col1', 'col2', 'col3']).query("col1 == 'sometext'")

You can then chain on any other methods, like groupby, dropna(), sort_values(), reset_index(), etc.

By being consistent, using filter() to get your columns and query() to get your rows, your code will be easier to read when you come back to it after a time.

But filter can select rows?

Yes, this is true, but by default query() gets rows and filter() gets columns. So if you stick with the defaults, there is no need to use the axis= parameter.

query()

query() can be used with both and/or and &/|. You can also use the comparison operators >, <, >=, <=, ==, and !=, as well as Python's in and not in.

You can pass a list to query using @my_list.

Some examples of using query to get rows:

df.query('A > B')

df.query('a not in b')

df.query("series == '2206'")

df.query("col1 == @mylist")

df.query('Salary_in_1000 >= 100 & Age < 60 & FT_Team.str.startswith("S").values')

filter()

So filter is basically like using bracket notation df[] or df[[]], in that it uses labels to select columns. But it does more than the bracket notation.

filter has a like= parameter to help select columns with partial names.

df.filter(like='partial_name')

filter also has a regex= parameter to help with selection:

df.filter(regex='reg_string')

So to sum up, this way of working might not work for every situation, e.g. if you want to use indexing/slicing then iloc is the way to go. But it does seem to be a solid way of working and can simplify your workflow and code.
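Putting the two together in one chain, with made-up data and column names:

import pandas as pd

df = pd.DataFrame({
    'col1': ['sometext', 'other', 'sometext'],
    'col2': [3, 1, 2],
    'col3': [4.0, 5.0, 6.0],
})

res = (df
       .filter(['col1', 'col2'])        # columns
       .query("col1 == 'sometext'")     # rows
       .sort_values('col2')
       .reset_index(drop=True))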

Comments

4

If you would like to apply all of the common boolean masks, as well as a general-purpose mask, you can chuck the following in a file and then simply assign them all as follows:

pd.DataFrame = apply_masks()

Usage:

A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)  # may be chained further as necessary

It's a little bit hacky, but it can make things a little bit cleaner if you're continuously chopping and changing datasets according to filters. There's also a general-purpose filter adapted from Daniel Velkov's answer above in the gen_mask function, which you can use with lambda functions or otherwise if desired.

File to be saved (I use masks.py):

import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():

    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask

    return pd.DataFrame

if __name__ == '__main__':
    pass

Comments

3

This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don't need to download the entire repo: saving the file and doing

from where import where as W

should suffice. Then you use it like this:

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False], 
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire - or subset of a - DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

By the way, even in the case where you are just using boolean columns,

df.loc[W['cond1']].loc[W['cond2']]

can be much more efficient than

df.loc[W['cond1'] & W['cond2']]

because it evaluates cond2 only where cond1 is True.
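For intuition: .loc accepts callables (pandas >= 0.18.1), so a W-like object can simply record operations and replay them on the DataFrame it eventually receives. A minimal illustrative sketch, not the repo's actual code, with only a few operators shown:

class DeferredExpr:
    # Records operations and replays them on the DataFrame it is finally given.
    def __init__(self, fn=lambda df: df):
        self._fn = fn

    def __call__(self, df):              # .loc invokes this with the DataFrame
        return self._fn(df)

    def __getitem__(self, key):
        return DeferredExpr(lambda df: self._fn(df)[key])

    def __gt__(self, other):
        return DeferredExpr(lambda df: self._fn(df) > other)

    def __invert__(self):
        return DeferredExpr(lambda df: ~self._fn(df))

W = DeferredExpr()
df.loc[W['a'] > 2]   # works because .loc accepts callables
df.loc[~W['c']]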

DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.

Comments

3

This is unappealing, as it requires that I assign df to a variable before being able to filter on its values.

df[df["column_name"] != 5].groupby("other_column_name")

seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.

2 Comments

This makes little sense in a chain, because df now doesn't necessarily reference the output of the previous part of the chain.
@DaanLuttik: agreed, it is not chaining, but nesting. Better for you?
2

You can also leverage the numpy library for logical operations. It's pretty fast.

df[np.logical_and(df['A'] == 1, df['B'] == 6)]
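If you have more than two conditions, np.logical_and.reduce accepts a list of them (column 'C' here is illustrative):

df[np.logical_and.reduce([df['A'] == 1, df['B'] == 6, df['C'] > 0])]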

Comments

1

If you set the columns you want to search on as the index, you can use DataFrame.xs() to take a cross-section. This is not as versatile as the query answers, but it might be useful in some situations.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)

df
# Out[55]: 
#    A  B  C  D  E
# 0  0  2  2  2  2
# 1  1  1  2  0  2
# 2  0  2  0  0  2
# 3  0  2  2  0  1
# 4  0  1  1  2  0
# 5  0  0  0  1  2
# 6  1  0  1  1  1
# 7  0  0  2  0  2
# 8  2  2  2  2  2
# 9  1  2  0  2  1

df.set_index(['A', 'D']).xs((0, 2)).reset_index()
# Out[57]: 
#    A  D  B  C  E
# 0  0  2  2  2  2
# 1  0  2  1  1  0

Comments
