
Suppose I have a DataFrame such as:

   col1  col2
0     1     A
1     2     B
2     6     A
3     5     C
4     9     C
5     3     A
6     5     B

And multiple lists such as:

list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

I can update the value of col2 depending on whether the value of col1 is included in a list, for example:

for i in list_1:
    df.loc[df.col1 == i, 'col2'] = 'A'

for i in list_2:
    df.loc[df.col1 == i, 'col2'] = 'B'

for i in list_3:
    df.loc[df.col1 == i, 'col2'] = 'C'

However, this is very slow. With a DataFrame of 30,000 rows, and each list containing approximately 5,000-10,000 items, it can take a long time to compute, especially compared to other pandas operations. Is there a better (faster) way of doing this?


3 Answers


You can use isin with np.select here:

df['col2'] = np.select(
    [df['col1'].isin(list_1),
     df['col1'].isin(list_2),
     df['col1'].isin(list_3)],
    ['A', 'B', 'C'],
    default=df['col2'],  # without default, rows matching no list would become 0
)

With map:

# Flatten the three lists into one {value: label} dict, then map it onto col1
d = dict(zip(map(tuple, [list_1, list_2, list_3]), ['A', 'B', 'C']))
df['col2'] = df['col1'].map({val: v for k, v in d.items() for val in k})

   col1 col2
0     1    A
1     2    A
2     6    C
3     5    C
4     9    C
5     3    B
6     5    C
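For reference, here is a self-contained version of the np.select approach on the sample data (the default= argument is an addition here, so that rows matching none of the lists keep their current value rather than becoming 0):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 6, 5, 9, 3, 5],
                   'col2': list('ABACCAB')})

list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

# One boolean mask per list; np.select picks the label of the first True mask
df['col2'] = np.select(
    [df['col1'].isin(list_1),
     df['col1'].isin(list_2),
     df['col1'].isin(list_3)],
    ['A', 'B', 'C'],
    default=df['col2'],  # keep the existing value when no list matches
)
```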


You can first convert the lists to dicts and then map them onto col1.

d1 = {k:'A' for k in list_1}
d2 = {k:'B' for k in list_2}
d3 = {k:'C' for k in list_3}

df['col2'] = (
    df.col1.apply(lambda x: d1.get(x))                      # None (missing) when x not in d1
    .combine_first(df.col1.apply(lambda x: d2.get(x)))      # fill gaps from d2
    .combine_first(df.col1.apply(lambda x: d3.get(x, x)))   # finally d3, falling back to x itself
)

If there are no duplicates across the lists, you can make it even faster by merging them into a single dict:

d = {**{k:'A' for k in list_1}, 
     **{k:'B' for k in list_2}, 
     **{k:'C' for k in list_3}}
df['col2'] = df.col1.apply(lambda x: d.get(x,x))
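As a quick sanity check on the sample data (the names loop_df and fast_df are just illustrative), the merged-dict approach can be verified against the original per-element loop:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 6, 5, 9, 3, 5],
                   'col2': list('ABACCAB')})

list_1, list_2, list_3 = [1, 2, 4], [3, 8], [5, 6, 7, 9]

# Baseline: the original loop, one .loc scan per list element
loop_df = df.copy()
for lst, label in [(list_1, 'A'), (list_2, 'B'), (list_3, 'C')]:
    for i in lst:
        loop_df.loc[loop_df.col1 == i, 'col2'] = label

# Merged dict: one O(1) lookup per row instead of a scan per list element
d = {**{k: 'A' for k in list_1},
     **{k: 'B' for k in list_2},
     **{k: 'C' for k in list_3}}
fast_df = df.copy()
fast_df['col2'] = fast_df.col1.apply(lambda x: d.get(x, x))

print(loop_df['col2'].equals(fast_df['col2']))  # True
```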

Comments

I used map at first, but if a value of col1 doesn't exist in the dict, map returns NaN; using get makes sure the value stays the same.
Awesome answer :) +1
Thank you, 3,400 times faster than my existing method :)

I would suggest putting your lists in a dictionary and iterating over it, updating conditionally:

# Create your update dictionary
col_dict = {
    "A":[1, 2, 4],
    "B":[3, 8],
    "C":[5, 6, 7, 9]
}

# Iterate and update
for key, value in col_dict.items():
    # key is the replacement label; value is the lookup list
    df["col2"] = np.where(df["col1"].isin(value), key, df["col2"])

One concern is overwriting: a row can technically match multiple lists, and with this loop the last matching key wins, which may not be the reconciliation you want.

If rows shouldn't match multiple keys, consider a dynamic-programming-style approach where a running index of "unmatched" rows is used for each iteration, updated as you proceed so that each pass scans fewer rows.
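A minimal sketch of that shrinking-"unmatched" idea, assuming first-match-wins is the desired reconciliation:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 6, 5, 9, 3, 5],
                   'col2': list('ABACCAB')})

col_dict = {"A": [1, 2, 4], "B": [3, 8], "C": [5, 6, 7, 9]}

# Track which rows are still unmatched; each pass only considers those rows,
# so the candidate set shrinks with every iteration.
unmatched = pd.Series(True, index=df.index)
for key, values in col_dict.items():
    hits = unmatched & df['col1'].isin(values)
    df.loc[hits, 'col2'] = key
    unmatched &= ~hits  # matched rows are excluded from later passes
```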

Comments

Thank you, 3,400 times faster than my existing method :)
Sweet! @Alan, how many seconds for the whole operation? Out of curiosity.
Previously, using my for-loop method on a dataset of 2,700 rows, it took 3.471 s to calculate. Using this method, it took 0.0009923 s. A massive difference, especially if the dataset has more than a few thousand rows. The other answers above all had similar speed, so I imagine they're all using the same underlying principles. Thanks again :)
