2

I have a pandas dataframe with several columns that labels data in a final column, for example,

df = pd.DataFrame( {'1_label' : ['a1','b1','c1','d1'],
                    '2_label' : ['a2','b2','c2','d2'],
                    '3_label' : ['a3','b3','c3','d3'],
                    'data'    : [1,2,3,4]})

df =      1_label 2_label 3_label  data
     0      a1      a2      a3     1
     1      b1      b2      b3     2
     2      c1      c2      c3     3
     3      d1      d2      d3     4

and a list of tuples,

list_t = [('a1','a2','a3'), ('d1','d2','d3')]

I want to filter this dataframe and return a new dataframe containing only the rows that correspond to the tuples in my list.

result =        1_label 2_label 3_label  data
            0      a1      a2      a3     1
            1      d1      d2      d3     4

My naive (and C++ inspired) solution was to use append (like vector::push_back)

for l1, l2, l3 in list_t:
    if df[(df['1_label'] == l1) & 
          (df['2_label'] == l2) & 
          (df['3_label'] == l3)].empty is False:
        result = result.append(df[(df['1_label'] == l1) & 
                              (df['2_label'] == l2) &
                              (df['3_label'] == l3)]

While my solution works I suspect it is horrendously slow for large dataframes and large list of tuples as I think pandas creates a new dataframe upon each call to append. Could anyone suggest a faster/cleaner way to do this? Thanks!

1
  • I think your question is probably better suited for Code Review Commented Jul 15, 2016 at 11:15

2 Answers 2

5

If I understood correctly, merge should do the job:

pd.DataFrame(list_t, columns=['1_label', '2_label', '3_label']).merge(df)
Out[73]: 
  1_label 2_label 3_label  data
0      a1      a2      a3     1
1      d1      d2      d3     4
Sign up to request clarification or add additional context in comments.

1 Comment

This is the faster solution, tbh.
4

Assuming no duplicates, you could create index out of the columns you want to "filter" on:

In [10]: df
Out[10]: 
  1_label 2_label 3_label  data
0      a1      a2      a3     1
1      b1      b2      b3     2
2      c1      c2      c3     3
3      d1      d2      d3     4

In [11]: df.set_index(['1_label', '2_label', '3_label'])\
    .loc[[('a1','a2','a3'), ('d1','d2','d3')]]\
    .reset_index()
Out[11]: 
  1_label 2_label 3_label  data
0      a1      a2      a3     1
1      d1      d2      d3     4

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.