Python Pandas compare two dataframes with a similar (string) column, where 1 df's values are substrings of the other df's values

Question

I have a large CSV with phone number prefixes, and corresponding country names. Then I have a smaller dataframe with prefixes, and corresponding country names that are spelled wrong. I need to find a way to replace the smaller df's country names with the ones from the CSV. Unfortunately, I can't simply compare csv_df['prefix'] == small_df['prefix'].

The smaller dataframe contains very long, specific prefixes ("3167777"), while the CSV has less specific prefixes ("31"). As you can imagine, a prefix like 3167777 is a subcategory of 31, and therefore belongs to the CSV's 31-group. However, if the CSV contains a more specific prefix ("316"), then the longer prefix should only belong to that group.

Dummy example data:

# CSV-df with correct country names
prefix      country
93,         Portugal
937,        Portugal Special Numbers
31,         Netherlands
316,        Netherlands Special Numbers

# Smaller df with wrong country names
prefix      country
93123,      PT
9377777,    PT
3121,       NL
31612345,   NL
31677777,   NL
31688888,   NL

Ideal result:

# Smaller df, but with correct country names
prefix      country
9312345,    Portugal
9377777,    Portugal Special Numbers
3121,       Netherlands
31612345,   Netherlands Special Numbers
31677777,   Netherlands Special Numbers
31688888,   Netherlands Special Numbers

So far, I have a very slow, non-elegant solution: I've put the smaller df's prefixes in a list, and loop through the list, constantly .

# small_df_prefixes = ['9312345', '3167', '31612345', ...]
for prefix in small_df_prefixes:
    # See if the prefix occurs in the CSV
    if prefix in [p[:len(prefix)] for p in csv_df['prefix']]:
        new_dest = csv_df.loc[csv_df['prefix'] == prefix]['destination'].values[0]
    else:
        new_dest = 'UNKNOWN'

This runs slowly and does not give me the solution I need. I've looked into panda's startswith and contains functions but can't quite figure out how to use them correctly here. I hope someone can help me out with a better solution.

Is is guaranteed that the prefixed are unique? Moreover, if they are unique, why would you have Portugal two times, both starting with 93? Wouldn't it be easier if you had a key mapping like {'PT': 'Portugal', 'NL': 'Netherlands'}? — Felipe Whitaker
– Felipe Whitaker, Commented Oct 6, 2021 at 20:46
Ah I made the dummy data a bit too dummy. Basically, there's some large prefix-categories in the CSV, every category is basically a phone provider in the country. There are so many providers that mapping them manually would be impossible... A prefix never occurs more than once in a df. However, the more provider-specific, long prefixes (316777) do have overlap with the basic country prefixes (316). — Sinceris
– Sinceris, Commented Oct 6, 2021 at 20:51
But do you need to map 93 to Portugal and 937 to Portugal Special Numbers? — Felipe Whitaker
– Felipe Whitaker, Commented Oct 6, 2021 at 20:53
Yes I do. So if I have new phone number, for example: 93777, it would be long to Portugal Special Numbers. But 93111 would belong to Portugal. — Sinceris
– Sinceris, Commented Oct 6, 2021 at 20:56

Felipe Whitaker · Accepted Answer · 2021-10-07 01:28:48Z

1

I believe that you should take a look into pandas.DataFrame.apply. It enable you to apply a function either column or row wise on a DataFrame.

The idea here is to order df_right (which has the right names) by the lenght of the prefix (the longer the more specific) and then tries to exactly match the prefix in df_wrong to a prefix in df_right.

import pandas as pd

df_wrong = pd.read_csv('path/to/wrong.csv')
# df_wrong
# prefix      country
# 9312345     PT
# 3167        NL
# ...         ...

df_right = pd.read_csv('path/to/right.csv')
# df_right
# prefix      country
# 93          Portugal
# 937         Portugal Special Numbers
# 31          Netherlands
# ...         ...

# because prefixes 93 != 937, we will sort values by their len, as 93 would contain 937
df_right.sort_values(by = 'prefix', key = lambda prefix: prefix.len(), inplace = True, ascending = False)

def correct_country(prefix):
    global df_right
    equal_match = df_right[df_right.prefix == prefix].country
    if equal_math.shape[0] == 0: # no matches
         return 'UNKNOWN'
    return equal_match[0]

df_wrong['corrected_country'] = df_wrong['prefix'].apply(correct_prefix)

Resources:

Sort Dataframe by String Lenght

edited Oct 7, 2021 at 1:28

answered Oct 6, 2021 at 21:07

Felipe Whitaker

5605 silver badges9 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Sinceris Over a year ago

I've tried running this but I'm getting some errors... Especially on the line: df_right.sort_values(... the lambda function is giving me a keyerror: 'prefix'

Sinceris Over a year ago

Ok after some tweaking i got it to work... However, the correct_country() function only looks at exact matches (example: 3167777 == 317777). However, I need it to look at the first n digits of the prefix: if 3167777 doesnt exist but 316 does, then I need 316777 to get 316's country label.

Sinceris Over a year ago

The line equal_match = df_right[df_right.prefix == prefix].country is the only one that needs changing. Instead of ==, I need something that says "the longest prefix in df_right that matches the first digits of the current prefix". // Ive thought of using 'startswith', but I don't know how to apply it to get the result I want (the country).

Felipe Whitaker Over a year ago

I'm sorry about the prefix thing: I was not sure if pandas would pass the entire DataFrame or just the column. I believe that, instead of using the boolean part (inside the brackets), use df_right.prefix.apply(lambda p: p.startswith(prefix)) which will return a boolean series where that column starts with the prefix you want. Then, you just need to get the lowest match (index -1) or get the first one if ascending = True. Hope this helps! Good coding!

Collectives™ on Stack Overflow

Python Pandas compare two dataframes with a similar (string) column, where 1 df's values are substrings of the other df's values

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related