0

I have a large CSV with phone number prefixes, and corresponding country names. Then I have a smaller dataframe with prefixes, and corresponding country names that are spelled wrong. I need to find a way to replace the smaller df's country names with the ones from the CSV. Unfortunately, I can't simply compare csv_df['prefix'] == small_df['prefix'].

The smaller dataframe contains very long, specific prefixes ("3167777"), while the CSV has less specific prefixes ("31"). As you can imagine, a prefix like 3167777 is a subcategory of 31, and therefore belongs to the CSV's 31-group. However, if the CSV contains a more specific prefix ("316"), then the longer prefix should only belong to that group.

Dummy example data:

# CSV-df with correct country names
prefix      country
93,         Portugal
937,        Portugal Special Numbers
31,         Netherlands
316,        Netherlands Special Numbers

# Smaller df with wrong country names
prefix      country
93123,      PT
9377777,    PT
3121,       NL
31612345,   NL
31677777,   NL
31688888,   NL

Ideal result:

# Smaller df, but with correct country names
prefix      country
9312345,    Portugal
9377777,    Portugal Special Numbers
3121,       Netherlands
31612345,   Netherlands Special Numbers
31677777,   Netherlands Special Numbers
31688888,   Netherlands Special Numbers

So far, I have a very slow, non-elegant solution: I've put the smaller df's prefixes in a list, and loop through the list, constantly .

# small_df_prefixes = ['9312345', '3167', '31612345', ...]
for prefix in small_df_prefixes:
    # See if the prefix occurs in the CSV
    if prefix in [p[:len(prefix)] for p in csv_df['prefix']]:
        new_dest = csv_df.loc[csv_df['prefix'] == prefix]['destination'].values[0]
    else:
        new_dest = 'UNKNOWN'

This runs slowly and does not give me the solution I need. I've looked into panda's startswith and contains functions but can't quite figure out how to use them correctly here. I hope someone can help me out with a better solution.

4
  • Is is guaranteed that the prefixed are unique? Moreover, if they are unique, why would you have Portugal two times, both starting with 93? Wouldn't it be easier if you had a key mapping like {'PT': 'Portugal', 'NL': 'Netherlands'}? Commented Oct 6, 2021 at 20:46
  • Ah I made the dummy data a bit too dummy. Basically, there's some large prefix-categories in the CSV, every category is basically a phone provider in the country. There are so many providers that mapping them manually would be impossible... A prefix never occurs more than once in a df. However, the more provider-specific, long prefixes (316777) do have overlap with the basic country prefixes (316). Commented Oct 6, 2021 at 20:51
  • But do you need to map 93 to Portugal and 937 to Portugal Special Numbers? Commented Oct 6, 2021 at 20:53
  • Yes I do. So if I have new phone number, for example: 93777, it would be long to Portugal Special Numbers. But 93111 would belong to Portugal. Commented Oct 6, 2021 at 20:56

1 Answer 1

1

I believe that you should take a look into pandas.DataFrame.apply. It enable you to apply a function either column or row wise on a DataFrame.

The idea here is to order df_right (which has the right names) by the lenght of the prefix (the longer the more specific) and then tries to exactly match the prefix in df_wrong to a prefix in df_right.

import pandas as pd

df_wrong = pd.read_csv('path/to/wrong.csv')
# df_wrong
# prefix      country
# 9312345     PT
# 3167        NL
# ...         ...

df_right = pd.read_csv('path/to/right.csv')
# df_right
# prefix      country
# 93          Portugal
# 937         Portugal Special Numbers
# 31          Netherlands
# ...         ...

# because prefixes 93 != 937, we will sort values by their len, as 93 would contain 937
df_right.sort_values(by = 'prefix', key = lambda prefix: prefix.len(), inplace = True, ascending = False)

def correct_country(prefix):
    global df_right
    equal_match = df_right[df_right.prefix == prefix].country
    if equal_math.shape[0] == 0: # no matches
         return 'UNKNOWN'
    return equal_match[0]

df_wrong['corrected_country'] = df_wrong['prefix'].apply(correct_prefix)

Resources:

Sign up to request clarification or add additional context in comments.

4 Comments

I've tried running this but I'm getting some errors... Especially on the line: df_right.sort_values(... the lambda function is giving me a keyerror: 'prefix'
Ok after some tweaking i got it to work... However, the correct_country() function only looks at exact matches (example: 3167777 == 317777). However, I need it to look at the first n digits of the prefix: if 3167777 doesnt exist but 316 does, then I need 316777 to get 316's country label.
The line equal_match = df_right[df_right.prefix == prefix].country is the only one that needs changing. Instead of ==, I need something that says "the longest prefix in df_right that matches the first digits of the current prefix". // Ive thought of using 'startswith', but I don't know how to apply it to get the result I want (the country).
I'm sorry about the prefix thing: I was not sure if pandas would pass the entire DataFrame or just the column. I believe that, instead of using the boolean part (inside the brackets), use df_right.prefix.apply(lambda p: p.startswith(prefix)) which will return a boolean series where that column starts with the prefix you want. Then, you just need to get the lowest match (index -1) or get the first one if ascending = True. Hope this helps! Good coding!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.