11

I am new to Python and have a question regarding matching strings in a list to a column in a df.

When I run the following commands, I would like a new column named "Match" to be created, and if there is a match between the string in the list and the string in the column the value in the "Match" column and corresponding row should be True, if no match, then False. The desired outcome would be False, False, True, False, False. Since the string "Honda" is not an exact match of "Honda Civic" it should not be True. Same with "Toy" is not the exact match of "Toyota Corolla".

Creating df:

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])

I then create a list of the values I would like to search for, joined by |.

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)

Here I have tried the str.match command and are given True, True, True, False, False.

df['Match'] = df["Brand"].str.match(pattern, na=False)

Here I have created a loop using the == operator and are given False, False, False, False, False.

for i in range(0,len(pattern)):
    df['Match'] = df['Brand'] == pattern[i]

Thank you for the help!

2
  • you are looking for isin: df.Brand.isin(search_for_these_values) Commented May 16, 2019 at 13:53
  • 1
    Great, thank you for the help! Commented May 16, 2019 at 14:33

1 Answer 1

9

If need match values in list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

Solution with match is used for check substrings, so different output.

Alternative solution for match substrings with Series.str.contains and parameter na=False:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

EDIT:

For test values in substrings is possible use list comprehension with loop by values in search_for_these_values and test match by in with any for return at least one True:

df['Match'] = [any(x in z for z in search_for_these_values) 
                                if x == x 
                                else False 
                                for x in df["Brand"]]
print (df)

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
Sign up to request clarification or add additional context in comments.

4 Comments

Okay great, thank you so much for the help! Another quick question, if I wanted to display true if the entire string in search_for_these_values is within the string in the column, which command would I use. For ex. the string "Audi A4 2019" to display True since the entire "Audi A4" string is within "Audi A4 2019". The desired outcome would be False, False, True, True, False.
hmmm, more complicated like seems.
@BradKlassen - Added solution to answer.
Great thanks for being so helpful, much appreciated!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.