
I am trying to check if a string is in a Pandas column. I tried doing it two ways but they both seem to check for a substring.

itemName = "eco drum ecommerce"
words = self.itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]

I also tried it this way, but this also checks for a substring:

words = self.itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]

The word was this: "eco drum".

Then I did this:

words = self.itemName.split(" ")
words = '|'.join(words)

To end up with this:

eco|drum

This is the "word" column (screenshot not included).

Is it possible to do it this way without matching substrings? Thank you.

1 Answer

You have the right idea. Series.str.contains treats its pattern as a regular expression by default (regex=True), so all you need to do is anchor the pattern to the start and end of the string, e.g. "ball" becomes "^ball$".

import pandas as pd

# "^ball$" matches only values that are exactly "ball" (ignoring case)
df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
print(df.loc[df['key'].str.contains("^ball$", case=False)])
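
To see what the anchors change, here is a small sketch that runs the same data through the pattern with and without them. The names unanchored and anchored are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": ["largeball", "ball", "john", "smallball", "Ball"]})

# Without anchors, "ball" may match anywhere inside the value.
unanchored = df.loc[df["key"].str.contains("ball", case=False), "key"].tolist()

# With ^ and $, the whole value has to be "ball" (ignoring case).
anchored = df.loc[df["key"].str.contains("^ball$", case=False), "key"].tolist()

print(unanchored)
print(anchored)
```

The first list should keep every value that contains "ball", while the second keeps only "ball" and "Ball".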

Referring more specifically to your question, since you want to search for multiple words, you will have to create the regex pattern to give to contains.

# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])

The code words = "|".join("^{}$".format(word) for word in words) uses a generator expression to wrap each word in ^ and $ before joining them with |. Given ['eco', 'drum'] it produces this pattern: ^eco$|^drum$.
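
If the search words might contain regex metacharacters (for example "c++"), a slightly more defensive variant could escape each word first. This is only a sketch: exact_word_pattern is a hypothetical helper name, not a pandas function, and the sample data is made up for illustration.

```python
import re
import pandas as pd

def exact_word_pattern(phrase):
    """Build an anchored alternation such as ^(?:eco|drum)$ from a phrase."""
    # split() with no argument also copes with repeated spaces.
    # re.escape makes characters like + or . match literally.
    escaped = [re.escape(w) for w in phrase.split()]
    # A non-capturing group lets one pair of anchors cover every word.
    return "^(?:{})$".format("|".join(escaped))

# Illustrative data only
df = pd.DataFrame({"word": ["ecommerce", "eco", "drum", "Drum", "c++"]})

pattern = exact_word_pattern("eco drum")
matches = df.loc[df["word"].str.contains(pattern, case=False), "word"].tolist()

print(pattern)
print(matches)
```

Grouping the alternatives as ^(?:eco|drum)$ is equivalent to ^eco$|^drum$ for this purpose; it just avoids repeating the anchors for every word.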


5 Comments

Hey @the-realtom, I'm not at my desktop right now, so I will try it when I get home. So you are saying that in this situation, where the regex pattern is a variable, I would do something like df = df.loc[df['word'].str.contains("^words$", case=False)]? Thank you, this seems to be the right track.
Hey @the-realtom, I tried something like this, but the new pandas dataframe is empty: df = df.loc[df['word'].str.contains('^words$', case=False)]
I updated my answer. I assume words is a string containing a single word?
Hey @the-realtom, the word is something like "eco drum". Then I did words = self.itemName.split(" ") and words = '|'.join(words) to end up with eco|drum. Is it possible this way? Thank you, I will add it to my original message to make it clearer.
My answer has been updated, let me know if you have any questions.
