I'm doing text analysis. My task is to count how many times each 'bad word' from a list appears in the strings of a dataframe column. The only approach I can think of is to use .isin() or .str.contains() to check word by word, but the word list has over 40,000 entries, so the loop would be too slow. Is there a better way to do this?
Go the opposite way: split the string into words and check whether each word is in the "naughty list", keeping track of counts (and catching a lot of value errors resulting from words not being in there). This should be significantly faster if the strings aren't too long. – Lukas Thaler, Nov 14, 2019 at 14:22
Use a database. – Pedro Lobito, Nov 14, 2019 at 14:25
1 Answer
Although you said a loop might be too slow, it still seems like the most efficient way given the size of the list. I tried to keep this as simple as possible; feel free to modify the print statement to suit your needs.
text = 'Bad Word test for Terrible Word same as Horrible Word and NSFW Word and Bad Word again'
bad_words = ['Bad Word', 'Terrible Word', 'Horrible Word', 'NSFW Word']
length_list = []
for word in bad_words:
    # str.count returns the number of non-overlapping occurrences of the substring
    count = text.count(word)
    length_list.append([word, count])
print(length_list)
Output:
[['Bad Word', 2], ['Terrible Word', 1], ['Horrible Word', 1], ['NSFW Word', 1]]
Alternatively, you can print each count as a string:
for word in bad_words:
    count = text.count(word)
    print(word + ' count: ' + str(count))
Output:
Bad Word count: 2
Terrible Word count: 1
Horrible Word count: 1
NSFW Word count: 1
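The approach suggested in the comments (split each string into words once, then look each word up in the list) can be sketched roughly as follows. This is only a minimal sketch under some assumptions that are not in the question: every entry in the list is a single word, matching is case-insensitive, and a "word" is a run of letters, digits, underscores or apostrophes. The function name count_bad_words and the sample data are hypothetical. Using a set with collections.Counter avoids the value errors mentioned in the comment, and the cost per string depends on the length of the string rather than on the 40,000 list entries.

```python
import re
from collections import Counter

def count_bad_words(text, bad_set):
    """Count how often each word from bad_set occurs in text.

    Assumes bad_set holds lowercase single words (multi-word phrases
    such as 'Bad Word' would need a different tokenisation).
    """
    # Assumption: a word is a run of word characters or apostrophes.
    tokens = re.findall(r"[\w']+", text.lower())
    # Set membership is a constant-time check on average, so each
    # string is scanned once regardless of how long the list is.
    return Counter(tok for tok in tokens if tok in bad_set)

# Hypothetical sample data; build the set once and reuse it for every row.
bad_words = ['darn', 'heck', 'drat']
bad_set = frozenset(w.lower() for w in bad_words)

text = 'This darn example is a darn silly heck of a test'
counts = count_bad_words(text, bad_set)
print(dict(counts))  # {'darn': 2, 'heck': 1}
```

For a dataframe, something like df['text'].apply(lambda s: count_bad_words(s, bad_set)) should give one Counter per row, where 'text' is a placeholder column name. Words that never occur are simply absent from the result, and indexing a Counter with a missing key returns 0.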