I have a CSV database of tweets, which I need to search for a list of specific phrases and words. For example, I'm searching for "global warming". I want to find not only "global warming", but also "Global warming", "Global Warming", "#globalwarming", "#Globalwarming", "#GlobalWarming", etc.; in short, all the possible forms.

How could I implement regex into my code to do that? Or maybe there's another solution?

import csv

# "in" is case-sensitive, so every variant has to be listed by hand
search_terms = ["global warming", "GLOBAL WARMING"]  # etc.

with open('filedirectory.csv', 'w', newline='') as output_file, \
     open('filedirectory1.csv', 'w', newline='') as output_file2, \
     open('filedirectory2.csv', newline='') as csv_file:
    writer = csv.writer(output_file)
    writer2 = csv.writer(output_file2)
    csv_read = csv.reader(csv_file)

    for row in csv_read:
        # row[2] is the column being searched
        if any(term in row[2] for term in search_terms):
            writer.writerow(row)
        else:
            writer2.writerow(row)


  • You can sidestep upper and lower case by forcing the text to lower case, for instance text = row[2].lower() (row itself is a list, so it has no lower() method). The regex would then be something along these lines: #?global\s*warming Commented Dec 4, 2019 at 10:08
  • Building a regex that matches all the forms you gave is possible. Have a look at this website; it is very helpful. I would suggest a case-insensitive regex that makes use of optional characters (a leading # and a space between global and warming). Commented Dec 4, 2019 at 10:08
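
The pattern suggested in the comments above can be tried out in isolation. This is only a minimal sketch: the helper name mentions_global_warming and the sample strings are made up for illustration and are not part of the original question.

```python
import re

# Optional leading '#', then "global", any amount of whitespace
# (including none), then "warming". re.IGNORECASE makes the match
# independent of upper/lower case.
PATTERN = re.compile(r"#?global\s*warming", re.IGNORECASE)


def mentions_global_warming(text):
    """Return True if text contains any spelling variant of the phrase."""
    return PATTERN.search(text) is not None


# Hypothetical sample tweets, for illustration only.
samples = [
    "Global Warming is real",
    "so tired of #globalwarming debates",
    "GLOBAL   WARMING",
    "warming up the globe",
]
for sample in samples:
    print(sample, "->", mentions_global_warming(sample))
```

Note that search() looks for the pattern anywhere in the string, so longer words that contain the phrase would match as well.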

1 Answer

You can use your own code with a very simple modification:

...

for row in csv_read:
    # row is a list of fields, so lower-case the searched column, not the row itself
    text_lower = row[2].lower()
    search_terms = ["global warming", "globalwarming"]

    if any(term in text_lower for term in search_terms):
        writer.writerow(row)
    else:
        writer2.writerow(row)

If you must use a regex, or you are afraid of missing rows such as "...global(more than one space)warming...", "..global____warming..", or "..global serious warming..", you can do this instead:

...

import re

global_regex = re.compile(r'global.*?warming', re.IGNORECASE)
for row in csv_read:
    # search() expects a string, so pass the searched column rather than the whole row
    if global_regex.search(row[2]):
        writer.writerow(row)
    else:
        writer2.writerow(row)

I compiled the regex outside the loop for better performance.

Here you can see the regex in action.
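
Since the question mentions a whole list of phrases, one possible way to extend the idea is to build a single case-insensitive pattern from that list. The following is a rough sketch under assumptions: build_pattern and split_rows are hypothetical helper names, the in-memory sample data is invented, and the searched text is assumed to sit in column index 2 as in the question's code.

```python
import csv
import io
import re


def build_pattern(phrases):
    """Build one case-insensitive regex matching any phrase in the list.

    Each phrase may be preceded by '#', and the whitespace between its
    words is optional, so "global warming" also matches "#GlobalWarming".
    """
    alternatives = []
    for phrase in phrases:
        # Escape each word so regex metacharacters are taken literally.
        words = [re.escape(word) for word in phrase.split()]
        alternatives.append(r"\s*".join(words))
    return re.compile(r"#?(?:" + "|".join(alternatives) + r")", re.IGNORECASE)


def split_rows(rows, pattern, column=2):
    """Split rows into (matched, unmatched) based on the given column."""
    matched, unmatched = [], []
    for row in rows:
        # Rows that are too short to have the column count as unmatched.
        text = row[column] if len(row) > column else ""
        if pattern.search(text):
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical in-memory CSV standing in for the real file.
sample_csv = io.StringIO(
    "1,alice,Worried about #GlobalWarming today\n"
    "2,bob,Nice weather in Paris\n"
    "3,carol,CLIMATE  CHANGE is real\n"
)
pattern = build_pattern(["global warming", "climate change"])
matched, unmatched = split_rows(csv.reader(sample_csv), pattern)
print("matched:", matched)
print("unmatched:", unmatched)
```

With real files, the two lists could be passed to writer.writerows() and writer2.writerows() in place of the per-row writes. Unlike global.*?warming, this pattern does not allow extra words between "global" and "warming", which avoids some false positives but would miss "global serious warming".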

2 Comments

Thank you! Just had a chance to try it out. Complains that "expected string or bytes-like object"...
Which code snippet gives this error? Also can you paste the exact error please?
