I want to extract the leading category from each string in the title column and store it in a new column named hazard_extract, as in the example below.

import pandas as pd

test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other'], 'hazard_extract':['Other', 'Microbiological', 'Extraneous Material', 'Chemical', 'Chemical', 'Labelling']}
example = pd.DataFrame(test)
example

    title                       hazard_extract
0   Other                       Other
1   Microbiological - Listeria  Microbiological
2   Extraneous Material         Extraneous Material
3   Chemical                    Chemical
4   Chemical - Histamine        Chemical
5   Labelling, Other            Labelling

However, the code I am using (below) returns NaN when the title is a single word, and only the first word when a multi-word title has no - or , in it. How can I extract both words in a case like Extraneous Material, and the single word in a case like Chemical or Other?

example['hazard_extract'] = example['title'].str.extract(r'^(.*?),? ')
    title                       hazard_extract
0   Other                       NaN
1   Microbiological - Listeria  Microbiological
2   Extraneous Material         Extraneous
3   Chemical                    NaN
4   Chemical - Histamine        Chemical
5   Labelling, Other            Labelling
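
The NaN rows follow from the pattern itself: `^(.*?),? ` requires a literal space after the captured group, so a title with no space cannot match, and the lazy group stops at the first space it finds. A minimal sketch with Python's re module (the engine pandas' str.extract relies on) reproduces this outside pandas; the helper name first_group is made up for illustration:

```python
import re

# The pattern from the question: lazy group, optional comma, then a required space.
pattern = re.compile(r'^(.*?),? ')

def first_group(title):
    """Return the first capture group, or None when the pattern does not match."""
    match = pattern.search(title)
    return match.group(1) if match else None

titles = ['Other', 'Extraneous Material', 'Labelling, Other']
results = {title: first_group(title) for title in titles}
print(results)
```

'Other' has no space, so there is no match at all, while 'Extraneous Material' matches as soon as the first space is reached.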

Thank you so much for all the help!

  • So, if there is a dash or comma, you want the part before it; otherwise you want the full original string? Commented Mar 15, 2021 at 4:30

3 Answers

The easiest approach is to split on the dash or comma, take the first element, and strip the surrounding whitespace:

example['title'].str.split(r'[-,]').str[0].str.strip()
0                  Other
1        Microbiological
2    Extraneous Material
3               Chemical
4               Chemical
5              Labelling
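
To store the result in the hazard_extract column the question asks for, the same expression can be assigned back to the frame. A minimal sketch, reusing the question's sample data:

```python
import pandas as pd

example = pd.DataFrame({'title': ['Other', 'Microbiological - Listeria',
                                  'Extraneous Material', 'Chemical',
                                  'Chemical - Histamine', 'Labelling, Other']})

# Split on any dash or comma, keep the part before the first one,
# and trim the space left over from " - ".
example['hazard_extract'] = (
    example['title'].str.split(r'[-,]').str[0].str.strip()
)
print(example)
```

One caveat: a title that itself contains a hyphen (a hypothetical value such as 'Non-food') would also be cut at that hyphen.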

2 Comments

Maybe add .str.strip()?
I was thinking about that, but since the OP needs only the first element and the - or , comes right after the first word, I did not think it was required.

No need for a complicated regular expression:

import pandas as pd

test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other']}
example = pd.DataFrame(test)
print(example)
print()
example['hazard_extract'] = example['title'].str.split(' -|,').str[0]
print(example)
                        title
0                       Other
1  Microbiological - Listeria
2         Extraneous Material
3                    Chemical
4        Chemical - Histamine
5            Labelling, Other

                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling

Try this:

example['title'].str.extract(r'^(\w*\s*\w*)\s*[\,\-]?.*')
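
Note that this pattern captures at most two words, and for a title such as 'Microbiological - Listeria' the group keeps the space before the dash. The sketch below compares it with one possible alternative pattern (not the answerer's, and based on the assumption that everything before the first dash or comma is wanted) that captures up to the first dash, comma, or end of string without trailing whitespace:

```python
import pandas as pd

titles = pd.Series(['Other', 'Microbiological - Listeria', 'Extraneous Material',
                    'Chemical', 'Chemical - Histamine', 'Labelling, Other'])

# Pattern from the answer above; expand=False returns a Series for a single group.
original = titles.str.extract(r'^(\w*\s*\w*)\s*[\,\-]?.*', expand=False)

# Alternative (an assumption, see lead-in): lazily capture any run of characters
# other than comma or dash, then require optional whitespace followed by a
# dash, a comma, or the end of the string.
alternative = titles.str.extract(r'^([^,-]+?)\s*(?:[-,]|$)', expand=False)

print(pd.DataFrame({'original': original, 'alternative': alternative}))
```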