Python using regex to extract parts of a string in pandas column

Question

I've got a pandas df column called 'Raw' whose format is inconsistent. The strings it contains look like this:

'(1T XXX, Europe)'
'(2T YYYY, Latin America)'
'(3T ZZ/ZZZZ, Europe)'
'(4T XXX XXX, Africa)'

The only consistent features of the strings in 'Raw' are that they are wrapped in parentheses, the text inside starts with a digit, and there is a comma in the middle followed by a whitespace.

Now, I'd like to create two extra columns (Model and Region) in my dataframe:

'Model' would contain the beginning of the string, i.e. everything between the first parenthesis and the comma
'Region' would contain the end of the string, i.e. everything between the whitespace after the comma and the final parenthesis

How do I do that using regex?

Ken Wei · Accepted Answer · 2017-07-05 09:39:38Z

5

Since there's only one comma and everything is between parentheses, in your case you can use .str.split() instead, after slicing off the parentheses:

model_region = df.Raw.str[1:-1].str.split(', ', expand=True)

But if you insist on a regex (note the raw string, so the backslashes reach the regex engine unchanged):

model_region = df.Raw.str.extract(r'\((.*), (.*)\)', expand=True)

Then

df['Model'] = model_region[0]
df['Region'] = model_region[1]
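
For reference, here is a minimal self-contained sketch of both approaches. It assumes a DataFrame built from the four sample strings in the question; the way df is constructed and the names by_split and by_regex are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample strings copied from the question; building df like this is an assumption.
df = pd.DataFrame({'Raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Approach 1: slice off the surrounding parentheses, then split on ', '.
by_split = df['Raw'].str[1:-1].str.split(', ', expand=True)

# Approach 2: capture both parts with a regex (raw string keeps the backslashes intact).
by_regex = df['Raw'].str.extract(r'\((.*), (.*)\)', expand=True)

# Either result has columns 0 and 1; assign them to the new columns.
df['Model'] = by_split[0]
df['Region'] = by_split[1]
print(df)
```

On this sample data both approaches should give the same two columns, so the choice is mostly a matter of taste.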

edited Jul 5, 2017 at 9:39

answered Jul 5, 2017 at 9:32


Esteban · Accepted Answer · 2017-07-05 09:32:16Z

1

Try this: \(([^,]*), ([^)]*)\)

See: https://regex101.com/r/fCetWg/1
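
To illustrate how this pattern differs from a greedy (.*) version, here is a small sketch using only the re module. The input with a second comma is a hypothetical case that does not appear in the question; on the question's own strings both patterns should agree.

```python
import re

GREEDY = re.compile(r'\((.*), (.*)\)')
NEGATED = re.compile(r'\(([^,]*), ([^)]*)\)')

typical = '(3T ZZ/ZZZZ, Europe)'
extra_comma = '(1T XXX, Europe, West)'  # hypothetical input, not from the question

# On the question's format, both patterns capture the same groups.
same_on_typical = GREEDY.search(typical).groups() == NEGATED.search(typical).groups()

# With a second comma, the greedy pattern splits at the LAST ', ',
# while the negated character class splits at the FIRST comma.
greedy_parts = GREEDY.search(extra_comma).groups()
negated_parts = NEGATED.search(extra_comma).groups()

print(same_on_typical)  # True
print(greedy_parts)     # ('1T XXX, Europe', 'West')
print(negated_parts)    # ('1T XXX', 'Europe, West')
```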

answered Jul 5, 2017 at 9:32


K. Kirsz · Accepted Answer · 2017-07-05 09:35:33Z

0

import re

s = '(3T ZZ/ZZZZ, Europe)'
# Capture the text before and after the ', ' inside the parentheses.
m = re.search(r'\((.*), (.*)\)', s)
print(m.groups())  # ('3T ZZ/ZZZZ', 'Europe')

answered Jul 5, 2017 at 9:35


Sudarshan shenoy · Accepted Answer · 2017-07-05 09:39:39Z

0

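import re
s = '(3T ZZ/ZZZZ, Europe)'
Model = re.findall(r"(?<=\().+(?=,)", s)
Region = re.findall(r"(?<=, ).+(?=\))", s)

The first regex matches the text that is preceded by an opening parenthesis "(" and followed by a comma ",". The second regex matches the text that is preceded by ", " and followed by a closing parenthesis ")". Note that re.findall returns a list of matches rather than a single string.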
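
Since re.findall returns a list, a short sketch of how one might pull out single values may help. The fallback to None for strings that do not match is my own assumption about the desired behaviour, not something stated in the answer.

```python
import re

s = '(3T ZZ/ZZZZ, Europe)'

# Lookbehind/lookahead assert the delimiters without including them in the match.
model_matches = re.findall(r"(?<=\().+(?=,)", s)
region_matches = re.findall(r"(?<=, ).+(?=\))", s)

# Take the first match if there is one, otherwise fall back to None (assumed behaviour).
model = model_matches[0] if model_matches else None
region = region_matches[0] if region_matches else None

print(model_matches, region_matches)  # ['3T ZZ/ZZZZ'] ['Europe']
print(model, region)                  # 3T ZZ/ZZZZ Europe
```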

answered Jul 5, 2017 at 9:39


Akshay Kandul · Accepted Answer · 2017-07-05 09:42:53Z

0

import pandas as pd

string_list = ['(1T XXX, Europe)',
'(2T YYYY, Latin America)',
'(3T ZZ/ZZZZ, Europe)',
'(4T XXX XXX, Africa)']
df = pd.DataFrame(string_list)
# Two capture groups, so the result is a DataFrame with columns 0 and 1.
df = df[0].str.extract(r"\(([^,]*), ([^)]*)\)", expand=False)
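
As a possible variation, the same pattern can use named groups, in which case str.extract uses the group names as column names. This is only a sketch: the group names Model and Region are taken from the question, and the variable names raw, parts and df are illustrative.

```python
import pandas as pd

raw = pd.Series(['(1T XXX, Europe)',
                 '(2T YYYY, Latin America)',
                 '(3T ZZ/ZZZZ, Europe)',
                 '(4T XXX XXX, Africa)'], name='Raw')

# Named groups (?P<name>...) become the column names of the extracted DataFrame.
parts = raw.str.extract(r'\((?P<Model>[^,]*), (?P<Region>[^)]*)\)')

# Keep the original column next to the two new ones.
df = pd.concat([raw, parts], axis=1)
print(df)
```

This avoids a separate rename step after the extraction.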

answered Jul 5, 2017 at 9:42


felix the cat · Accepted Answer · 2017-07-05 09:47:09Z

0

If the comma is a reliable separator of your string parts, then you do not need a regexp. If df is your dataframe (splitting on ', ' so that Region has no leading space):

df['Model'] = [x.split(', ')[0].replace('(', '') for x in df['Raw']]
df['Region'] = [x.split(', ')[1].replace(')', '') for x in df['Raw']]

If you want to use a regexp, it would look something like this (the negated character classes also accept values containing '/', such as 'ZZ/ZZZZ'):

import re
s = '(1T XXX, Europe)'
m = re.match(r'\(([^,]+), ([^)]+)\)', s)
model = m.group(1)
region = m.group(2)
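
The no-regex idea can also be sketched with str.partition, which splits at the first occurrence of the separator only. The helper name split_raw is hypothetical, and the sketch assumes every string is wrapped in exactly one pair of parentheses, as in the question.

```python
def split_raw(raw):
    """Split '(model, region)' into (model, region) without using a regex."""
    inner = raw.strip()[1:-1]                 # drop the surrounding parentheses
    model, _, region = inner.partition(', ')  # split at the first ', ' only
    return model, region


samples = ['(1T XXX, Europe)',
           '(2T YYYY, Latin America)',
           '(3T ZZ/ZZZZ, Europe)',
           '(4T XXX XXX, Africa)']
results = [split_raw(x) for x in samples]
print(results)
```

With a dataframe, the two values could then be assigned from a list comprehension over df['Raw'], in the same way as above.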

answered Jul 5, 2017 at 9:47


Karn Kumar · Accepted Answer · 2021-07-12 18:08:57Z

You can simply try the following:

Sample DataFrame:

df
                        raw
0          (1T XXX, Europe)
1  (2T YYYY, Latin America)
2      (3T ZZ/ZZZZ, Europe)
3      (4T XXX XXX, Africa)

Solution 1:

using str.extract with regex.

df = df.raw.str.extract(r'\((.*), (.*)\)').rename(columns={0:'Model', 1:'Region'})
print(df)
        Model         Region
0      1T XXX         Europe
1     2T YYYY  Latin America
2  3T ZZ/ZZZZ         Europe
3  4T XXX XXX         Africa

Solution 2:

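str.replace() + str.split() with rename. Pass regex=True so the character class is treated as a pattern, and split on ', ' so Region has no leading space.

df = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True).rename(columns={0:'Model', 1:'Region'})
print(df)
        Model         Region
0      1T XXX         Europe
1     2T YYYY  Latin America
2  3T ZZ/ZZZZ         Europe
3  4T XXX XXX         Africa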

Note:

However, if you want to retain the original column as well, you can use the method below:

df[['Model', 'Region']] = df.raw.str.replace(r'[()]', '', regex=True).str.split(', ', expand=True)

print(df)
                        raw       Model         Region
0          (1T XXX, Europe)      1T XXX         Europe
1  (2T YYYY, Latin America)     2T YYYY  Latin America
2      (3T ZZ/ZZZZ, Europe)  3T ZZ/ZZZZ         Europe
3      (4T XXX XXX, Africa)  4T XXX XXX         Africa

OR

df[['Model', 'Region' ]] = df.raw.str.extract(r'\((.*), (.*)\)')
print(df)
                        raw       Model         Region
0          (1T XXX, Europe)      1T XXX         Europe
1  (2T YYYY, Latin America)     2T YYYY  Latin America
2      (3T ZZ/ZZZZ, Europe)  3T ZZ/ZZZZ         Europe
3      (4T XXX XXX, Africa)  4T XXX XXX         Africa
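
Another possible variant, sketched here under the assumption that the parentheses only ever appear at the two ends of each string: str.strip('()') removes them without a regex, and n=1 limits the split to the first ', '.

```python
import pandas as pd

# Sample data from the question; the column is named 'raw' as in this answer.
df = pd.DataFrame({'raw': ['(1T XXX, Europe)',
                           '(2T YYYY, Latin America)',
                           '(3T ZZ/ZZZZ, Europe)',
                           '(4T XXX XXX, Africa)']})

# Strip the enclosing parentheses, then split once on ', ' into two columns.
df[['Model', 'Region']] = df['raw'].str.strip('()').str.split(', ', n=1, expand=True)
print(df)
```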
