Pandas read csv with regex separator

Question

I have been trying to read a custom csv file like this:

6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000

I tried regex in https://regex101.com/ Following code works:

this works

# https://regex101.com/
s = "6 Rotterdam NLD Zuid-Holland 593321 "
pat = r'(\d+)\s+([\D]+)\s(\d+)\s+'

m = re.match(pat,s)
m.groups() # ('6', 'Rotterdam NLD Zuid-Holland', '593321')

I got the correct answer, but when I applied the code to pandas read_csv, somehow it failed to work.

my attempt

import numpy as np
import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """;

sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep,engine='python')
df

I get a lot of Nans, how to get only 3 columns?

Column names are: ID CITY POPULATION

similar questions

How to read the custom table in pandas which has number string number number?

You used the pattern to match (extract) text but in the pandas method you are splitting with the pattern. How many digits can there be at the start of each line? — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Apr 10, 2020 at 14:45
@WiktorStribiżew I tried sep = r'(\d+)\s+([\D]+)\s+(\d+)\s+' it gives only one column. — BhishanPoudel
– BhishanPoudel, Commented Apr 10, 2020 at 14:46
If there can be 1, 2 or 3 digits only at the start of each line, try sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)' — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Apr 10, 2020 at 14:47
@WiktorStribiżew The question was only a part of my dataset. Now, it worked for my full dataset, if you post as an answer, I will gladly accept as an answer. — BhishanPoudel
– BhishanPoudel, Commented Apr 10, 2020 at 14:49

Wiktor Stribiżew · Accepted Answer · 2020-04-10 14:50:23Z

2

You used the pattern to match (extract) text but in the pandas method you are splitting with the pattern.

If there can be 1, 2 or 3 digits only at the start of each line, use

sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

See the regex demo. You may extend it by adding more lookbehinds in the first non-capturing group.

Details

(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+ - 1+ whitespaces (\s+) that are preceded with 1 digit (\d), or 2 digits (\d{2}), or 3 digits (\d{3}) at the start of string (^)
| - or
\s+(?=\S+\s*$) - 1+ whitespaces followed with 1+ non-whitespace chars and then any trailing 0+ whitespaces before end of string.

answered Apr 10, 2020 at 14:50

Wiktor Stribiżew

631k41 gold badges502 silver badges632 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Lydia van Dyke · Accepted Answer · 2020-04-10 15:19:30Z

Just to provide an alternative solution without the use of regular expressions:

You could also do the parsing of the text file in plain Python. In some cases this may be easier to maintain than a rather complex regular expression.

For this specific format we know that the first and the last number in every row have a special meaning. So I would have used split and rsplit to pick them.

import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """

data = []
for line in StringIO(s):
    line = line.strip()
    if not line:
        continue
    id_value, line = line.split(" ", 1)
    city, population = line.rsplit(" ", 1)

    data.append((id_value, city, population))

df = pd.DataFrame(data, columns=["id", "city", "population"])
df["id"] = pd.to_numeric(df["id"])
df["population"] = pd.to_numeric(df["population"])
print(df)

I did not do any speed measurements. Depending on the file size however, speed might not be an issue at all, anyhow. But even if it is: I would have used this script to preprocess the data first (and one time only) in order to be able to use regular old pd.read_csv without additional arguments needed.

Collectives™ on Stack Overflow

Pandas read csv with regex separator

this works

my attempt

similar questions

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

this works

my attempt

similar questions

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related