I have been trying to read a custom csv file like this:
6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000
I tried regex in https://regex101.com/ Following code works:
this works
# https://regex101.com/
s = "6 Rotterdam NLD Zuid-Holland 593321 "
pat = r'(\d+)\s+([\D]+)\s(\d+)\s+'
m = re.match(pat,s)
m.groups() # ('6', 'Rotterdam NLD Zuid-Holland', '593321')
I got the correct answer, but when I applied the code to pandas read_csv, somehow it failed to work.
my attempt
import numpy as np
import pandas as pd
from io import StringIO
s = """6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000 """;
sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep,engine='python')
df
I get a lot of Nans, how to get only 3 columns?
Column names are: ID CITY POPULATION
sep = r'(\d+)\s+([\D]+)\s+(\d+)\s+'it gives only one column.sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'