Parsing string with repeating pattern with regex in Python?

Question

I read a text file line by line with a Python script. What I get is a list of strings, one string per line. I now need to parse each string into more manageable data (i.e. strings, integers).

The strings look similar to this:

"the description (number)" (e.g. "door (0)")
"the description (number|number|number)" (e.g. "window (1|22|4)")
"the description (number|number|number|number)" (e.g. "toilet (2|6|5|10)")

Now what I want is a list of split/parsed strings for each line from the text file that I can process further, for instance:

"window (1|22|4)" -> [ "window", "1", "22", "4" ]

I guess regular expressions are the best fit to accomplish this and I already managed to come up with this:

(.+)\s+\((\d+)\), which matches perfectly, for instance [ "door", "0" ] for "door (0)"

However, some items have more data to parse:

(.+)\s\((\d+)+\|, which matches only [ "window", "1" ] for "window (1|22|4)"

How can I repeat the pattern matching for the part (\d+)+\| (i.e. "1|") up to the closing parenthesis, for an arbitrary number of repetitions of this pattern? The last item to match would be an integer, which could be caught separately with (\d+)\).

Also is there a way to match either the simple or the extended case with a single regular expression?

Thanks! And have a nice weekend, everybody!

RafalS · Accepted Answer · 2019-12-07 18:11:08Z

1

Here's the regex: \w+ \((\d+\|)*\d+\). But IMO you should use a mix of regex and str.split:

import re

data = []
with open("f.txt") as f:
    for line in f:
        # Capture the word and the raw "1|22|4" part, then split the numbers on "|"
        word, numbers = re.search(r"(\w+) \(([^)]+)\)", line).groups()
        data.append((word, *numbers.split("|")))

print(data)  # e.g. [('door', '0'), ('window', '1', '22', '4')]
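
To illustrate why splitting is suggested over a purely repeated group: in Python's re module, a capturing group that repeats keeps only its last repetition. The sketch below shows that, followed by one possible single-pattern variant that also accepts multi-word descriptions and converts the numbers to integers. The names LINE_RE and parse_line are hypothetical, and the pattern is an assumption about the input format, not part of the answer above.

```python
import re

# A repeated capturing group keeps only its final repetition.
repeated = re.fullmatch(r"\w+ \((\d+\|)*\d+\)", "window (1|22|4)")
print(repeated.groups())  # ('22|',)

# Hypothetical pattern: capture the whole "1|22|4" run once, then split it.
LINE_RE = re.compile(r"(.+?)\s*\((\d+(?:\|\d+)*)\)")


def parse_line(line):
    """Return [description, int, int, ...], or None if the line does not match."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        return None
    description, numbers = match.groups()
    return [description] + [int(n) for n in numbers.split("|")]


print(parse_line("door (0)"))
print(parse_line("the description (1|22|4)"))
print(parse_line("no numbers here"))
```

Returning None for non-matching lines is one design choice; with re.search(...).groups() as in the answer, a malformed line would raise an AttributeError instead.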

edited Dec 7, 2019 at 18:11

answered Dec 7, 2019 at 18:05

RafalS


An0ther0ne · Accepted Answer · 2019-12-07 20:45:30Z

0

import re

a = [
    r'door (0)',
    r'window (1|22|4)',
    r'toilet (2|6|5|10)',
]
for i in a:
    # Each run of word characters (the description and every number) is one match
    print(re.findall(r'(\w+)', i))

Result:

['door', '0']
['window', '1', '22', '4']
['toilet', '2', '6', '5', '10']
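
One caveat with matching every \w+ run: a description with several words, such as "the description (1|22|4)" from the question's general format, would be split into separate items. A minimal sketch of one way around that, assuming the description is everything before the first "(" (the helper name split_line is hypothetical):

```python
import re


def split_line(line):
    # Assumption: everything before the first "(" is the description.
    description, _, rest = line.partition("(")
    # Only digit runs are collected from the remainder.
    return [description.strip()] + re.findall(r"\d+", rest)


print(split_line("window (1|22|4)"))
print(split_line("the description (1|22|4)"))
```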

edited Dec 7, 2019 at 20:45

answered Dec 7, 2019 at 18:08

An0ther0ne


1 Comment

user12097764 Over a year ago

\d is a subset of \w, so the second group is never matched. The code is equivalent to: for i in a: b = re.findall('(\w+)', i); print(b)

apraksim · Accepted Answer · 2019-12-23 11:05:50Z

0

Not a raw regex, but another way to extract and process this data is to use a TTP (Template Text Parser) template:

from ttp import ttp

template = """
<macro>
def process_matches(data):
    data["numbers"] = data["numbers"].split("|")
    return data
</macro>

<group name="{{ thing }}" macro="process_matches">
{{ thing }} ({{ numbers }})
</group>
"""

data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""

parser = ttp(data, template)
parser.parse()
print(parser.result(format="pprint")[0])

The above code would produce:

[   {   'door': {'numbers': ['0']},
        'toilet': {'numbers': ['2', '6', '5', '10']},
        'window': {'numbers': ['1', '22', '4']}}]
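
For comparison, a similar mapping can be built with the standard library alone. This is only a rough sketch, not a TTP equivalent: the pattern assumes single-word descriptions, and the variable names are illustrative.

```python
import re

data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""

result = {}
for line in data.splitlines():
    # Assumed line format: one word, a space, then "|"-separated digits in parentheses.
    match = re.fullmatch(r"(\w+) \(([\d|]+)\)", line.strip())
    if match:
        thing, numbers = match.groups()
        result[thing] = {"numbers": numbers.split("|")}

print(result)
```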

answered Dec 23, 2019 at 11:05

apraksim
