I read a text file line by line with a Python script. What I get is a list of strings, one string per line. I now need to parse each string into more manageable data (i.e. strings, integers).

The strings look similar to this:

  • "the description (number)" (e.g. "door (0)")
  • "the description (number|number|number)" (e.g. "window (1|22|4)")
  • "the description (number|number|number|number)" (e.g. "toilet (2|6|5|10)")

Now what I want is a list of split/parsed strings for each line from the text file that I can process further, for instance:

  • "window (1|22|4)" -> [ "window", "1", "22", "4" ]

I guess regular expressions are the best fit to accomplish this and I already managed to come up with this:

  • (.+)\s+\((\d+)\), which perfectly matches for instance [ "door", "0" ] for "door (0)"

However, some items have more data to parse:

  • (.+)\s\((\d+)+\|\), which matches only [ "window", "1" ] for "window (1|22|4)"

How can I repeat the pattern matching for the part (\d+)+\| (i.e. "1|") up to the closing parenthesis, for an undefined number of repetitions of this pattern? The last item to match would be an integer, which could be caught separately with (\d+)\).

Also, is there a way to match either the simple or the extended case with a single regular expression?

Thanks! And have a nice weekend, everybody!

3 Answers

1

Here's the regex: \w+ \((\d+\|)*\d+\). But IMO you should use a mix of regex and str.split:

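import re

data = []
with open("f.txt") as f:
    for line in f:
        # Capture the word and everything between the parentheses
        word, numbers = re.search(r"(\w+) \(([^)]+)\)", line).groups()
        # Split the captured "1|22|4" part into its individual numbers
        data.append((word, *numbers.split("|")))

print(data) # [('door', '0'), ('window', '1', '22', '4')]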
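To see why the split is suggested instead of relying on the repeated group alone, here is a small sketch. It takes the pattern from the top of this answer and adds capture groups purely for illustration (the variable name `pattern` and the extra groups are my additions, not part of the original regex). In Python's `re` module, a group inside a repetition only keeps the text of its last repetition:

```python
import re

# The answer's pattern, with capture groups added for illustration only.
pattern = re.compile(r"(\w+) \((?:(\d+)\|)*(\d+)\)")

for line in ["door (0)", "window (1|22|4)"]:
    print(pattern.search(line).groups())

# Prints:
# ('door', None, '0')
# ('window', '22', '4')
```

The "1" from "window (1|22|4)" is lost because the repeated group was overwritten by "22", so capturing the whole parenthesised part once and splitting it on "|" is the simpler route.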

0
import re
a = [r'door (0)',
    r'window (1|22|4)',
    r'toilet (2|6|5|10)'
]
for i in a:
    # Raw string avoids the invalid-escape warning for \w
    print(re.findall(r'(\w+)', i))

Result:

['door', '0']
['window', '1', '22', '4']
['toilet', '2', '6', '5', '10']
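One caveat, as a hedged aside: `\w+` with `findall` treats every word as a separate item, so a multi-word description such as "the description (0)" would come back as three items instead of two. Below is a minimal sketch of one way to keep the description in one piece. The names `LINE_RE` and `parse_line` are hypothetical, and the pattern assumes each line is a description followed by "|"-separated integers in parentheses, as in the question's samples:

```python
import re

# Assumed line shape: "<description> (<int>|<int>|...)".
LINE_RE = re.compile(r"(.+?)\s*\((\d+(?:\|\d+)*)\)")

def parse_line(line):
    """Return [description, number, ...] or None if the line does not fit."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        return None
    description, numbers = match.groups()
    # The numbers stay strings, matching the output the question asks for.
    return [description, *numbers.split("|")]

for line in ["door (0)", "window (1|22|4)", "the description (2|6|5|10)"]:
    print(parse_line(line))
```

Lines that do not fit the assumed shape return None rather than raising, so the caller can decide whether to skip or report them.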

1 Comment

\d is a subset of \w, so the second group is never matched. The code is equivalent to: for i in a: b = re.findall('(\w+)', i); print(b)

0

Not a raw regex, but another way to extract and process this data is to use a TTP template:

from ttp import ttp

template = """
<macro>
def process_matches(data):
    data["numbers"] = data["numbers"].split("|")
    return data
</macro>

<group name="{{ thing }}" macro="process_matches">
{{ thing }} ({{ numbers }})
</group>
"""

data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""

parser = ttp(data, template)
parser.parse()
print(parser.result(format="pprint")[0])

The above code would produce:

[   {   'door': {'numbers': ['0']},
        'toilet': {'numbers': ['2', '6', '5', '10']},
        'window': {'numbers': ['1', '22', '4']}}]
