2

I have multiple strings that look like this: product: green apples price: 2.0 country: france company: somecompany. Some strings have fewer fields; for example, some are missing the company name or the country. I am trying to extract only the values and skip the labels product, price, country, and company. I tried to create multiple regexes, each of which starts from the left side of the string.

blah="product: green apples price: 2.0 country: france company: somecompany"

product_reg = re.compile(r'.*?\bproduct\b:(.*).*')
product_reg_strip = re.compile(r'(.*?)\s[a-z]:?')

product_full=re.findall(product_reg, blah)
prod=re.find(product_reg_strip, str(product_full))
print prod

price_reg = re.compile(r'.*?\bprice\b:(.*).*')
price_reg_strip = re.compile(r'(.*?)\s[a-z]:?')

price_full=re.findall(price_reg, blah)
price=re.find(price_reg_strip, str(price_full))
print price

But this is not working. What should I do to make this regex more sane?

2
  • Is price the only numerical value in each of the strings? Commented Apr 20, 2017 at 16:08
  • What do you want the output to be? In your example, is it green apples 2.0 france somecompany? Commented Apr 20, 2017 at 16:11

3 Answers 3

2

You can do this with a single regexp that uses named groups. Every field after product is optional, so the pattern still matches when price, country, or company is missing. Try this regexp with the global and multiline flags on regex101.com (https://regex101.com/r/iccVUv/1/):

^(?:product:(?P<product>.*?))(?:price:(?P<price>.*?))?(?:country:(?P<country>.*?))?(?:company:(?P<company>.*))?$

In Python you can, for example, do this:

import re

pattern = r'^(?:product:(?P<product>.*?))(?:price:(?P<price>.*?))?(?:country:(?P<country>.*?))?(?:company:(?P<company>.*))?$'
matches = re.search(pattern, 'product: green apples price: 2.0 country: italy company: italian company')

Now you can read a field by its group name:

product = matches.group('product')

Finally, check that the field actually matched and trim the surrounding spaces:

# group() returns None for an optional field that is absent from the string
if matches is not None and matches.group('product') is not None:
    product = matches.group('product').strip()
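
To make the check-and-trim step less repetitive, the same pattern can be wrapped in a small helper. This is only a sketch: `parse_fields` and `PATTERN` are names made up for illustration, and it assumes every string starts with product: as in the question.

```python
import re

# Same pattern as above, split over several lines for readability.
# Every field after "product" is optional.
PATTERN = re.compile(
    r'^(?:product:(?P<product>.*?))'
    r'(?:price:(?P<price>.*?))?'
    r'(?:country:(?P<country>.*?))?'
    r'(?:company:(?P<company>.*))?$'
)

def parse_fields(line):
    """Return a dict of the fields present in line, with whitespace trimmed."""
    match = PATTERN.search(line)
    if match is None:
        return {}
    # groupdict() maps each named group to its text, or None if it did not match
    return {name: value.strip()
            for name, value in match.groupdict().items()
            if value is not None}

full = parse_fields('product: green apples price: 2.0 country: italy company: italian company')
partial = parse_fields('product: green apples country: france')
print(full)
print(partial)
```

Fields missing from the input are simply left out of the returned dict, so `'price' in partial` tells you whether a price was found.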

1

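You could split the string like this:

import re

# "text" instead of "str", so the built-in str type is not shadowed
text = "product: green apples price: 2.0 country: france company: somecompany"
p = re.compile(r'(\w+:)')
# The capturing group keeps the labels in the result, alternating with the values
res = p.split(text)
print res
for i in range(len(res)):
    if (i%2):
        print res[i],' ==> ',res[i+1]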

Output:

['', 'product:', ' green apples ', 'price:', ' 2.0 ', 'country:', ' france ', 'company:', ' somecompany']

product:  ==>   green apples 
price:  ==>   2.0 
country:  ==>   france 
company:  ==>   somecompany
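
If a lookup table is more convenient than printed pairs, the alternating list produced by the split can be zipped into a dict. A minimal sketch follows, with `split_fields` as a hypothetical helper name; here the colon is moved outside the capturing group so the keys come back without it. Note that any word followed by a colon inside a value would also be treated as a label.

```python
import re

def split_fields(text):
    """Map each label to the trimmed text that follows it."""
    parts = re.split(r'(\w+):', text)
    # parts[0] is whatever precedes the first label; after that,
    # labels and values alternate.
    keys = parts[1::2]
    values = parts[2::2]
    return {key: value.strip() for key, value in zip(keys, values)}

fields = split_fields("product: green apples price: 2.0 country: france company: somecompany")
print(fields)
```

A string with fewer fields just yields a smaller dict, so no special handling is needed for missing labels.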


0

I'm not completely sure what you are after, but if the things you want to remove are a single word followed by a colon, the regex is pretty easy. Here are a couple of samples.

>>> import re
>>> blah="product: green apples price: 2.0 country: france company: somecompany"
>>> re.sub(r'\w+: ?', '', blah)
'green apples 2.0 france somecompany'
>>> re.split(r'\w+: ?', blah)[1:]
['green apples ', '2.0 ', 'france ', 'somecompany']
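
One caveat with `\w+:` is that it also strips any word followed by a colon inside a value. If the set of labels is known in advance, the pattern can be restricted to them. The sketch below assumes the four labels from the question are the only ones; `LABELS` and `values_only` are illustrative names.

```python
import re

# Assumed fixed set of labels, taken from the example string in the question.
LABELS = re.compile(r'\b(?:product|price|country|company): ?')

def values_only(text):
    """Return the values in order, dropping the known labels."""
    # [1:] drops whatever precedes the first label (normally an empty string)
    return [part.strip() for part in LABELS.split(text)[1:]]

plain = values_only("product: green apples price: 2.0 country: france company: somecompany")
tricky = values_only("product: tea: green price: 3.0")
print(plain)
print(tricky)
```

In the second call the value "tea: green" survives intact because "tea" is not one of the listed labels.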
