regular expression http in python

Question

I have a long text and I would like to obtain all the entries in the text that match the following pattern:

http******.id.txt, where the asterisks stand for any characters (of unknown length), and the dots are literal dots in the text. I'd like to get a list of all the entries that match this pattern.

One of the many tries was,

c = re.match(r'^(http)(.*)id.txt', b)

I also tried,

c = re.findall(r'(http)(.*)fastq.gz', b)

but neither of them gives a list of http***.fastq.gz entries.

Thanks!

I strongly suspect that you could be a bit more precise about what you mean when you say that * could be "any entry." Perhaps it could be any number of non-whitespace characters (r'http\S*', for example)? Or it might be any number of characters other than certain bits of punctuation (r'http[^.,; \t\n]*', for example). Be more specific about how you'd know that you've hit the end of one of these strings, and then figure out how to represent that as a regular expression atom.
– Jim Dennis, Commented Sep 10, 2013 at 19:50
@JimDennis you are completely right; however, I needed a fast solution, and the ones provided here ended up solving my problem. But in general the definition I gave is quite sloppy, thanks.
– Dnaiel, Commented Sep 10, 2013 at 20:06
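The suggestion in the comment above can be sketched as follows. This is only an illustration under assumptions: the sample text and URLs are made up, and it assumes that an entry never contains whitespace, which may not hold for the asker's real data.

```python
import re

# Hypothetical sample text; these URLs are invented for illustration.
text = "see http://host/a.id.txt, then http://host/b.id.txt and http://host/c.html"

# \S*? matches as few non-whitespace characters as possible,
# so a match can never run across a space into the next entry.
pattern = re.compile(r'http\S*?\.id\.txt')

matches = pattern.findall(text)
print(matches)  # ['http://host/a.id.txt', 'http://host/b.id.txt']
```

Because the "any entry" part is restricted to non-whitespace characters, the third URL (which does not end in .id.txt) is skipped instead of being glued onto a later match.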

Maxime Lorant · Accepted Answer · 2013-09-10 19:23:42Z

1

Have you tried to use re.findall?

import re

b = 'http://match.id.txt --- blablabla --- http://match2.id.txt'
matches = re.findall(r'http.*?\.id\.txt', b)
print(matches)  # ['http://match.id.txt', 'http://match2.id.txt']

The ? right after the .* makes the quantifier non-greedy, so it matches as little as possible. Without it, the pattern matches the whole string, with .* consuming ://match.id.txt --- blablabla --- http://match2. See a regex tutorial to learn more about greedy and non-greedy matching.
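To make the difference concrete, here is a small sketch that runs the greedy and non-greedy patterns on the same sample string. It also shows what re.findall returns when the pattern contains capturing groups, as in the attempts from the question; that may be one reason those attempts did not produce a plain list of URLs, though this is an assumption about the asker's data.

```python
import re

b = 'http://match.id.txt --- blablabla --- http://match2.id.txt'

# Greedy: .* runs to the LAST '.id.txt', so there is one long match.
greedy = re.findall(r'http.*\.id\.txt', b)

# Non-greedy: .*? stops at the FIRST '.id.txt' after each 'http'.
lazy = re.findall(r'http.*?\.id\.txt', b)

# With capturing groups, findall returns tuples of the groups,
# not the full matched text.
grouped = re.findall(r'(http)(.*?)\.id\.txt', b)

print(greedy)   # ['http://match.id.txt --- blablabla --- http://match2.id.txt']
print(lazy)     # ['http://match.id.txt', 'http://match2.id.txt']
print(grouped)  # [('http', '://match'), ('http', '://match2')]
```

Dropping the parentheses, or using non-capturing groups such as (?:http), makes findall return the whole match again.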


edited Sep 10, 2013 at 19:23

answered Sep 10, 2013 at 19:17

Maxime Lorant


3 Comments

Dnaiel Over a year ago

I tried it as well, but it does not give me the list as you mentioned

Dnaiel Over a year ago

Nice, thanks! It still doesn't work for my text, since it has a lot more odd characters. I'm trying to figure out what's specific about my text.

Ofir Israel Over a year ago

Why don't you just share with us a bit of your text?

gkovacs · Accepted Answer · 2013-09-10 19:20:23Z

0

You need to escape your dot characters with a backslash ('\'), because an unescaped . (dot) in a regex matches any character. Example:

c = re.match(r'^(http)(.*)\.id\.txt', b)
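As a rough illustration of why the escaping matters, the sketch below compares an unescaped and an escaped pattern on a string that contains no dots at all. The sample strings are made up; re.escape is the standard-library helper for escaping a literal suffix so the backslashes do not have to be written by hand.

```python
import re

# Hypothetical inputs: the first has literal dots, the second has none.
with_dots = 'http://host/file.id.txt'
without_dots = 'http://host/fileXidYtxt'

unescaped = r'http.*id.txt'     # '.' matches any character
escaped = r'http.*\.id\.txt'    # '\.' matches only a literal dot

# The unescaped pattern wrongly accepts the string without dots.
false_positive = re.search(unescaped, without_dots) is not None
rejected = re.search(escaped, without_dots) is None
accepted = re.search(escaped, with_dots) is not None

# re.escape builds the escaped suffix from the literal text.
built = 'http.*?' + re.escape('.id.txt')
built_matches = re.findall(built, with_dots)

print(false_positive, rejected, accepted)  # True True True
print(built_matches)                       # ['http://host/file.id.txt']
```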

answered Sep 10, 2013 at 19:20

gkovacs
