1

I have a long text and I would like to extract every entry in it that matches the following pattern:

http******.id.txt, where * stands for any characters (of unknown length) and the dots are literal dots in the text. I'd like to get a list of all the entries that match this pattern.

One of my many attempts was:

c = re.match(r'^(http)(.*)id.txt', b)

I also tried:

c = re.findall(r'(http)(.*)fastq.gz', b)

but neither of them gives a list of http***.fastq.gz entries.

Thanks!

4
  • What do you mean when you say "it doesn't work"? Commented Sep 10, 2013 at 19:18
  • I mean it does not give me the list I want. Commented Sep 10, 2013 at 19:20
  • I strongly suspect that you could be a bit more precise where you say that * could be "any entry." Perhaps it could be any number of non-whitespace characters (r'(http\S*)', for example). Or it might be any number of characters other than certain bits of punctuation (r'http[^.,; \t\n]*', for example). Be more specific about how you'd know that you've hit the end of one of these strings, and then figure out how to represent that as a regular expression atom. Commented Sep 10, 2013 at 19:50
  • @JimDennis you are completely right; however, I needed a fast solution, and the ones provided ended up solving my problem. But in general the definition I gave is quite sloppy, thanks. Commented Sep 10, 2013 at 20:06

2 Answers

1

Have you tried using re.findall?

import re

b = 'http://match.id.txt --- blablabla --- http://match2.id.txt'
matches = re.findall(r'http.*?\.id\.txt', b)
print(matches)  # ['http://match.id.txt', 'http://match2.id.txt']

The ? just after the .* makes the quantifier non-greedy, so it matches as few characters as possible. Without it, the pattern matches the whole string in one go, with .* = ://match.id.txt --- blablabla --- http://match2. See a regex tutorial on greedy versus non-greedy matching to learn more.
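As a minimal sketch of the difference, the snippet below runs the greedy and the non-greedy pattern on the same sample string used above. The variable names text, greedy, and lazy are only illustrative.

```python
import re

# Same sample string as in the answer; not the asker's real data.
text = 'http://match.id.txt --- blablabla --- http://match2.id.txt'

# Greedy: .* grabs as much as possible, so both URLs end up in one match.
greedy = re.findall(r'http.*\.id\.txt', text)

# Non-greedy: .*? stops at the first place where \.id\.txt can follow.
lazy = re.findall(r'http.*?\.id\.txt', text)

print(greedy)  # ['http://match.id.txt --- blablabla --- http://match2.id.txt']
print(lazy)    # ['http://match.id.txt', 'http://match2.id.txt']
```

Note that . does not match a newline by default, so in a multi-line text neither pattern will produce a match that crosses a line break unless re.DOTALL is passed.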



3 Comments

I tried it as well, but it does not give me the list as you mentioned
Nice, thanks! It still doesn't work for my text, since it has many more odd characters; I'm trying to figure out what's specific about my text.
Why don't you just share with us a bit of your text?
0

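You should escape the dot characters with a backslash ('\'), because an unescaped . (dot) in a regex matches any character. Also note that re.match only looks at the start of the string and returns at most one match, so use re.findall to get a list. Example:

c = re.findall(r'http.*?\.id\.txt', b)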
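To illustrate why the escaping matters, here is a small sketch on a made-up text (the URLs and the names text, escaped, unescaped, and first_only are assumptions for the example, not the asker's data). It also borrows the \S idea from the comments on the question, on the assumption that an entry never contains whitespace, and shows what re.match returns when the text does not start with a URL.

```python
import re

# Hypothetical multi-line text: one URL that does not end in .id.txt,
# two that do, and one token that only looks similar.
text = """see http://example.org/readme.html and http://example.org/a.id.txt
also http://example.org/b.id.txt, plus httpXid-txt which should not match"""

# \S*? keeps each match inside one whitespace-free token,
# and \. matches a literal dot only.
escaped = re.findall(r'http\S*?\.id\.txt', text)

# With unescaped dots, "." matches any character, so httpXid-txt slips in.
unescaped = re.findall(r'http\S*?.id.txt', text)

# re.match only tries at position 0, and this text starts with "see".
first_only = re.match(r'http\S*?\.id\.txt', text)

print(escaped)     # ['http://example.org/a.id.txt', 'http://example.org/b.id.txt']
print(unescaped)   # the two URLs above plus 'httpXid-txt'
print(first_only)  # None
```

If entries in the real text can contain spaces, \S is too strict, and the character class would need to be adjusted to whatever actually delimits an entry.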
