I want to extract a full URL from a string.

My code is:

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(re.match(r'(ftp|http)://.*\.(jpg|png)$', data))

Output:

None

Expected output:

http://www.google.com/a.jpg

I found many similar questions on Stack Overflow, but none of the answers worked for me, so I don't think this is a duplicate. Please help me! Thanks.

3 Answers

You were close!

Try this instead:

r'(ftp|http)://.*\.(jpg|png)'

You can visualize this here.

I would also make this non-greedy like this:

r'(ftp|http)://.*?\.(jpg|png)'

You can visualize this greedy vs. non-greedy behavior here and here.

By default, .* will match as much text as possible, but you want to match as little text as possible.
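
As a minimal sketch of the greedy vs. non-greedy difference, the snippet below runs both patterns on a made-up string containing two image URLs (the string is an assumption for illustration, not the data from the question):

```python
import re

# Hypothetical input with two image URLs (not from the question).
data = "x http://a.com/a.jpg y http://b.com/b.png z"

# Greedy: .* runs to the last ".jpg"/".png" in the string.
greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', data)

# Non-greedy: .*? stops at the first ".jpg"/".png" after the scheme.
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', data)

print(greedy.group(0))  # http://a.com/a.jpg y http://b.com/b.png
print(lazy.group(0))    # http://a.com/a.jpg
```

With only one URL in the string, as in the question, both patterns happen to give the same result; the difference shows up as soon as a second extension appears later in the text.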

Your $ anchors the match at the end of the string, but in your example the URL is followed by more text (>hhdhd), so the anchored pattern cannot match.

Another problem is that you're using re.match() instead of re.search(). re.match() only matches at the beginning of the string, whereas re.search() scans the string for the first position where the pattern matches. See here for more information.
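
To make both problems concrete, here is a small sketch using the string from the question. The variable names anchored and unanchored are just labels chosen for this example:

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"

anchored = r'(ftp|http)://.*\.(jpg|png)$'    # original pattern, with $
unanchored = r'(ftp|http)://.*?\.(jpg|png)'  # no $, non-greedy

# match() only tries position 0, where the string starts with "ahaha".
print(re.match(anchored, data))    # None
print(re.match(unanchored, data))  # None

# search() scans the string, but $ still requires the string to end in .jpg/.png.
print(re.search(anchored, data))   # None

# search() without $ finds the URL.
found = re.search(unanchored, data)
print(found.group(0))  # http://www.google.com/a.jpg
```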

4 Comments

Upvote for the visualization. However, the regex is not valid in some sense, e.g. http://.xjpg.
Thanks! I fixed the visualizations so the original regex correction is shown, as well as the greedy vs. non-greedy match. And yes, this isn't a great regex to match URLs in all forms, but that's answered elsewhere, and my goal is to show what the main problems with OP's regex are for the examples given :)
Thank you, I got it now!
No problem, glad to help!

You should use search instead of match.

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# Raw string so that \. reaches the regex engine unchanged
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))

1 Comment

Thanks, Worked for me

Find the start of the URL with find() on the scheme (http:// or ftp://), then find the end of the URL with find() on the extension (.jpg or .png), and take the substring between them. For the example string:

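data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')  # index where the URL begins
kk = data[start:]             # drop everything before the URL
end = kk.find('.jpg')         # index of the extension within kk
print(kk[:end + 4])           # + 4 keeps the ".jpg" itself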
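
The snippet above only handles http:// and .jpg. A rough sketch of how the same find()-based idea could cover both schemes and both extensions is shown below; the function name extract_url and its parameters are hypothetical, introduced only for this illustration:

```python
def extract_url(data, schemes=("http://", "ftp://"), extensions=(".jpg", ".png")):
    """Return the first scheme-to-extension substring, or None if not found."""
    # Earliest position at which any of the schemes occurs.
    starts = [data.find(s) for s in schemes]
    starts = [i for i in starts if i != -1]
    if not starts:
        return None
    start = min(starts)

    # Earliest end position of any extension found after the scheme.
    ends = []
    for ext in extensions:
        i = data.find(ext, start)
        if i != -1:
            ends.append(i + len(ext))
    if not ends:
        return None
    return data[start:min(ends)]


print(extract_url("ahahahttp://www.google.com/a.jpg>hhdhd"))
# http://www.google.com/a.jpg
```

Unlike the snippet above, this returns None instead of a wrong substring when no scheme or extension is present.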
