Extract URL from string in Python

Question

I want to extract a full URL from a string.

My code is:

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(re.match(r'(ftp|http)://.*\.(jpg|png)$', data))

Output:

None

Expected Output

http://www.google.com/a.jpg

I found many similar questions on Stack Overflow, but none of the answers worked for me, so I don't think this is a duplicate. Please help me! Thanks.

This has been answered lots of times, e.g. stackoverflow.com/questions/6883049/…
– apotry, Commented Feb 5, 2016 at 7:55

Will · Accepted Answer · 2016-02-05 08:16:58Z

4

You were close!

Try this instead:

r'(ftp|http)://.*\.(jpg|png)'

You can visualize this here.

I would also make this non-greedy like this:

r'(ftp|http)://.*?\.(jpg|png)'

You can visualize this greedy vs. non-greedy behavior here and here.

By default, .* will match as much text as possible, but you want to match as little text as possible.
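To make the difference concrete, here is a minimal sketch in Python 3. The string `text` with two image URLs is made up for illustration; it is not from the question.

```python
import re

# Hypothetical input containing two image URLs (not from the question).
text = "x http://a.com/1.jpg y http://b.com/2.png z"

greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', text)
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', text)

# Greedy .* runs on to the last ".jpg"/".png" in the string.
print(greedy.group(0))  # http://a.com/1.jpg y http://b.com/2.png
# Non-greedy .*? stops at the first one.
print(lazy.group(0))    # http://a.com/1.jpg
```

With only one URL in the string, as in the question, both patterns give the same result; the difference only shows up when more than one extension appears after the scheme.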

Your $ anchors the match at the end of the string, but in your example the URL is followed by more text (>hhdhd), so the pattern can never match with the $ in place.

Another problem is that you're using re.match() and not re.search(). Using re.match() starts the match at the beginning of the string, and re.search() searches anywhere in the string. See here for more information.
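A short sketch in Python 3 of the anchor and match/search points, using the string from the question. The variable names are just for illustration.

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
pattern = r'(ftp|http)://.*?\.(jpg|png)'

# re.match() only tries position 0, where the string starts with "ahaha".
at_start = re.match(pattern, data)
print(at_start)  # None

# With a trailing $, even re.search() fails, because ">hhdhd" follows the URL.
anchored = re.search(pattern + '$', data)
print(anchored)  # None

# re.search() without $ scans forward and finds the URL.
found = re.search(pattern, data)
print(found.group(0))  # http://www.google.com/a.jpg
```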

edited Feb 5, 2016 at 8:16

answered Feb 5, 2016 at 8:06

Will


4 Comments

AlexWei Over a year ago

Upvote for the visualization. However, the regex is not valid in some sense, e.g. for http://.xjpg.

Will Over a year ago

Thanks! I fixed the visualizations so the original regex correction is shown, as well as the greedy vs. non-greedy match. And yes, this isn't a great regex to match URLs in all forms, but that's answered elsewhere, and my goal is to show what the main problems with OP's regex are for the examples given :)

shiv shankar Over a year ago

Thank you, I got it now!

Will Over a year ago

No problem, glad to help!

Wang Wei Qiang · Accepted Answer · 2016-02-05 08:15:47Z

1

You should use search instead of match.

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))
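As a side note, group(0) is the whole match, while the two parenthesized groups capture only the scheme and the extension. A small Python 3 sketch (the name `m` is illustrative); the last two lines show why this matters if you later switch to re.findall():

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
m = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if m:
    print(m.group(0))  # http://www.google.com/a.jpg
    print(m.group(1))  # http
    print(m.group(2))  # jpg

# findall() returns the captured groups, not the whole match...
groups_only = re.findall(r'(ftp|http)://.*?\.(jpg|png)', data)
print(groups_only)  # [('http', 'jpg')]
# ...unless the groups are made non-capturing with (?:...).
whole = re.findall(r'(?:ftp|http)://.*?\.(?:jpg|png)', data)
print(whole)  # ['http://www.google.com/a.jpg']
```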

answered Feb 5, 2016 at 8:15

Wang Wei Qiang


1 Comment

shiv shankar Over a year ago

Thanks, worked for me.

Sai Sriharsha Annepu · Accepted Answer · 2016-02-05 08:12:34Z

0

Find the start of the URL using find('http://') or find('ftp://'). Find the end of the URL using find('.jpg') or find('.png'). Then take the substring between them:

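data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')   # index where the URL begins
kk = data[start:]              # everything from the URL onward
end = kk.find('.jpg')          # index of the extension within kk
print(kk[0:end+4])             # +4 keeps the '.jpg' itself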
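The snippet above only handles http:// and .jpg, and it assumes both are present (find() returns -1 when they are not). One possible way to cover both schemes, both extensions, and the not-found case is sketched below in Python 3. The helper name `extract_url` and its parameters are made up for this example.

```python
def extract_url(text, schemes=("http://", "ftp://"), extensions=(".jpg", ".png")):
    """Return the first scheme-to-extension substring, or None if not found."""
    # Earliest position at which any of the schemes occurs.
    starts = [i for i in (text.find(s) for s in schemes) if i != -1]
    if not starts:
        return None
    start = min(starts)

    # Earliest end position of any extension found after the start.
    ends = []
    for ext in extensions:
        i = text.find(ext, start)
        if i != -1:
            ends.append(i + len(ext))
    if not ends:
        return None
    return text[start:min(ends)]


data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(extract_url(data))  # http://www.google.com/a.jpg
```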

edited Feb 5, 2016 at 8:12

answered Feb 5, 2016 at 8:05

Sai Sriharsha Annepu
