
Consider the following string:

string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:

>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
  • Don't try to parse HTML with regexp. Look for an HTML parser that can extract the href value for you. Commented Jul 30, 2011 at 12:28
  • @Judge John Deed: better be lazy. Commented Jul 30, 2011 at 13:02
  • See: stackoverflow.com/questions/9760588/… Commented Aug 11, 2015 at 21:43

2 Answers

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://2.example']

12 Comments

In any sort of normal scraping where the text portion of the href is also a link rather than descriptive text, this just gives duplicates.
For those modifying this regex, note that the '-' in the [$-_@.&+] is acting as a range operator and not a literal character. This means certain characters (e.g., ',') are represented more than once.
This regex does not consider URL fragments (the # suffix).
How can this be used to catch URLs without http? Like www.google.com or google.com (see the sketch after these comments).
It doesn't work for the following text: "http://lubimyczytac.pl/ksiazka/57710/nowy-umysl-cesarza-o-komputerach-umysle-i-prawach-fizyki':"
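
Two of the comments above point at real gaps: scheme-less URLs such as www.google.com, and #fragments. Here is a loose, illustrative tweak of the pattern that covers both; this heuristic is an assumption, not part of the answer above:

import re

text = 'See www.google.com, https://example.com/page#intro and http://2.example here.'

# Also accept a bare "www." host, and let the tail keep paths and #fragments.
# Deliberately loose; a complete URL grammar is far more involved.
pattern = r'(?:https?://|www\.)[-\w.]+(?:[/#][^\s"<>]*)?'

>>> re.findall(pattern, text)
['www.google.com', 'https://example.com/page#intro', 'http://2.example']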

The best answer is...

Don't use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?

Parse the HTML instead

For many tasks, Beautiful Soup will be far faster and easier to use:

>>> from bs4 import BeautifulSoup as Soup
>>> s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']

If you prefer not to use external tools, you can use Python's own built-in HTML parsing library directly. Here's a really simple subclass of HTMLParser that does exactly what you want:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        # Collect hrefs here; accept a caller-supplied list if one is given.
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep the href of each <a> tag.
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://2.example']

You could even create a new method that accepts a string, calls feed, and returns output_list. This is a far more powerful and extensible way to extract information from HTML than regular expressions.
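
For example, here is a minimal sketch of that wrapper. The class name URLExtractor and the method name get_urls are illustrative choices, not part of the answer above:

from html.parser import HTMLParser

class URLExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        # Keep the href of every <a> start tag we encounter.
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, text):
        # Convenience method: feed the string, return the collected hrefs.
        self.feed(text)
        return self.output_list

>>> URLExtractor().get_urls(s)
['http://example.com', 'http://2.example']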

8 Comments

What's all the __init__ and self stuff?
Beautiful Soup is great if you need to parse href or src as asked in the initial question, and this should be the accepted answer, but beware it won't help to find URLs in plain strings.
This doesn't answer the question, though. The question is about the format of URLs, not how to parse HTML.
@AlSweigart, I think it's reasonable to say that the body of the question asks about parsing HTML.
@AlSweigart, thanks for editing the title. I was thinking about this and realized that by my own logic, I should actually edit the title. Then I saw that you had done so already!