
Consider the following string:

string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:

>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
  • Don't try to parse HTML with regexp. Look for an HTML parser that can extract the href value for you. Commented Jul 30, 2011 at 12:28
  • @Judge John Deed: better be lazy. Commented Jul 30, 2011 at 13:02
  • See: stackoverflow.com/questions/9760588/… Commented Aug 11, 2015 at 21:43

2 Answers

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://2.example']

12 Comments

In any sort of normal scraping where the text portion of the href is also a link rather than descriptive text, this just gives duplicates.
For those modifying this regex, note that the '-' in the [$-_@.&+] is acting as a range operator and not a literal character. This means certain characters (e.g., ',') are represented more than once.
This regex does not consider URL fragments (the # suffix).
How can this be used to catch URLs without http? Like www.google.com or google.com (see the sketch after these comments).
It doesn't work for the following text: "http://lubimyczytac.pl/ksiazka/57710/nowy-umysl-cesarza-o-komputerach-umysle-i-prawach-fizyki':"
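
Two of the comments above point at real gaps: scheme-less URLs such as www.google.com, and #fragments. Here is a loose, illustrative tweak of the pattern that covers both; this heuristic is an assumption, not part of the answer above:

import re

text = 'See www.google.com, https://example.com/page#intro and http://2.example here.'

# Also accept a bare "www." host, and let the tail keep paths and #fragments.
# Deliberately loose; a complete URL grammar is far more involved.
pattern = r'(?:https?://|www\.)[-\w.]+(?:[/#][^\s"<>]*)?'

>>> re.findall(pattern, text)
['www.google.com', 'https://example.com/page#intro', 'http://2.example']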

The best answer is...

Don't use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the URL, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?

Parse the HTML instead

For many tasks, Beautiful Soup will be far faster and easier to use:

>>> from bs4 import BeautifulSoup as Soup
>>> s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']

If you prefer not to use external tools, you can use Python's own built-in HTML parsing library directly. Here's a really simple subclass of HTMLParser that does exactly what you want:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        # Collect hrefs here; accept a caller-supplied list if one is given.
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep the href of each <a> tag.
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://2.example']

You could even create a new method that accepts a string, calls feed, and returns output_list. This is a far more powerful and extensible way to extract information from HTML than regular expressions.
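
For example, here is a minimal sketch of that wrapper. The class name URLExtractor and the method name get_urls are illustrative choices, not part of the answer above:

from html.parser import HTMLParser

class URLExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        # Keep the href of every <a> start tag we encounter.
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, text):
        # Convenience method: feed the string, return the collected hrefs.
        self.feed(text)
        return self.output_list

>>> URLExtractor().get_urls(s)
['http://example.com', 'http://2.example']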

8 Comments

What's all the __init__ and self stuff?
Beautiful Soup is great if you need to parse href or src as asked in the initial question, and this should be the accepted answer, but beware it won't help to find URLs in plain strings.
This doesn't answer the question, though. The question is about the format of URLs, not how to parse HTML.
@AlSweigart, I think it's reasonable to say that the body of the question asks about parsing HTML.
@AlSweigart, thanks for editing the title. I was thinking about this and realized that by my own logic, I should actually edit the title. Then I saw that you had done so already!