How do you extract a url from a string using python?

Question

For example:

string = "This is a link http://www.google.com"

How could I extract 'http://www.google.com' ?

(Each link will be in the same format, i.e. starting with 'http://'.)

You might check out this answer: stackoverflow.com/questions/499345/…
– rjz, Commented Mar 18, 2012 at 17:42
If this is for a raw text file (as expressed in your question), you might check this answer: stackoverflow.com/questions/839994/extracting-a-url-in-python
– Alexandre Dulaunoy, Commented Mar 18, 2012 at 17:45
Possible duplicate of What is the best regular expression to check if a string is a valid URL?
– Yash Kumar Verma, Commented Sep 28, 2017 at 9:49

Abhijit · Accepted Answer · 2012-03-18 17:48:48Z

53

There may be several ways to do this, but the cleanest is to use a regular expression:

>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com

If there can be multiple links, you can use something like the following:

>>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']

answered Mar 18, 2012 at 17:48

Abhijit


3 Comments

tripleee Over a year ago

This is too crude for many real-world scenarios. It fails entirely for ftp:// URLs and mailto: URLs etc, and will naïvely grab the tail part from <a href="http://google.com/">Click here</a> (i.e. up through "click").

teewuane Over a year ago

@tripleee The question isn't about parsing HTML, but finding a URL in a string of text that will always be http format. So this works really well for that. But yes, pretty important for people to know what you're saying if they're here for parsing HTML or similar.

Paolo Rovelli Over a year ago

Keep in mind, though, that the above regex will also match invalid URLs. For example: myString = "This is not a link http://not-a-valid-url"
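The two objections raised in these comments (grabbing the tail of an HTML tag, and accepting hosts that are not real domains) can be reduced with a small post-processing step. The following is only a sketch: the names `CANDIDATE` and `extract_http_urls` are made up for illustration, and the "host must contain a dot" rule is an assumed heuristic, not a real validity check (it would reject `http://localhost`, for instance).

```python
import re
from urllib.parse import urlparse

# Stop at whitespace and at characters that usually end a URL inside HTML.
CANDIDATE = re.compile(r"https?://[^\s<>\"']+")

def extract_http_urls(text):
    """Return http(s) URLs whose host looks like a dotted domain name."""
    urls = []
    for candidate in CANDIDATE.findall(text):
        host = urlparse(candidate).hostname
        # Heuristic (assumption): require at least one dot in the host.
        if host and "." in host:
            urls.append(candidate)
    return urls

print(extract_http_urls('<a href="http://google.com/">Click here</a>'))  # ['http://google.com/']
print(extract_http_urls("This is not a link http://not-a-valid-url"))    # []
```

For plain text that always uses the http format, as in the question, the simpler pattern from the answer is usually enough.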

Jonathan Leffler · Accepted Answer · 2017-07-06 00:21:15Z

34

Another easy way to extract URLs from text is the urlextract package. Install it via pip:

pip install urlextract

and then you can use it like this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']

You can find more info on my GitHub page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to stay up to date, so if your program does not have internet access, it is not for you.

edited Jul 6, 2017 at 0:21

Jonathan Leffler


answered Feb 15, 2017 at 16:40

user7580408

3 Comments

Henrik Over a year ago

Works like a charm, and doesn't clutter the rest of my script.

autonopy Over a year ago

Unfortunately, this fails whenever there is text (i.e., not space) attached to the beginning or end of the url. e.g. ok/https://www.duckduckgo.com won't catch the url in it.

Tom Over a year ago

This is generally a great tool. However, it doesn't properly address text adjacent to a url, such as a line break ('\n') immediately following the url. It appends that to the identified url.

Paolo Rovelli · Accepted Answer · 2023-06-07 11:32:33Z

31

In order to find a web URL in a generic string, you can use a regular expression (regex). A relatively simple one like the following should fit your use case.

    import re

    string = "This is a link http://www.google.com"
    #string = "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore"

    regex = r'('
    # Scheme (HTTP, HTTPS, FTP and SFTP):
    regex += r'(?:(https?|s?ftp):\/\/)?'
    # www:
    regex += r'(?:www\.)?'
    regex += r'('
    # Host and domain (including ccSLD):
    regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'
    # TLD:
    regex += r'([A-Z]{2,6})'
    # IP Address:
    regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
    regex += r')'
    # Port:
    regex += r'(?::(\d{1,5}))?'
    # Query path:
    regex += r'(?:(\/\S+)*)'
    regex += r')'
    
    find_urls_in_string = re.compile(regex, re.IGNORECASE)
    url = find_urls_in_string.search(string)
    if url is not None and url.group(0) is not None:
        print("URL parts: " + str(url.groups()))  # OUTPUT: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
        print("URL: " + url.group(0).strip())     # OUTPUT: URL: http://www.google.com

NOTE: To find several URLs in a single string, you can reuse the same regex with findall() or finditer() instead of search(). Because the pattern contains capturing groups, findall() returns one tuple of groups per match, with the full URL as the first element.

That said, please keep in mind that the above regex is neither complete nor precise. It may match some invalid URIs or fail to match some valid ones (e.g., mailto:[email protected])!
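As a sketch of the multiple-URL case, the same pattern can be compiled once and iterated with finditer(), which gives access to the full match through group(0). The name `URL_PATTERN` and the sample text are assumptions made for this example; the pattern itself is the one assembled above, written as a single expression.

```python
import re

# Same pattern as above, joined into one expression (name is illustrative).
URL_PATTERN = re.compile(
    r'((?:(https?|s?ftp):\/\/)?(?:www\.)?'
    r'((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})'
    r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))'
    r'(?::(\d{1,5}))?(?:(\/\S+)*))',
    re.IGNORECASE,
)

text = "See http://www.google.com and https://stackoverflow.com/questions/839994"

# finditer() yields match objects; group(0) is the whole matched URL.
urls = [match.group(0) for match in URL_PATTERN.finditer(text)]
print(urls)

# findall() on a pattern with several groups returns tuples instead.
print(URL_PATTERN.findall(text)[0])
```

Using group(0) avoids having to remember which tuple position holds the full URL.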

You could make the regex more precise, for example, by ensuring that the TLD is a valid one (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):

    # TLD:
    regex += r'(com|net|org|eu|...)'
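One way to avoid typing the alternation by hand is to build it from a list. This is a simplified sketch, not the full regex above: `KNOWN_TLDS` is a deliberately tiny, illustrative subset (a real list would come from the IANA file), and `build_tld_group` and `find_hosts` are hypothetical helper names.

```python
import re

# Assumption: a tiny illustrative subset, not the real IANA list.
KNOWN_TLDS = ["com", "net", "org", "eu"]

def build_tld_group(tlds):
    """Build a capturing alternation such as (com|net|org|eu)."""
    # Longest entries first, so a longer TLD is tried before a shorter one.
    ordered = sorted(tlds, key=len, reverse=True)
    return "(" + "|".join(re.escape(tld) for tld in ordered) + ")"

# Host labels followed by one of the known TLDs, ending on a word boundary.
HOST_PATTERN = re.compile(
    r"\b(?:[a-z0-9-]+\.)+" + build_tld_group(KNOWN_TLDS) + r"\b",
    re.IGNORECASE,
)

def find_hosts(text):
    return [match.group(0) for match in HOST_PATTERN.finditer(text)]

print(find_hosts("Open notes.txt or visit example.org today"))  # ['example.org']
```

Building the group from data also sidesteps a fixed length limit on the TLD, since the alternation accepts whatever entries the list contains.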

EDITED:

The most accurate approach to find a web URL in a generic string is probably to simply split the string and validate each sub-string using validators or a similar library.

import validators

string = "This is a link http://www.google.com"
#string = "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore"

for substring in string.split(" "):
    if validators.url(substring):
        print("URL: " + substring)
    if validators.ip_address.ipv4(substring) or validators.ip_address.ipv6(substring):
        print("IP Address: " + substring)
    if validators.email(substring):
        print("Email Address: " + substring)

edited Jun 7, 2023 at 11:32

answered Aug 11, 2015 at 21:16

Paolo Rovelli


3 Comments

luckydonald Over a year ago

So, the regex ends up being

((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))

Also note that the TLD list now includes fun endings like XN--VERMGENSBERATUNG-PWB, which is 24 characters long and will not be caught by this.

Mr_and_Mrs_D Over a year ago

Would be better to add (?i) to the pattern - more portable. Also, bear in mind this will match 23.084.828.566 which is not a valid IP address but is a valid float in some locales.

Jorge Orpinel Pérez Over a year ago

There's some sort of length limit to this regex e.g: docs.google.com/spreadsheets/d/10FmR8upvxZcZE1q9n1o40z16mygUJklkXQ7lwGS4nlI just matches docs.google.com/spreadsheets/d/10FmR8upvxZcZE1q9n.
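Regarding the comment above about 23.084.828.566: one possible way to reject such strings is to pass each dotted-quad candidate to the standard library's ipaddress module. This is a sketch under assumptions: `find_ipv4` and the candidate pattern are illustrative names and choices, not part of the answer.

```python
import ipaddress
import re

# Candidate dotted quads; the numeric range check is left to ipaddress.
DOTTED_QUAD = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def find_ipv4(text):
    """Return only the candidates that parse as valid IPv4 addresses."""
    found = []
    for candidate in DOTTED_QUAD.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            # Raised for out-of-range octets such as 828 or 566.
            continue
        found.append(candidate)
    return found

print(find_ipv4("hosts 192.168.0.1 and 23.084.828.566"))  # ['192.168.0.1']
```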

Artem Bernatskyi · Accepted Answer · 2018-05-28 14:13:36Z

This extracts all URLs, including their parameters. None of the examples above worked for me.

import re

data = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'

WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
print(re.findall(WEB_URL_REGEX, data))

Comsavvy · Accepted Answer · 2021-05-29 10:45:22Z

You can extract URLs from a string using the following patterns.

1.

>>> import re
>>> string = "This is a link http://www.google.com"
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?'
>>> re.search(pattern, string).group()
'http://www.google.com'

>>> TWEET = ('New Pybites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
>>> re.search(pattern, TWEET).group()
'http://pybit.es/requests-cache.html'

>>> tweet = ('Pybites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
             ' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter']

To take the above pattern to the next level, we can also detect hashtags along with URLs in the following way:

2.

>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?|#[.\w]*'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The above example for extracting URLs and hashtags can be shortened to:

>>> pattern = r'((?:#|http)\S+)'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The pattern below matches two alphanumeric strings separated by "." as a URL:

>>> pattern = r'(?:http://)?\w+\.\S*[^.\s]'

>>> tweet = ('PyBites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'www.google.com/telephone/wire....  '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter '
             "http://-www.pip.org "
             "google.com "
             "twitter.com "
             "facebook.com"
             ' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['www.google.com/telephone/wire', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', 'www.pip.org', 'google.com', 'twitter.com', 'facebook.com']

You can try any complicated URL with patterns 1 and 2. To learn more about the re module in Python, check out "Regexes in Python" by Real Python.

Cheers!

Caumons · Accepted Answer · 2022-02-01 20:55:15Z

4

I've used a slight variation from @Abhijit's accepted answer.

This one uses \S instead of [^\s], which is equivalent but more concise. It also doesn't use a named group: there is only one group, so we can omit the name for simplicity:

import re

my_string = "This is my tweet check it out http://example.com/blah"
print(re.search(r'(https?://\S+)', my_string).group())

Of course, if there are multiple links to extract, just use .findall():

print(re.findall(r'(https?://\S+)', my_string))
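Two practical details are worth keeping in mind with this approach: re.search() returns None when nothing matches, so calling .group() directly would raise an AttributeError, and \S+ also swallows punctuation that directly follows a link in running text. A minimal sketch of one way to handle both, where `first_url`, `all_urls` and the `TRAILING` character set are names and choices made up for this example:

```python
import re

URL_RE = re.compile(r'https?://\S+')

# Assumption: punctuation that is more likely sentence text than URL text.
TRAILING = ".,;:!?)\"'"

def first_url(text):
    """Return the first URL in text, or None if there is none."""
    match = URL_RE.search(text)
    return match.group().rstrip(TRAILING) if match else None

def all_urls(text):
    """Return every URL in text with trailing punctuation stripped."""
    return [url.rstrip(TRAILING) for url in URL_RE.findall(text)]

print(first_url("No links here"))  # None
print(all_urls("Check http://example.com/blah, then https://example.org."))
# ['http://example.com/blah', 'https://example.org']
```

Stripping is a trade-off: a URL that legitimately ends in one of these characters would be shortened too.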

answered Feb 1, 2022 at 20:55

Caumons

