
I am doing a project on web crawling, for which I need to find all links within a given web page. Until now I have been using urljoin from urllib.parse, but I have now found that some links are not joined properly by the urljoin function.

For example, the <a> tag might be something like <a href="a.xml?value=basketball">A</a>. The complete address should be http://www.example.org/main/test/a.xml?value=basketball, but the urljoin function gives the wrong result (something like http://www.example.org/main/a.xml?value=basketball).

Code which I am using:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

parentUrl = urlQueue.get()

html = get_page_source(parentUrl)

bSoup = BeautifulSoup(html, 'html.parser')
aTags = bSoup.find_all('a', href=True)

for aTag in aTags:
    childUrl = aTag.get('href')

    # just to check whether the URL is already absolute (for .com only)
    if '.com' not in childUrl:
        # this urljoin is giving invalid results, as mentioned above
        childUrl = urljoin(parentUrl, childUrl)

Is there any way to correctly join two URLs, including in cases like these?

3 Comments
  • You are more likely to get help if you provide minimal working code to build on. Commented Aug 16, 2016 at 9:19
  • Tell me if you need something else... However my main concern is to make absolute link address using href attribute, which sometimes may not contain the complete path. Commented Aug 16, 2016 at 11:52
  • Delete the NOTE. It is a browser feature. Commented Aug 16, 2016 at 12:31

1 Answer


Just a small tweak gets this working: in your case, pass the base URI with a trailing slash. Everything you need to accomplish this is described in the docs for urlparse:

>>> import urlparse
>>> urlparse.urljoin('http://www.example.org/main/test','a.xml?value=basketball')
'http://www.example.org/main/a.xml?value=basketball'
>>> urlparse.urljoin('http://www.example.org/main/test/','a.xml?value=basketball')
'http://www.example.org/main/test/a.xml?value=basketball'
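In Python 3 the same function lives in urllib.parse, which the question is already importing from; only the import line differs from the session above:

```python
from urllib.parse import urljoin

# Without a trailing slash, "test" is treated as a document and dropped
# during resolution, so the relative href replaces it.
print(urljoin('http://www.example.org/main/test', 'a.xml?value=basketball'))
# → http://www.example.org/main/a.xml?value=basketball

# With the trailing slash, "test" is treated as a directory and kept.
print(urljoin('http://www.example.org/main/test/', 'a.xml?value=basketball'))
# → http://www.example.org/main/test/a.xml?value=basketball
```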

BTW: this is a perfect use case for factoring the URL-building code out into a separate function. Then write some unit tests to verify it works as expected, including your edge cases, and afterwards use it in your web crawler code.
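A minimal sketch of that refactoring (the name make_absolute_url is my own choice, and the function assumes the parent URL always names a directory, which is only safe if that holds for your crawler's URLs):

```python
from urllib.parse import urljoin

def make_absolute_url(base_url, href):
    """Join href against base_url, ensuring the base ends with a
    trailing slash so its last path segment is kept as a directory."""
    if not base_url.endswith('/'):
        base_url += '/'
    return urljoin(base_url, href)

# unit-test-style checks, including the case from the question
assert make_absolute_url('http://www.example.org/main/test',
                         'a.xml?value=basketball') == \
    'http://www.example.org/main/test/a.xml?value=basketball'
# an already-absolute href passes through unchanged
assert make_absolute_url('http://www.example.org/main/test',
                         'http://other.example.com/b.xml') == \
    'http://other.example.com/b.xml'
```

Note that urljoin already handles absolute hrefs, fragments, and scheme-relative URLs; the helper only normalizes the trailing slash.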


2 Comments

This code works in Python 2.7, but it applies to Python 3.5 as well if you use urllib.parse, as the OP mentioned.
Thank you, sir. It seems to work on the cases I tried. Let me test it completely before accepting this answer.
