
I am doing a project on web crawling, for which I need to find all links within a given web page. Until now I have been using urljoin from urllib.parse, but I have now found that some links are not joined properly by the urljoin function.

For example, the <a> tag might be something like <a href="a.xml?value=basketball">A</a>. The complete address should be http://www.example.org/main/test/a.xml?value=basketball, but the urljoin function gives the wrong result (something like http://www.example.org/main/a.xml?value=basketball).

Code which I am using:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

parentUrl = urlQueue.get()

html = get_page_source(parentUrl)

bSoup = BeautifulSoup(html, 'html.parser')
aTags = bSoup.find_all('a', href=True)

for aTag in aTags:
    childUrl = aTag.get('href')

    # just to check whether the URL is already absolute (for .com only)
    if '.com' not in childUrl:
        # this urljoin is giving invalid results, as mentioned above
        childUrl = urljoin(parentUrl, childUrl)

Is there any way to correctly join two URLs, including in cases like these?

3 Comments
  • You are more likely to get help if you provide minimal working code to build on. Commented Aug 16, 2016 at 9:19
  • Tell me if you need something else... However my main concern is to make absolute link address using href attribute, which sometimes may not contain the complete path. Commented Aug 16, 2016 at 11:52
  • Delete the NOTE. It is a browser feature. Commented Aug 16, 2016 at 12:31

1 Answer


Just a small tweak gets this working: in your case, pass the base URI with a trailing slash. Everything you need to accomplish this is described in the docs for urlparse:

>>> import urlparse
>>> urlparse.urljoin('http://www.example.org/main/test','a.xml?value=basketball')
'http://www.example.org/main/a.xml?value=basketball'
>>> urlparse.urljoin('http://www.example.org/main/test/','a.xml?value=basketball')
'http://www.example.org/main/test/a.xml?value=basketball'
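In Python 3 the same function lives in urllib.parse, which the question is already importing from; only the import line differs from the session above:

```python
from urllib.parse import urljoin

# Without a trailing slash, "test" is treated as a document and dropped
# during resolution, so the relative href replaces it.
print(urljoin('http://www.example.org/main/test', 'a.xml?value=basketball'))
# → http://www.example.org/main/a.xml?value=basketball

# With the trailing slash, "test" is treated as a directory and kept.
print(urljoin('http://www.example.org/main/test/', 'a.xml?value=basketball'))
# → http://www.example.org/main/test/a.xml?value=basketball
```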

BTW: this is a perfect use case for factoring the URL-building code out into a separate function. Then write some unit tests to verify it works as expected, including your edge cases, and afterwards use it in your web crawler code.
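A minimal sketch of that refactoring (the name make_absolute_url is my own choice, and the function assumes the parent URL always names a directory, which is only safe if that holds for your crawler's URLs):

```python
from urllib.parse import urljoin

def make_absolute_url(base_url, href):
    """Join href against base_url, ensuring the base ends with a
    trailing slash so its last path segment is kept as a directory."""
    if not base_url.endswith('/'):
        base_url += '/'
    return urljoin(base_url, href)

# unit-test-style checks, including the case from the question
assert make_absolute_url('http://www.example.org/main/test',
                         'a.xml?value=basketball') == \
    'http://www.example.org/main/test/a.xml?value=basketball'
# an already-absolute href passes through unchanged
assert make_absolute_url('http://www.example.org/main/test',
                         'http://other.example.com/b.xml') == \
    'http://other.example.com/b.xml'
```

Note that urljoin already handles absolute hrefs, fragments, and scheme-relative URLs; the helper only normalizes the trailing slash.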


2 Comments

This code works in Python 2.7, but it applies to Python 3.5 as well if you use urllib.parse, as the OP mentioned.
Thank you, sir. It seems to work on the cases I tried. Let me test it completely before accepting this answer.
