I am working on a web-crawling project in which I need to find all the links within a given web page. Until now I was using urljoin from urllib.parse, but I have found that some links are not joined properly by it.
For example, an <a> tag might look like <a href="a.xml?value=basketball">A</a>. The complete address should be http://www.example.org/main/test/a.xml?value=basketball, but urljoin gives a wrong result, something like http://www.example.org/a.xml?value=basketball.
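To illustrate (the example.org URLs are placeholders), urljoin resolves the relative href differently depending on the base URL it is given:

from urllib.parse import urljoin

# With the full page URL as the base, the join comes out as expected:
urljoin('http://www.example.org/main/test/a.xml', 'a.xml?value=basketball')
# -> 'http://www.example.org/main/test/a.xml?value=basketball'

# With a bare-domain base the directory path is lost, which matches the bad result above:
urljoin('http://www.example.org', 'a.xml?value=basketball')
# -> 'http://www.example.org/a.xml?value=basketball'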
The code I am using:
from urllib.parse import urljoin
from bs4 import BeautifulSoup

parentUrl = urlQueue.get()
html = get_page_source(parentUrl)
bSoup = BeautifulSoup(html, 'html.parser')
aTags = bSoup.find_all('a', href=True)
for aTag in aTags:
    childUrl = aTag.get('href')
    # rough check for whether the URL is already absolute (handles .com only)
    if '.com' not in childUrl:
        # this urljoin is giving the invalid results mentioned above
        childUrl = urljoin(parentUrl, childUrl)
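One thing I have considered is that a page can declare its own base URL with a <base href> tag, which would change what urljoin should join against. A sketch of what I mean (untested; the helper name resolve_href is mine):

from urllib.parse import urljoin

def resolve_href(pageUrl, soup, href):
    # If the page declares <base href="...">, that overrides pageUrl as the join base.
    baseTag = soup.find('base', href=True)
    base = urljoin(pageUrl, baseTag['href']) if baseTag else pageUrl
    return urljoin(base, href)

In the loop above this would be called as childUrl = resolve_href(parentUrl, bSoup, aTag.get('href')), but I am not sure it covers every case where the join goes wrong.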
Is there any way to correctly join two URLs that handles cases like these?