1

I am trying to use some Python web crawler to download about 3000 PDFs from a website. However, the URLs of those PDFs are generated by JavaScript function. So, I am wondering if there is any tutorial on how to achieve this?

For example, the URL linked to Alberto European Hairspray (Aerosol) - All Variants will be generated after clicking onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$0&#39. So the question is how to let the web crawler to get the computed URL.

function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
<tbody>
    <tr>
        <td>
            <input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack(&#39;ctl00$placeBody$gridView$gridView&#39;,&#39;DocumentCenter.aspx?did={0}$0&#39;);return false;" />
        </td>
        <td><a href="javascript:__doPostBack(&#39;ctl00$placeBody$gridView$gridView&#39;,&#39;MSDSDetail.aspx?did={0}$0&#39;)">Alberto European Hairspray (Aerosol) - All Variants</a>
        </td>
        <td>Unilever PLC</td>
        <td>8131-01</td>
    </tr>
    <tr class="row-alternate">
        <td>
            <input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack(&#39;ctl00$placeBody$gridView$gridView&#39;,&#39;DocumentCenter.aspx?did={0}$1&#39;);return false;" />
        </td>
        <td><a href="javascript:__doPostBack(&#39;ctl00$placeBody$gridView$gridView&#39;,&#39;MSDSDetail.aspx?did={0}$1&#39;)">Alberto European Mousse (Aerosol) - All Variants</a>
        </td>
        <td>Unilever PLC</td>
        <td>8132-01</td>
    </tr>
</tbody>

2 Answers 2

1

You can't. Use a JavaScript interpreter (SpiderMonkey, for example) to execute the code and then go ahead with HTML parsing. Using Qt's WebKit is a good approach also, but probably slower.

Sign up to request clarification or add additional context in comments.

1 Comment

@tao.hong You're welcome. I had the same problem and I know it's somewhat disappointing :P
1

Another option is that you might use Selenium to execute js and get computed urls.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.