
I am trying to automatically download some PDFs from a site (http://bibliotecadigitalhispanica.bne.es) using Python.

I've tried using the urllib/urllib2/mechanize modules (which I have been using for other sites: this includes the standard functions like urlopen, urlretrieve, etc.), but here, the links have JavaScript embedded in their href attributes that does some processing and opens up the PDF, which these modules don't seem to be able to handle, at least from what I have read here. For example, when I do the following:

import mechanize

request = mechanize.Request('the example URL below')
response = mechanize.urlopen(request)

I just get back the containing HTML page; I can't seem to extract the PDF (there are no links to it inside that page, either).

I know by looking through the headers in a real browser (using the LiveHTTPHeaders extension in Firefox) that a lot of HTTP requests are made and eventually the PDF is returned (and displayed in the browser). I would like to be able to intercept this and download it. Concretely, I get a series of 302 and 304 responses, eventually leading to the PDF.
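For reference, this is roughly how I have been trying to follow the redirect chain and recognize the PDF by its Content-Type header (a sketch only; it uses urllib.request, the modern name for urllib2, and the delivery URL would be the one pulled out of the href):

```python
import urllib.request  # urllib2 in older Python


def is_pdf(content_type):
    """Return True if a Content-Type header value denotes a PDF."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


def fetch_final(url):
    """Request a URL; urlopen follows 302 redirects automatically,
    so the response returned is the end of the redirect chain."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urllib.request.urlopen(req)


# Usage (network-dependent, so shown as comments):
#   resp = fetch_final(delivery_url)  # delivery_url extracted from the href
#   if is_pdf(resp.headers.get("Content-Type", "")):
#       open("out.pdf", "wb").write(resp.read())
```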

Here is an example of a link attribute that I am crawling: href='javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");'
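Pulling the inner URL out of such an href is straightforward with a regular expression (a sketch; it assumes every such link wraps the URL in open_window_delivery("...")):

```python
import re


def extract_delivery_url(href):
    """Pull the URL out of a javascript:open_window_delivery("...") href.
    Returns None if the href doesn't match that shape."""
    m = re.search(r'open_window_delivery\("([^"]+)"\)', href)
    return m.group(1) if m else None


href = 'javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess");'
print(extract_delivery_url(href))
# → http://bibliotecadigitalhispanica.bne.es:80/verylonglinktoaccess
```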

It seems that if I could execute the JavaScript embedded in the href attribute, I could eventually reach the PDF document itself. I've tried Selenium, but it is a tad confusing: even after reading its documentation, I'm not quite sure how to use it. Can someone suggest a way to do this (either through a module I haven't tried or through one that I have)?

Thank you very much for any help with this.

P.S.: in case you would like to see what I am trying to replicate, I am trying to access the PDF links mentioned above on the following page (the ones with the PDF icons): http://bibliotecadigitalhispanica.bne.es/R/9424CFL1MDQGLGBB98QSV1HFAD2APYDME4GQKCBSLXFX154L4G-01075?func=collections-result&collection_id=1356

  • Could you use a regular expression to extract the URI? Commented Mar 16, 2012 at 13:25
  • I've tried doing this also, pulling the URI out of the JavaScript function call and then trying to access it with mechanize and urllib2, but no luck so far: it just gives me back the containing HTML page :-/ From viewing the headers, it does seem like a lot of requests are made with this URI, including some redirects. Is there a way to grab all these responses? Perhaps that might also solve the issue. Thank you for the response, by the way. Commented Mar 17, 2012 at 0:14
  • UPDATE: I ended up finding a way around it on this particular site by working out the structure of the URLs closest to the PDF files and then following the redirects from those. Cheers! Commented Mar 17, 2012 at 13:41

1 Answer


javascript:open_window_delivery("http://bibliotecadigitalhispanica.bne.es:80/webclient/DeliveryManager?application=DIGITOOL-3&owner=resourcediscovery&custom_att_2=simple_viewer&forebear_coll=1333&user=GUEST&pds_handle=&pid=1673416&con_lng=SPA&rd_session=http://bibliotecadigitalhispanica.bne.es:80/R/7IUR42HNR5J19AY1Y3QJTL1P9M2AN81RCY4DRFE8JN5T22BI7I-03416");

That URL leads to a 302 page. If you follow it, you end up at a frame page, where the bottom frame is the content page.

http://bibliotecadigitalhispanica.bne.es///exlibris/dtl/d3_1/apache_media/L2V4bGlicmlzL2R0bC9kM18xL2FwYWNoZV9tZWRpYS8xNjczNDE2.pdf

(lib)curl can follow 302 redirects.

JavaScript isn't the obstacle up to that point. You then land at single_viewer_toolbar2.jsp, where the function setLabelMetadataStream puts together the URL for the PDF before submitting it to its iframe, "sendRequestIFrame".

I see three possibilities:

  1. The JavaScript-execution approach: high complexity; you would need to write a lot of code, and the result would probably be brittle.
  2. Something browser-based: Selenium is probably a good fit. I know elinks2 has JavaScript support, and according to its Wikipedia page it can be scripted in "Perl, Ruby, Lua and GNU Guile".
  3. Ask the web administrator for help. You should do this anyway, to understand their policy/attitude towards bots; perhaps they can provide you (and others) with an interface/API.

I recommend learning more about Selenium; it seems the easiest route.
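To get started, a minimal Selenium sketch (assuming a local Firefox and the selenium package; the XPath used to find the links is a guess that will need adjusting against the real page):

```python
def strip_js_scheme(href):
    """Turn a 'javascript:...' href into a statement execute_script can run."""
    prefix = "javascript:"
    return href[len(prefix):] if href.startswith(prefix) else href


def click_pdf_links(start_url):
    """Drive Firefox through the viewer; requires selenium and Firefox."""
    from selenium import webdriver  # imported here so the sketch loads without it

    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        # Locate anchors whose href calls open_window_delivery; the XPath
        # here is an assumption about how the PDF-icon links are marked up.
        for link in driver.find_elements_by_xpath(
                '//a[starts-with(@href, "javascript:open_window_delivery")]'):
            driver.execute_script(strip_js_scheme(link.get_attribute("href")))
            # The PDF then loads in the viewer's bottom frame; switching into
            # it (driver.switch_to_frame(...)) exposes the final .pdf URL.
    finally:
        driver.quit()
```

Calling click_pdf_links with the collections-result URL from the question would then execute each link's embedded JavaScript in a real browser, which is exactly the part urllib2/mechanize cannot do.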


1 Comment

+1 for Selenium, which is probably the most sane (least work) solution. And another +1 for "ask the administrator".
