Scraping hidden content from a JavaScript webpage with Python

Question

I'm trying to scrape the content from the following website:

https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7

I have previously scraped the content successfully using dryscrape and the following code:

import dryscrape
import webkit_server
from lxml import html

session = dryscrape.Session()
session.set_timeout(20)
session.set_attribute('auto_load_images', False)
session.visit('https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7')
response = session.body()
tree = html.fromstring(response)

print(tree.xpath('(//td[@class="team-name"]/text())[1]'))

The example above prints the home team (in this case, 'France').

It seems the structure of the source has since changed, and I can no longer scrape the content properly.

What confuses me is that I can see the tags in the Firefox Inspector, but they are not present in the response when I pull the source.

I assume they have hidden the content somehow to make it impossible (?) to scrape the data.

Could someone please point me in the right direction on how to scrape the content properly?

Curro · Accepted Answer · 2016-06-09 16:01:24Z


The content you need is loaded via Ajax using jQuery. I don't know whether dryscrape has been updated lately, but the last time I used it, it didn't support Ajax content loaded by jQuery.

Anyway, if you look at Chrome's network inspector, you will see that the main content is loaded from an API. You can call that API directly and get JSON with all the data for the page:

import requests
# Request the JSON endpoint that the page itself calls (found in the network inspector)
data = requests.get('https://mobile.admiral.at/;apiVer=json;api=main;jsonType=object;apiRw=1/en/api/event/get-event?id=15a822ab-84a1-e511-90a2-000c297013a7').json()
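If you need this for more than one event, the same call can be wrapped in a few helpers. This is only a rough sketch: it assumes the event id is always the last part of the `#/event/<id>` fragment in the page URL and that the endpoint above keeps its current form, neither of which is guaranteed. The names `API_TEMPLATE`, `extract_event_id`, `build_api_url` and `fetch_event` are made up for illustration.

```python
from urllib.parse import urlsplit

# Endpoint copied from the request above; the site may change or remove it.
API_TEMPLATE = (
    "https://mobile.admiral.at/;apiVer=json;api=main;jsonType=object;apiRw=1"
    "/en/api/event/get-event?id={event_id}"
)


def extract_event_id(page_url):
    """Return the event id from the '#/event/<id>' fragment of a page URL."""
    fragment = urlsplit(page_url).fragment  # e.g. '/event/15a822ab-...'
    prefix = "/event/"
    if not fragment.startswith(prefix):
        raise ValueError("no event id in URL fragment: %r" % page_url)
    return fragment[len(prefix):]


def build_api_url(event_id):
    """Fill the event id into the (assumed stable) API URL template."""
    return API_TEMPLATE.format(event_id=event_id)


def fetch_event(page_url, timeout=20):
    """Fetch the event JSON for a page URL and return it as Python objects."""
    # Imported here so the URL helpers above work without requests installed.
    import requests

    response = requests.get(build_api_url(extract_event_id(page_url)), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()
```

Calling `fetch_event()` with the page URL from the question should return the same data as the one-liner above. I have not listed the JSON keys here, so print the result and look at its structure before picking out fields such as the team names.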


2 Comments

Trect Over a year ago

I have exactly the same problem with [this website][1]. I can see the entire text via 'Inspect Element', but I cannot extract it with Selenium (Python). Any idea how to overcome this? Thanks in advance. [1]: pib.nic.in/PressReleseDetail.aspx?PRID=1573651

Curro Over a year ago

Exactly like the previous post: if you look at the network inspector, you will see that the URL that loads the content is pib.gov.in/PressReleasePage.aspx?PRID=1573651, which is requested via Ajax.
