
I am using pandas to grab some ice hockey stats from a web page as shown below:

import pandas as pd

url_goal = 'http://www.quanthockey.com/nhl/records/nhl-players-all-time-goals-per-game-leaders.html'
df_goal = pd.read_html(url_goal, index_col=0, header=0)[0]

This works great, but the problem is that switching to the second page of the stats table does not change the URL, so I cannot use the same approach to grab more than the top 50 players. There is a JavaScript address for the table that does change as the page number switches. I have read a little about Selenium and BeautifulSoup, but I don't have them installed, so I would prefer to do this without them if possible. So my question is two-fold:

  1. Is there any way to grab data from the different pages of this JavaScript table using only pandas and the standard Python/SciPy libraries (Anaconda, to be exact)?

  2. If not, how would you go about getting this data into a pandas DataFrame with the help of Selenium or your package of choice?

1 Answer


Hint: Open the network analyzer in your browser and watch what happens when you navigate to different pages; you'll notice a GET request to a page like

http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Records&pos=Players&SS=&af=0&nat=alltime&st=reg&sort=goals-per-game&page=3&league=NHL&lang=en&rnd=451318572

Notice the page part of the query string.

You can just iterate through the range of page numbers, incrementing the page parameter in the query string by one each time, and concatenate the resulting tables, as in the sketch below.
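For example, here is a minimal sketch of that approach. It assumes pd.read_html can parse the fragment returned by AjaxPaginate.php and that five pages are enough; the page range is illustrative, and the rnd parameter is dropped since the table loads fine without it.

import pandas as pd

base_url = ('http://www.quanthockey.com/scripts/AjaxPaginate.php'
            '?cat=Records&pos=Players&SS=&af=0&nat=alltime&st=reg'
            '&sort=goals-per-game&league=NHL&lang=en&page={page}')

frames = []
for page in range(1, 6):  # pages 1 through 5; widen the range for more players
    # each request returns a fragment containing one stats table
    frames.append(pd.read_html(base_url.format(page=page), index_col=0, header=0)[0])

df_goal = pd.concat(frames)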


3 Comments

This works great, thank you! Very useful tip overall; I wasn't aware of the network analyzer. Out of curiosity, do you know the purpose of that last random string of numbers? I am not including it and it works just fine.
Yes, the network analyzer is quite useful; most of the time it can help in coming up with a strategy. Not sure what the rnd parameter is; presumably it serves some purpose or else it wouldn't be there, perhaps some kind of internal record keeping.
Hi @Ryan, I have a link: quanthockey.com/khl/seasons/2020-21-khl-players-stats.html. I'm able to convert it to CSV using pandas, but I have to change the code for every year (2020-21, 2019-20, and so on). Is there any way to get all the available data without changing the year in the URL every time?
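A minimal sketch for that follow-up would be to build the season string in a loop, assuming every season page follows the same URL pattern; the year range here is purely illustrative.

import pandas as pd

seasons = [f'{y}-{str(y + 1)[-2:]}' for y in range(2015, 2021)]  # e.g. '2015-16' ... '2020-21'
frames = []
for season in seasons:
    url = f'http://www.quanthockey.com/khl/seasons/{season}-khl-players-stats.html'
    # read the first table on each season page
    frames.append(pd.read_html(url, index_col=0, header=0)[0])

df_khl = pd.concat(frames, keys=seasons)  # index each block by its season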
