Web scraping a table with Python

Question

I am trying to scrape tables from open-access academic articles. For some reason I can't scrape the tables from this article. This is what I have done, but the resulting ResultSet "tables" is an empty list. Thanks for any help.

from bs4 import BeautifulSoup
import requests

url_page = "http://www.sciencedirect.com/science/article/pii/S0378874116301696"

content = requests.get(url_page).content
soup = BeautifulSoup(content, "lxml")
tables = soup.find_all("table")

Sorry, I made an error while copying the code here; I have now edited it to what it should have been.
– user3089520, Commented Aug 26, 2017 at 13:42

Marsu · Accepted Answer · 2017-08-31 00:29:47Z

There is no static <table> tag in the HTML of this page. It is a React-based page, and the tables are created dynamically with JavaScript.
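
A quick way to see the difference is to count <table> start tags in the raw HTML, before any JavaScript has run. The sketch below uses only the standard library and two made-up HTML strings as stand-ins for a server-rendered page and a JavaScript-rendered one. The names TagCounter and count_tags are illustrative and are not part of the original question or of any library.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often a given tag opens in a piece of HTML."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag="table"):
    """Return the number of start tags named `tag` in the static HTML."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-ins: one page ships its table in the HTML, the other
# only ships an empty container that a script would fill in later.
static_html = "<html><body><table><tr><td>1</td></tr></table></body></html>"
dynamic_html = '<html><body><div id="app"></div><script>var data = {};</script></body></html>'

print(count_tags(static_html))
print(count_tags(dynamic_html))
```

If the count is zero for the HTML returned by requests, a parser such as BeautifulSoup has nothing to find, which would explain the empty list in the question.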

Edit: added a script to fetch the data.

To scrape this page, I see two options:

1. As suggested by Håken Lid, use a headless browser or browser-automation tool that can execute JavaScript, such as ghost.py, PhantomJS, HtmlUnit or Selenium.
2. Or skim through the HTML/JavaScript source code, watch the browser's requests and find the data source.

I prefer the second option. The following script prints the content of the article, including the data in its tables:

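# Python 3
import json
import re

import requests


def discard_format(dico):
    # object_hook for json.loads: collapse each decoded object to its payload.
    if "_" in dico:
        return dico["_"]
    elif "$$" in dico:
        return dico["$$"]
    elif "$" in dico:
        return ""
    return dico


url_page = "http://www.sciencedirect.com/science/article/pii/S0378874116301696"
req = requests.get(url_page)
html = req.content.decode("utf-8")
# The page source embeds a token that is passed on to the article-body URL.
match = re.search('"entitledToken":"(.*?)"', html)
if match is None:
    raise SystemExit("entitledToken not found in the page source")
token = match.group(1)
url_data = "http://www.sciencedirect.com/sdfe/arp/pii/S0378874116301696/body?entitledToken=%s" % token
# Reuse the cookies from the first response when requesting the article body.
data = requests.get(url_data, cookies=req.cookies).content.decode("utf-8")
# print(data)  # uncomment to see the raw JSON
jsondata = json.loads(data, object_hook=discard_format)
print(jsondata)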
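
The two less obvious steps of the script, pulling the token out of the page source with a regular expression and flattening the JSON with an object_hook, can be tried offline. This is only a sketch on made-up data: sample_html and sample_body are invented for illustration and are far simpler than what the site returns. The reading of the keys ("_" as text, "$$" as a list of children, "$" as attributes) is an assumption inferred from discard_format, not something the site documents.

```python
import json
import re

# Made-up fragment of page source; only the "entitledToken" key matters here.
sample_html = '<script>window.__DATA__ = {"pii":"S0000","entitledToken":"ABC123"};</script>'

# The non-greedy group (.*?) captures everything up to the next double quote.
match = re.search(r'"entitledToken":"(.*?)"', sample_html)
token = match.group(1) if match else None
print(token)


def discard_format(dico):
    # Same hook as in the script above; json.loads calls it for every
    # decoded object, innermost objects first.
    if "_" in dico:
        return dico["_"]
    elif "$$" in dico:
        return dico["$$"]
    elif "$" in dico:
        return ""
    return dico


# Made-up body in the assumed style: "_" text, "$$" children, "$" attributes.
sample_body = (
    '{"content": ['
    '{"#name": "title", "_": "Results"}, '
    '{"#name": "para", "$$": [{"#name": "text", "_": "Table 1"}]}, '
    '{"$": {"id": "t1"}}'
    "]}"
)

flattened = json.loads(sample_body, object_hook=discard_format)
print(flattened)
```

Because the hook runs on the innermost objects first, each wrapper object is replaced by its text or by the list of its already-flattened children, and attribute-only objects become empty strings.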

edited Aug 31, 2017 at 0:29

answered Aug 26, 2017 at 13:50

Marsu


6 Comments

cs95 Over a year ago

Yes, that is correct, but what is the answer? This is more of a comment than anything else.

Håken Lid Over a year ago

You would need to use something like Selenium to scrape this site. It can't be done with plain HTTP requests, since the articles do not seem to be rendered server-side.

Marsu Over a year ago

Yes @cᴏʟᴅsᴘᴇᴇᴅ, I wanted to write a comment, but unfortunately I do not have enough reputation points to do so.

Marsu Over a year ago

@user3089520 I edited my answer to add a script that scrapes the data without parsing the HTML. Does it get all the data you wanted, or do you really need to parse the full, verbose HTML?

user3089520 Over a year ago

Your code works, but I am still trying to understand it. What is this line for, and why does it search for "entitledToken"? token = re.search('"entitledToken":"(.*?)"', html).group(1)
