I am trying to scrape tables from open-access academic articles, but for some reason I can't scrape the tables from this article. This is what I have tried; the resulting ResultSet "tables" is an empty list. Thanks for any help.

from bs4 import BeautifulSoup
import requests

url_page = "http://www.sciencedirect.com/science/article/pii/S0378874116301696"

content = requests.get(url_page).content
soup = BeautifulSoup(content, "lxml")
tables = soup.find_all("table")
  • What is html? Commented Aug 26, 2017 at 13:37
  • Sorry, I made an error while copying the code here; I have now edited it to what it should have been. Commented Aug 26, 2017 at 13:42

1 Answer

There is no static <table> tag in the HTML of this page. It is a React-based page, and the tables are created dynamically with JavaScript, so the markup that requests downloads contains no tables for BeautifulSoup to find.
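
One quick way to confirm this is to count <table> start tags in the raw HTML, before any JavaScript has run. The sketch below is only an illustration: it uses the standard library's html.parser, the TagCounter and count_tags names are hypothetical, and the two HTML strings are made-up stand-ins rather than the actual page source.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    parser = TagCounter(tag)
    parser.feed(html)
    return parser.count


# Made-up examples: a server-rendered page and a JavaScript-rendered shell.
static_html = "<html><body><table><tr><td>1</td></tr></table></body></html>"
dynamic_html = '<html><body><div id="root"></div><script>var rows = [];</script></body></html>'

print(count_tags(static_html, "table"))
print(count_tags(dynamic_html, "table"))
```

If the count is zero for the HTML returned by requests.get, soup.find_all("table") has nothing to return, whatever parser is used.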


Edit: Adding a script to fetch data

To scrape this page, I see two options:

  • As suggested by Håken Lid, you can use a headless browser or browser-automation tool that executes JavaScript, such as ghost.py, PhantomJS, HtmlUnit, or Selenium.
  • Or you can skim through the HTML/JavaScript source code, watch the requests the browser makes, and find the data source.

I prefer the second one; this script prints the content of the page, including data in tables:

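# Python 3
import json
import re
import requests

def discard_format(dico):
    # object_hook for json.loads: replace each JSON object by its payload.
    if "_" in dico:
        return dico["_"]     # keep only the value stored under "_"
    elif "$$" in dico:
        return dico["$$"]    # otherwise keep the value stored under "$$"
    elif "$" in dico:
        return ""            # an object with only "$" is dropped
    return dico

url_page = "http://www.sciencedirect.com/science/article/pii/S0378874116301696"
req = requests.get(url_page)
html = req.content.decode("utf-8")

# The page source embeds an "entitledToken" value that the body request needs.
match = re.search('"entitledToken":"(.*?)"', html)
if match is None:
    raise SystemExit("entitledToken not found in the page source")
token = match.group(1)

# Fetch the article body as JSON, reusing the cookies from the first response.
url_data = "http://www.sciencedirect.com/sdfe/arp/pii/S0378874116301696/body?entitledToken=%s" % token
data = requests.get(url_data, cookies=req.cookies).content.decode("utf-8")
# print(data)  # uncomment to inspect the raw JSON
jsondata = json.loads(data, object_hook=discard_format)
print(jsondata)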
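
To see what the discard_format hook does, it can be run on a small hand-written JSON string. The sample below is hypothetical: it only mimics the "_", "$$" and "$" keys that the function checks for, and the real response is much larger and may be laid out differently.

```python
import json


def discard_format(dico):
    # Same hook as in the script above.
    if "_" in dico:
        return dico["_"]
    elif "$$" in dico:
        return dico["$$"]
    elif "$" in dico:
        return ""
    return dico


# Hypothetical sample, not taken from the actual response.
sample = """
{
  "#name": "table",
  "$": {"id": "t1"},
  "$$": [
    {"#name": "cell", "_": "Species"},
    {"#name": "cell", "_": "Dose"},
    {"#name": "empty", "$": {"id": "c3"}}
  ]
}
"""

# json.loads calls the hook on the innermost objects first, then on the
# objects that contain them, so the nested wrappers collapse bottom-up.
result = json.loads(sample, object_hook=discard_format)
print(result)
```

Because the hook runs on inner objects before outer ones, each wrapper is reduced to its payload and only the nested text values are left.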

6 Comments

Yes that is correct but what is the answer? This is more of a comment than much else.
You would need to use something like Selenium to scrape this site. It can't be done with a plain HTTP request, since the articles do not seem to be rendered server side.
Yes @cᴏʟᴅsᴘᴇᴇᴅ, I wanted to write a comment, but unfortunately I do not have enough reputation points to do it.
@user3089520 I edited my answer to add a script that scrapes the data without parsing HTML. Do you get all the data you wanted, or do you really need to parse the full verbose HTML?
Your code works, but I am still trying to understand it. What is this line for? Why is it searching for "entitledToken"? token = re.search('"entitledToken":"(.*?)"', html).group(1)
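
Regarding the line asked about in the last comment: in the answer's script, the value captured by the regular expression is inserted as the entitledToken query parameter of the second request. The sketch below shows what the pattern does on a made-up snippet. The sample string and the extract_token helper are hypothetical and not taken from the page.

```python
import re


def extract_token(html):
    """Return the value following "entitledToken": in html, or None if absent."""
    # (.*?) is non-greedy: it stops at the first closing double quote.
    match = re.search('"entitledToken":"(.*?)"', html)
    return match.group(1) if match else None


# Made-up stand-in for the JSON state embedded in the page source.
sample_html = '<script>{"pii":"S0378874116301696","entitledToken":"ABC123DEF","other":"x"}</script>'

print(extract_token(sample_html))
print(extract_token("<html></html>"))
```

Checking for None avoids the AttributeError that .group(1) would raise if the page no longer contained the token.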