
I'm trying to scrape this website:

http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210

using Python and Selenium (see code below). The content is dynamically generated, and apparently data which is not visible in the browser is not loaded. I have tried making the browser window larger, and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction. The scrolling appears not to work at all.

Does anyone have any bright ideas about how to do this?

Thanks!

import csv
import re
import time

from bs4 import BeautifulSoup  # assuming BeautifulSoup 4 (the bs4 package)
from selenium import webdriver

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(5)  # wait for the page to load

soup = BeautifulSoup(driver.page_source)

table = soup.find("table", {"id": "DataTable"})

### get data
tbody = table.find('tbody')
loop_rows = tbody.findAll('tr')
rows = []
for row in loop_rows:
    # keep the text of every header and data cell in the row
    rows.append([val.text.encode('ascii', 'ignore') for val in row.findAll(re.compile('td|th'))])
with open("body.csv", 'wb') as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)
  • What exact data are you trying to extract from this page? Commented May 23, 2013 at 10:03
  • Is downloading the report data via the actions menu not an option? Commented May 23, 2013 at 10:06
  • @Virendra Rajput: I want Adult (15+) literacy rate (%). Total for all years and countries Commented May 23, 2013 at 10:35
  • @scdove: If you can automate the download via the action menu then you will be my hero of the day, and I would forever be in your debt. :) Commented May 23, 2013 at 10:37

2 Answers

5

This will get you as far as autosaving the entire csv to disk, but I haven't found a robust way to determine when the download is complete:

import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "save to the custom folder given in browser.download.dir"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')

    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'rb') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
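
One possible way to replace the fixed time.sleep(10) is to poll the download directory. The following is only a sketch under assumptions: wait_for_download is a hypothetical helper, not part of Selenium, and it assumes Firefox keeps a ".part" file next to the target while the download is in progress and that the finished file stops growing. It waits until the file exists, is non-empty, has no ".part" sibling, and has kept the same size for a few consecutive checks.

```python
import os
import time


def wait_for_download(path, timeout=60.0, poll=0.5, stable_checks=3):
    """Block until `path` looks like a finished download, then return it.

    Hypothetical helper. Assumes the browser writes `path + ".part"` while
    downloading and that a finished file no longer changes size.
    Raises RuntimeError if the timeout expires first.
    """
    deadline = time.time() + timeout
    last_size = -1
    stable = 0
    while time.time() < deadline:
        if os.path.exists(path) and not os.path.exists(path + ".part"):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                # same non-zero size as on the previous check
                stable += 1
                if stable >= stable_checks:
                    return path
            else:
                # size changed (or file is still empty): start counting again
                stable = 0
                last_size = size
        time.sleep(poll)
    raise RuntimeError("download did not finish within %s seconds: %s" % (timeout, path))
```

In the script above, the call wait_for_download(csvfile) would take the place of time.sleep(10). The size check is a heuristic, so a stalled download could still be mistaken for a finished one.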

Explanation:

Point your browser to url. You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find

"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"

So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with

driver.execute_script("onDownload(2);")

but normally another window will then pop up asking if you want to save the file. To automate the saving-to-disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:

fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

The curl method described in the FAQ does not work here since we do not have a url for the csv file. However, this page describes another way to find the MIME type: Use a Firefox browser to open the save dialog. Check the checkbox saying "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:

  <RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                   NC:alwaysAsk="false"
                   NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
  </RDF:Description>

This tells us the MIME type is "application/x-csv". Bingo, we are in business.
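
Reading the file by eye works, but the lookup can also be scripted. This is a minimal sketch, assuming the handler entries look like the RDF snippet above; handler_mime_types is a hypothetical helper name, and the file location may differ between Firefox versions and platforms.

```python
import re

# Matches the RDF:about attribute of a handler entry and captures the MIME type.
HANDLER_RE = re.compile(r'RDF:about="urn:mimetype:handler:([^"]+)"')


def handler_mime_types(rdf_text):
    """Return the MIME types that have handler entries, in file order.

    Hypothetical helper: it only pattern-matches the text and does not
    parse the RDF properly, so treat the result as a hint.
    """
    found = []
    for mime in HANDLER_RE.findall(rdf_text):
        if mime not in found:
            found.append(mime)
    return found
```

Passing in the contents of mimeTypes.rdf should list the handler MIME types, with the most recently added entry presumably near the end, as described above.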


2 Comments

Nice, that works. Additionally, I've been having trouble with setting Firefox download preferences, which this also solves. Thanks!
Another way to find the MIME type in Chrome is to hit F12, then click on the network tab. Then download the file. Sometimes the server does not publish the correct type, as I just found out when trying to download an .xls file with MIME type 'application/octet-stream'
0

You can do the scrolling by

self.driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()

It seems like once you can scroll, the scraping should be pretty standard, unless I'm missing something.
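
To make the "scroll, then scrape" idea concrete, here is a rough sketch of the loop. Everything in it is an assumption rather than tested against the UNESCO page: collect_rows, read_visible_rows and scroll_down are hypothetical names. read_visible_rows would parse the currently rendered table (for example from driver.page_source), and scroll_down would wrap the click() call above. Rows are deduplicated by their full content, which assumes no two genuine rows are identical.

```python
def collect_rows(read_visible_rows, scroll_down, max_steps=1000):
    """Scroll step by step and gather every distinct row that becomes visible.

    read_visible_rows: callable returning a list of rows (lists of strings).
    scroll_down: callable that scrolls the table by one step.
    Stops when a step reveals no new rows, or after max_steps steps.
    """
    seen = set()
    rows = []
    for _ in range(max_steps):
        new_count = 0
        for row in read_visible_rows():
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                rows.append(list(row))
                new_count += 1
        if new_count == 0:
            # nothing new appeared after the last scroll: assume the end
            break
        scroll_down()
    return rows
```

In practice a short wait after each scroll is probably needed so the page can render the new rows before they are read.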
