
I'm trying to scrape a table from a dynamic page. With the following code (which requires Selenium), I manage to get the contents of the <table> element.

I'd like to convert this table into a CSV, and I have tried two things, but both fail:

  • pandas.read_html returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
  • soup.find_all('tr') raises 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml)

Here is my code:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
import pandas as pd

main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table = driver.find_element_by_id("table_main")
tablehtml = table.get_attribute('innerHTML')
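
As a side note, separate from the html5lib message: get_attribute('innerHTML') returns only what is inside the <table> tag, not the tag itself, and pandas.read_html only picks up content wrapped in <table>...</table>. A minimal sketch of wrapping the fragment before parsing (the fragment string here is a made-up stand-in for the real innerHTML; using get_attribute('outerHTML') instead would avoid the issue entirely):

```python
import io

import pandas as pd

# innerHTML gives the table's contents without the surrounding <table> tag,
# but pandas.read_html only finds tables wrapped in <table>...</table>.
# 'fragment' is a hypothetical stand-in for the real innerHTML string.
fragment = "<tr><td>Indicators</td><td>Oct 2015</td></tr>"
tables = pd.read_html(io.StringIO("<table>" + fragment + "</table>"))
print(tables[0].shape)
```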
  • What is the output of tablehtml? Commented Nov 10, 2015 at 15:29
  • it's too long to paste. It starts like this: <thead><tr class="tr-title"><th style="text-align:center;"><strong>Indicators</strong><span class="" code="zb"></span></th><th class=""><strong>Oct 2015</strong><span class="" code="201510"></span></th><th class=""><strong>Sep 2015</strong><span class="" code="201509"></span></th><th class=""><strong>Aug 2015</strong><span class="" code="201508"></span></th><th class=""><strong>Jul 2011... You can get the full version by running the code (you may need to pip install selenium) Commented Nov 10, 2015 at 15:31
  • I tested your code with selenium and bs4 and have got no issues. Somehow your soup object is returned as None. Commented Nov 10, 2015 at 16:03
  • You mean that you are able to use find_all() on my soup object? Commented Nov 10, 2015 at 16:08
  • yes. All I can recommend is check 'tablehtml' as well as 'soup' before you call anything on it. Commented Nov 10, 2015 at 16:09

2 Answers


Using the csv module and selenium selectors would probably be more convenient here:

import csv
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
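
A nice property of this approach is that csv.writer handles quoting for you, so cell text containing commas or quotes survives the round trip intact. A small self-contained illustration (the rows here are made-up stand-ins for the text extracted from the table cells):

```python
import csv
import io

# Made-up rows standing in for the text pulled out of the table cells.
rows = [["Indicators", "Oct 2015"],
        ["Output, total", "1,234.5"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
# Fields containing the delimiter are quoted automatically.
print(buf.getvalue())
```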



Without access to the table you're actually trying to scrape, I used this example:

<table>
<thead>
<tr>
    <td>Header1</td>
    <td>Header2</td>
    <td>Header3</td>
</tr>
</thead>  
<tr>
    <td>Row 11</td>
    <td>Row 12</td>
    <td>Row 13</td>
</tr>
<tr>
    <td>Row 21</td>
    <td>Row 22</td>
    <td>Row 23</td>
</tr>
<tr>
    <td>Row 31</td>
    <td>Row 32</td>
    <td>Row 33</td>
</tr>
</table>

and scraped it using:

from bs4 import BeautifulSoup as BS
content = #contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]

This rows object is a list of lists:

[
    [<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
    [<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
    [<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
    [<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]

...and you can write it to a file:

with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>','').replace('</td>','') for e in it) + '\n')

which looks like this:

Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
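
The str(e).replace(...) trick works for this toy table, but it breaks as soon as a cell contains a comma, quotes, or nested tags; extracting the cell text and letting the csv module do the escaping is more robust. A stdlib-only sketch of that idea, using html.parser instead of BeautifulSoup (the sample HTML is a shortened version of the toy table above, with one comma-containing cell added):

```python
import csv
import io
from html.parser import HTMLParser

# Minimal stdlib-only extractor: collects the text of each <td> per <tr>.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td, self._cell = True, []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._row.append("".join(self._cell).strip())
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td:
            self._cell.append(data)

html_doc = ("<table><thead><tr><td>Header1</td><td>Header2</td></tr></thead>"
            "<tr><td>Row 11</td><td>Row, 12</td></tr></table>")
parser = TableParser()
parser.feed(html_doc)

buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)  # csv handles the quoting
print(buf.getvalue())
```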

3 Comments

Thanks, but like I mention, BeautifulSoup returns a NoneType in my case and find_all does not work. Also the table can be accessed by simply running the code.
I was giving you a working example; BeautifulSoup does work, all you have to do is pay attention to its methods. Did you even try to run the example code I provided? You called "find_all" and it errored because the method I used is "findAll". Please don't ask people to install selenium and run your script to get the table you need help with. Cheers
I had read that findAll and find_all are exactly the same (stackoverflow.com/questions/12339323/…) but apparently that is the case for "from bs4 import BeautifulSoup", not for "from BeautifulSoup import BeautifulSoup", which is what I used. So you were right, sorry.
