
I'm trying to scrape a table from a dynamic page. With the following code (which requires Selenium), I manage to get the contents of the <table> element.

I'd like to convert this table into a CSV, and I have tried two things, but both fail:

  • pandas.read_html returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
  • soup.find_all('tr') raises 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml)

Here is my code:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
import pandas as pd

main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table = driver.find_element_by_id("table_main")
tablehtml = table.get_attribute('innerHTML')
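
As a side note, separate from the html5lib message: get_attribute('innerHTML') returns only what is inside the <table> tag, not the tag itself, and pandas.read_html only picks up content wrapped in <table>...</table>. A minimal sketch of wrapping the fragment before parsing (the fragment string here is a made-up stand-in for the real innerHTML; using get_attribute('outerHTML') instead would avoid the issue entirely):

```python
import io

import pandas as pd

# innerHTML gives the table's contents without the surrounding <table> tag,
# but pandas.read_html only finds tables wrapped in <table>...</table>.
# 'fragment' is a hypothetical stand-in for the real innerHTML string.
fragment = "<tr><td>Indicators</td><td>Oct 2015</td></tr>"
tables = pd.read_html(io.StringIO("<table>" + fragment + "</table>"))
print(tables[0].shape)
```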
  • What is the output of tablehtml? Commented Nov 10, 2015 at 15:29
  • it's too long to paste. It starts like this: <thead><tr class="tr-title"><th style="text-align:center;"><strong>Indicators</strong><span class="" code="zb"></span></th><th class=""><strong>Oct 2015</strong><span class="" code="201510"></span></th><th class=""><strong>Sep 2015</strong><span class="" code="201509"></span></th><th class=""><strong>Aug 2015</strong><span class="" code="201508"></span></th><th class=""><strong>Jul 2011... You can get the full version by running the code (you may need to pip install selenium) Commented Nov 10, 2015 at 15:31
  • I tested your code with selenium and bs4 and have got no issues. Somehow your soup object is returned as None. Commented Nov 10, 2015 at 16:03
  • You mean that you are able to use find_all() on my soup object? Commented Nov 10, 2015 at 16:08
  • yes. All I can recommend is check 'tablehtml' as well as 'soup' before you call anything on it. Commented Nov 10, 2015 at 16:09

2 Answers


Using the csv module and selenium selectors would probably be more convenient here:

import csv
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
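
A nice property of this approach is that csv.writer handles quoting for you, so cell text containing commas or quotes survives the round trip intact. A small self-contained illustration (the rows here are made-up stand-ins for the text extracted from the table cells):

```python
import csv
import io

# Made-up rows standing in for the text pulled out of the table cells.
rows = [["Indicators", "Oct 2015"],
        ["Output, total", "1,234.5"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
# Fields containing the delimiter are quoted automatically.
print(buf.getvalue())
```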



Without access to the table you're actually trying to scrape, I used this example:

<table>
<thead>
<tr>
    <td>Header1</td>
    <td>Header2</td>
    <td>Header3</td>
</tr>
</thead>  
<tr>
    <td>Row 11</td>
    <td>Row 12</td>
    <td>Row 13</td>
</tr>
<tr>
    <td>Row 21</td>
    <td>Row 22</td>
    <td>Row 23</td>
</tr>
<tr>
    <td>Row 31</td>
    <td>Row 32</td>
    <td>Row 33</td>
</tr>
</table>

and scraped it using:

from bs4 import BeautifulSoup as BS
content = #contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]

This rows object is a list of lists:

[
    [<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
    [<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
    [<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
    [<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]

...and you can write it to a file:

with open('result.csv', 'a') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>','').replace('</td>','') for e in it) + '\n')

which looks like this:

Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
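
The str(e).replace(...) trick works for this toy table, but it breaks as soon as a cell contains a comma, quotes, or nested tags; extracting the cell text and letting the csv module do the escaping is more robust. A stdlib-only sketch of that idea, using html.parser instead of BeautifulSoup (the sample HTML is a shortened version of the toy table above, with one comma-containing cell added):

```python
import csv
import io
from html.parser import HTMLParser

# Minimal stdlib-only extractor: collects the text of each <td> per <tr>.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td, self._cell = True, []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._row.append("".join(self._cell).strip())
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td:
            self._cell.append(data)

html_doc = ("<table><thead><tr><td>Header1</td><td>Header2</td></tr></thead>"
            "<tr><td>Row 11</td><td>Row, 12</td></tr></table>")
parser = TableParser()
parser.feed(html_doc)

buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)  # csv handles the quoting
print(buf.getvalue())
```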

3 Comments

Thanks, but like I mention, BeautifulSoup returns a NoneType in my case and find_all does not work. Also the table can be accessed by simply running the code.
I was giving you a working example; BeautifulSoup does work, all you have to do is pay attention to its methods. Did you even try to run the example code I provided? You called "find_all" and it errored because the method I used is "findAll". Please don't ask people to install selenium and run your script to get the table you need help with. Cheers
I had read that findAll and find_all are exactly the same (stackoverflow.com/questions/12339323/…) but apparently that is the case for "from bs4 import BeautifulSoup", not for "from BeautifulSoup import BeautifulSoup", which is what I used. So you were right, sorry.
