How to parse an HTML table in Python

Question

I'm new to parsing tables and regular expressions. Can you help me parse this in Python?

<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>

I need the values "3text" and "6text".

There's a standalone html-table-parser-python3; it works on table 5 of the Wikipedia page Windturbines_in_Nederland, where BeautifulSoup doesn't.
– denis, commented Sep 7, 2022 at 12:42

Humayun Ahmad Rajib · Accepted Answer · 2020-07-22 10:32:39Z

3

You can use the CSS-selector methods select() and select_one() to get "3text" and "6text":

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')  # the 'lxml' parser requires the lxml package
rows = soup.select('tr')

for row in rows:
    # print the text of the second <td> in each row
    print(row.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text

edited Jul 22, 2020 at 10:32

answered Jul 22, 2020 at 9:01

Humayun Ahmad Rajib


Saba · Accepted Answer · 2020-07-22 09:10:16Z

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup

html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''


soup = BeautifulSoup(html_doc, "html.parser")

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    for k in i.find_all("td"):
        # prints all td tags with a text format
        print(k.text)

In this case it prints:

1text 2text
3text 
4text 5text
6text

But you can grab just the text you want by indexing. In this case you could simply use:

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    print(i.find_all("td")[1].text)

Akhilesh_IN · Accepted Answer · 2020-07-22 09:02:51Z

2

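Using pandas (html_table is a string holding the HTML from the question; recent pandas versions expect it wrapped in io.StringIO):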

In [8]: import pandas as pd

In [9]: df =  pd.read_html(html_table)[0]

In [10]: df[1]
Out[10]:
0    3text
1    6text
Name: 1, dtype: object

answered Jul 22, 2020 at 9:02

Akhilesh_IN


Soraphis · Accepted Answer · 2020-07-22 09:08:48Z

You could use Python's html.parser: https://docs.python.org/3/library/html.parser.html

The custom parser class keeps track of a small amount of parsing state. Since you want the second cell of each row, the start of each row resets the cell counter (cell_index), and each cell increments it.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')

outputs:

> python -u "html_parser_test.py"
3text
6text
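
If you need the values rather than printed lines, the same state-tracking idea can collect every cell into a list of rows. This is only a sketch: the class name TableCollector and its rows attribute are made up for illustration, and it assumes a single table with no nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as '\xa0'
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, None outside a row
        self._cell = None  # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # join fragments, turn non-breaking spaces into plain ones, trim
            text = ''.join(self._cell).replace('\xa0', ' ').strip()
            self._row.append(text)
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


html_doc = '''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>'''

collector = TableCollector()
collector.feed(html_doc)
collector.close()

# second cell of each row, skipping rows that have fewer than two cells
second_column = [row[1] for row in collector.rows if len(row) > 1]
print(second_column)
```

Keeping the whole table in rows means you can pick any column afterwards instead of hard-coding the index inside the parser.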

JPI93 · Accepted Answer · 2020-07-22 09:17:26Z

Since your question has the beautifulsoup tag attached, I am going to assume that you are happy using this module to tackle the problem. My solution also uses the built-in unicodedata module to normalize characters produced by HTML entities in the markup (e.g. &nbsp;).

To parse the table so that you have access to the second field from each row (as per your question), see the code and comments below.

from bs4 import BeautifulSoup
import unicodedata

table = '''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>'''

soup = BeautifulSoup(table, 'html.parser')  # Parse the HTML table
tableData = soup.find_all('td')  # Get a list of all <td> tags in the table
# Keep the normalized text of every second <td>; NFKC turns the non-breaking space into a regular space.
# Taking the odd indices works here because every row has exactly two cells.
output = [unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0]
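
To see what the normalization step does in isolation, here is a small standard-library sketch. It assumes the HTML parser has already decoded &nbsp; to the non-breaking space U+00A0 (html.unescape stands in for that step here); the variable names are illustrative only.

```python
import html
import unicodedata

raw = html.unescape('3text&nbsp;')  # the entity decodes to U+00A0
print(repr(raw))

normalized = unicodedata.normalize('NFKC', raw)  # U+00A0 becomes an ordinary space
print(repr(normalized))

cleaned = normalized.strip()  # drop the leftover trailing space
print(repr(cleaned))
```

Note that str.strip() on its own already removes a trailing U+00A0, because Python treats it as whitespace. Normalization matters mostly for non-breaking spaces in the middle of the text, as in 1text&nbsp;2text.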

Saba · Accepted Answer · 2020-07-22 09:20:48Z

1

Try this:

from bs4 import BeautifulSoup

html="""
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>"""

soup = BeautifulSoup(html, 'html.parser')

for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())

edited Jul 22, 2020 at 9:20

Saba


answered Jul 22, 2020 at 9:07

Kaushal Kumar

