3

I'm new to parsing tables and regular expressions. Can you help me parse this in Python?

<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>

I need "3text" and "6text".

6 Answers

3

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", as shown below:

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text 
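
A note on the trailing whitespace in the result: each of these cells ends with &nbsp;, which is decoded to a non-breaking space (U+00A0), not an ordinary space. The minimal sketch below uses only the standard library's html.unescape as a stand-in for the parser's entity decoding (that substitution is an assumption made for illustration) to show the character and how str.strip() removes it:

```python
from html import unescape

# Raw cell content as it appears in the question's HTML.
raw_cell = "3text&nbsp;"

# Decode the entity, roughly as an HTML parser does for text nodes.
decoded = unescape(raw_cell)
print(repr(decoded))

# U+00A0 counts as whitespace in Python, so str.strip() removes it.
cleaned = decoded.strip()
print(repr(cleaned))
```

With BeautifulSoup, calling .strip() on the .text value should have the same effect.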

3

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup

html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''


soup = BeautifulSoup(html_doc, "html.parser")

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    for k in i.find_all("td"):
        # prints all td tags with a text format
        print(k.text)

In this case it prints:

1text 2text
3text 
4text 5text
6text 

But you can grab just the text you want by indexing. In this case you could simply go with:

# finds all tr tags
for i in soup.find_all("tr"):
    # prints the text of the second td tag in each row
    print(i.find_all("td")[1].text)

2

Using pandas (here html_table is assumed to hold the HTML string from the question):

In [8]: import pandas as pd

In [9]: df =  pd.read_html(html_table)[0]

In [10]: df[1]
Out[10]:
0    3text
1    6text
Name: 1, dtype: object

1

You could use Python's html.parser: https://docs.python.org/3/library/html.parser.html

The custom parser class keeps a small amount of state about the current parse. Since you want the second cell of each row, each new row resets the cell counter (index) and each cell increments it.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')

Output:

> python -u "html_parser_test.py"
3text
6text
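
If you would rather collect the values than print them, the same state-tracking idea can be adapted as in the sketch below. The class name SecondCellParser, the target_index parameter, and the results list are hypothetical names introduced for illustration; they are not part of html.parser.

```python
from html.parser import HTMLParser


class SecondCellParser(HTMLParser):
    """Hypothetical helper: collects the text of one <td> per <tr>."""

    def __init__(self, target_index=1):
        super().__init__()
        self.target_index = target_index  # 1 means the second cell
        self.cell_index = -1
        self.in_cell = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: reset the cell counter.
            self.cell_index = -1
        elif tag == "td":
            self.in_cell = True
            self.cell_index += 1
            if self.cell_index == self.target_index:
                # Start an empty entry for the wanted cell.
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Append text only while inside the wanted cell.
        if self.in_cell and self.cell_index == self.target_index:
            self.results[-1] += data


parser = SecondCellParser()
parser.feed(
    "<table><tr><td>1text&nbsp;2text</td><td>3text&nbsp;</td></tr>"
    "<tr><td>4text&nbsp;5text</td><td>6text&nbsp;</td></tr></table>"
)
parser.close()

# strip() also removes the non-breaking space left by &nbsp;
second_cells = [text.strip() for text in parser.results]
print(second_cells)
```

Rows with fewer cells than target_index + 1 simply contribute nothing to the list.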

1

Since your question has the beautifulsoup tag attached, I am going to assume that you are happy to use this module for the problem. My solution also uses the built-in unicodedata module to normalize characters decoded from HTML entities (e.g. the non-breaking space produced by &nbsp;).

To parse the table so that you have access to the second field of each row (as per your question), see the code and comments below.

from bs4 import BeautifulSoup
import unicodedata

table = '''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>'''

soup = BeautifulSoup(table, 'html.parser') # Parse HTML table 
tableData = soup.find_all('td') # Get list of all <td> tags from table
# Keep every 2nd <td> (odd indices, i.e. the second column of this two-column table)
# and normalize its text so the non-breaking space becomes a regular space
output = [ unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0 ]
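
To see what the normalization step does on its own, here is a small standard-library sketch. The cells list is hand-written sample data that assumes the parser has already decoded &nbsp; to U+00A0; it is not produced by the code above.

```python
import unicodedata

# Cell texts as an HTML parser would return them: &nbsp; decoded to U+00A0.
cells = ["1text\xa02text", "3text\xa0"]

# NFKC replaces the non-breaking space with an ordinary space (U+0020).
normalized = [unicodedata.normalize("NFKC", c) for c in cells]
print([repr(c) for c in normalized])

# The trailing space is now a regular one; strip() removes it if unwanted.
stripped = [c.strip() for c in normalized]
print(stripped)
```

Note that normalization keeps the trailing space, so the values in output above still end with one unless you also strip them.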

1

Try this:

from bs4 import BeautifulSoup

html="""
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>"""

soup = BeautifulSoup(html, 'html.parser')

for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())
