Read a list of Nested HTML Tables in Python

Question

I have a huge list of HTML tables without any id or attributes that I need to export into a friendly format (CSVs/Excel).

The HTML format is as follows:

+---------+--------+------+------+
|  Col1   |  Col2  |      |      |
+---------+--------+------+------+
| Header1 | Value1 |      |      |
| Header2 | Value2 |      |      |
| Header3 |        |      |      |
|+---------+--------+------+------+
|| ColA   | ColB   |ColC  | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+

+---------+--------+------+------+
|  Col1   |  Col2  |      |      |
+---------+--------+------+------+
| Header1 | Value1 |      |      |
| Header2 | Value2 |      |      |
| Header3 | Value3 |      |      |
| Header4 |        |      |      |
|+---------+--------+------+------+
|| ColA   | ColB   |ColC  | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+

.
.
.

Using Python and pandas, I can read all the tables with pd.read_html(). However, each nested table needs to be linked to its main table, and nothing identifies that link other than the nesting itself, so I cannot find a way to associate them. Also, not every main table has a nested table, so I cannot assume that odd-numbered tables are main tables and even-numbered ones are nested.

The approach I have thought of is to read the main tables into a list of data frames, transpose them, and assign each a unique identifier. Then I would read the nested table inside each main table, one by one, into its own data frame and give it the unique identifier of its main table. Finally, I can merge all the main tables into one table and all the nested tables into another. The final output will be two tables like this:

+---------+---------+---------+---------+---------+------------------+
| Header1 | Header2 | Header3 | Header4 | Header5 | UniqueIdentifier |
+---------+---------+---------+---------+---------+------------------+
| Value1  | Value2  | Value3  | Value4  | Value5  | ID1              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID2              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID3              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID4              |
+---------+---------+---------+---------+---------+------------------+

+--------+--------+--------+--------+------------------+
|  ColA  |  ColB  |  ColC  |  ColD  | UniqueIdentifier |
+--------+--------+--------+--------+------------------+
| ValueA | ValueB | ValueC | ValueD | ID1              |
| ValueA | ValueB | ValueC | ValueD | ID1              |
| ValueA | ValueB | ValueC | ValueD | ID2              |
| ValueA | ValueB | ValueC | ValueD | ID2              |
+--------+--------+--------+--------+------------------+
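As a minimal sketch of the reshaping step only (not the HTML parsing), the key/value layout of each main table can be turned into one wide row tagged with an identifier. The sample frames, the variable names, and the "ID1", "ID2" labels are assumptions made for illustration; building a dict per table is used here as a simple stand-in for the transpose.

```python
import pandas as pd

# Hypothetical key/value tables standing in for the parsed main tables.
main_tables = [
    pd.DataFrame({"Col1": ["Header1", "Header2"],
                  "Col2": ["Value1", "Value2"]}),
    pd.DataFrame({"Col1": ["Header1", "Header2", "Header3"],
                  "Col2": ["Value1", "Value2", "Value3"]}),
]

wide_rows = []
for table_id, table in enumerate(main_tables, start=1):
    # Each Col1 entry becomes a column name and each Col2 entry its value.
    row = {"UniqueIdentifier": f"ID{table_id}"}
    row.update(zip(table["Col1"], table["Col2"]))
    wide_rows.append(row)

# Tables with fewer headers get NaN in the columns they lack.
main = pd.DataFrame(wide_rows)
print(main)
```

Nested tables could be tagged with the same identifier before being concatenated, which gives the link between the two output tables.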

Is there a way to read all the main tables into a list while preserving the HTML table structure of their content, so that I can then read the nested ones? Or is there a better approach to solve this?

You may be better off doing the table parsing by hand with beautifulsoup if you have deeply nested datasets
– 2e0byo, Commented Sep 24, 2021 at 22:43

Saad Farooq · Accepted Answer · 2021-10-23 18:55:20Z

Based on the comment from @2e0byo, I was able to achieve it with a combination of beautifulsoup and pandas. Below is the solution:

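from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

with open("myFile.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')

# Top-level tables only; recursive=False assumes the <table> tags sit at the
# root of the file rather than inside <html>/<body>.
lst = soup.find_all('table', recursive=False)

cleanMainTables = []
cleanSubTables = []

for i, table in enumerate(lst):
    # Parse this main table together with anything nested in it. The code
    # relies on the main table coming first in the returned list.
    # StringIO is used because newer pandas versions deprecate literal HTML strings.
    tables = pd.read_html(StringIO(str(table)))
    tmp = tables[0].transpose().copy()
    # Some other transformations, changing column headers etc.
    # Promote the first row to column headers and drop it from the data
    tmp, tmp.columns = tmp[1:], tmp.iloc[0]
    tmp.loc[:, 'ID'] = i
    cleanMainTables.append(tmp)
    # if the sub-table exists
    if len(tables) > 1:
        tmpSubTable = tables[1].copy()
        tmpSubTable, tmpSubTable.columns = tmpSubTable[1:], tmpSubTable.iloc[0]
        # Tag the nested rows with the ID of their main table
        tmpSubTable.loc[:, 'ID'] = i
        cleanSubTables.append(tmpSubTable)

df = pd.concat(cleanMainTables)
df2 = pd.concat(cleanSubTables)
df.to_excel("cleanMain.xlsx")
df2.to_excel("cleanSub.xlsx")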
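
For files where the tables are wrapped in <html>/<body>, or where the nested cells should not leak into the main table, a variant that parses the cells by hand with BeautifulSoup might look like the sketch below. It is not the code above, only an illustration under stated assumptions: the inline HTML sample, the helper names cell_rows and split_tables, a single level of nesting, and main tables whose rows are key/value pairs in their first two cells are all hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample standing in for myFile.html.
HTML = """
<html><body>
<table>
  <tr><td>Col1</td><td>Col2</td></tr>
  <tr><td>Header1</td><td>Value1</td></tr>
  <tr><td>Header2</td><td>Value2</td></tr>
  <tr><td>Header3</td><td>
    <table>
      <tr><td>ColA</td><td>ColB</td></tr>
      <tr><td>A1</td><td>B1</td></tr>
      <tr><td>A2</td><td>B2</td></tr>
    </table>
  </td></tr>
</table>
<table>
  <tr><td>Col1</td><td>Col2</td></tr>
  <tr><td>Header1</td><td>Value9</td></tr>
</table>
</body></html>
"""

def cell_rows(table):
    """Return the text of every cell, row by row, for one table tag."""
    return [[cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")]

def split_tables(html):
    """Build one frame of main tables and one of nested tables, linked by ID."""
    soup = BeautifulSoup(html, "html.parser")
    main_records, sub_records = [], []
    # A main table is any <table> that has no <table> ancestor.
    top_level = [t for t in soup.find_all("table")
                 if t.find_parent("table") is None]
    for table_id, table in enumerate(top_level):
        # Detach directly nested tables so their cells stay out of the main table.
        nested = [t for t in table.find_all("table")
                  if t.find_parent("table") is table]
        for sub in nested:
            sub.extract()
        # Skip the Col1/Col2 header row; the rest are key/value pairs.
        record = {row[0]: row[1] for row in cell_rows(table)[1:] if len(row) >= 2}
        record["ID"] = table_id
        main_records.append(record)
        for sub in nested:
            rows = cell_rows(sub)
            if not rows:
                continue
            header, body = rows[0], rows[1:]
            for row in body:
                sub_records.append({**dict(zip(header, row)), "ID": table_id})
    return pd.DataFrame(main_records), pd.DataFrame(sub_records)

main_df, sub_df = split_tables(HTML)
print(main_df)
print(sub_df)
```

Because nested tables are detached before the main table is read, a main table without a nested table simply contributes no rows to the second frame, and the shared ID column keeps the link between the two.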

edited Oct 23, 2021 at 18:55

answered Sep 25, 2021 at 13:18


2e0byo Over a year ago

Nice! Though I suspect beautifulsoup rather than soap. I rather like the latter idea, though
