
I have a huge list of HTML tables without any id or attributes that I need to export into a friendly format (CSVs/Excel).

The HTML format is as follows:

+---------+--------+------+------+
|  Col1   |  Col2  |      |      |
+---------+--------+------+------+
| Header1 | Value1 |      |      |
| Header2 | Value2 |      |      |
| Header3 |        |      |      |
|+---------+--------+------+------+
|| ColA   | ColB   |ColC  | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+

+---------+--------+------+------+
|  Col1   |  Col2  |      |      |
+---------+--------+------+------+
| Header1 | Value1 |      |      |
| Header2 | Value2 |      |      |
| Header3 | Value3 |      |      |
| Header4 |        |      |      |
|+---------+--------+------+------+
|| ColA   | ColB   |ColC  | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+

.
.
.

Using Python and pandas, I can read all the tables with pd.read_html(). However, the nested tables need to be linked to their main tables, and nothing identifies that link other than the nesting itself, so I can find no way to associate them. Also, not every main table has a nested table, so I cannot assume that odd-numbered tables are main tables and even-numbered tables are nested ones.
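
To illustrate why the nesting itself can serve as the link, here is a minimal sketch using BeautifulSoup. The sample markup and the names SAMPLE_HTML, main_tables and nested_counts are assumptions made for illustration; the real file may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first main table contains a nested table,
# the second main table does not.
SAMPLE_HTML = """
<table>
  <tr><td>Header1</td><td>Value1</td></tr>
  <tr><td colspan="2">
    <table>
      <tr><td>ColA</td><td>ColB</td></tr>
      <tr><td>ValueA</td><td>ValueB</td></tr>
    </table>
  </td></tr>
</table>
<table>
  <tr><td>Header1</td><td>Value1</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A main table is any <table> that has no <table> ancestor.
main_tables = [t for t in soup.find_all("table") if t.find_parent("table") is None]

# Tables found inside a main table are its nested tables, so the position of
# the main table in this list can act as the shared identifier.
nested_counts = [len(t.find_all("table")) for t in main_tables]
print(nested_counts)
```

Because the check is based on ancestry rather than position, it does not matter whether a given main table has zero, one, or several nested tables.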

The approach I have thought of is to read the main tables into a list of DataFrames, transpose them, and assign each one a unique identifier. Then I would read the nested tables inside each main table one by one, load each into a DataFrame, and give it the unique identifier of its main table. Finally, I can merge all the main tables into one table and all the nested tables into another. The final output would be two tables like this:

+---------+---------+---------+---------+---------+------------------+
| Header1 | Header2 | Header3 | Header4 | Header5 | UniqueIdentifier |
+---------+---------+---------+---------+---------+------------------+
| Value1  | Value2  | Value3  | Value4  | Value5  | ID1              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID2              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID3              |
| Value1  | Value2  | Value3  | Value4  | Value5  | ID4              |
+---------+---------+---------+---------+---------+------------------+

+--------+--------+--------+--------+------------------+
|  ColA  |  ColB  |  ColC  |  ColD  | UniqueIdentifier |
+--------+--------+--------+--------+------------------+
| ValueA | ValueB | ValueC | ValueD | ID1              |
| ValueA | ValueB | ValueC | ValueD | ID1              |
| ValueA | ValueB | ValueC | ValueD | ID2              |
| ValueA | ValueB | ValueC | ValueD | ID2              |
+--------+--------+--------+--------+------------------+
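
The reshaping described above (transpose, tag with an identifier, concatenate) can be sketched with pandas alone. The input frames below are hypothetical stand-ins for what pd.read_html might return, and promote_first_row is a helper name invented for this example.

```python
import pandas as pd

# Hypothetical stand-ins for parsed main tables: column 0 holds the field
# names and column 1 the values. The second table has an extra field.
raw_main_tables = [
    pd.DataFrame([["Header1", "Value1"], ["Header2", "Value2"]]),
    pd.DataFrame([["Header1", "Value1"], ["Header2", "Value2"], ["Header3", "Value3"]]),
]
# Hypothetical nested tables keyed by the index of their main table; the first
# row holds the column names. Only main table 0 has a nested table here.
raw_sub_tables = {
    0: pd.DataFrame([["ColA", "ColB"], ["ValueA", "ValueB"], ["ValueA", "ValueB"]]),
}


def promote_first_row(frame):
    """Use the first row as column names and drop it from the data."""
    out = frame.iloc[1:].copy()
    out.columns = list(frame.iloc[0])
    return out.reset_index(drop=True)


main_rows, sub_rows = [], []
for table_id, raw in enumerate(raw_main_tables):
    # Transposing turns the name/value rows into a single wide row.
    wide = promote_first_row(raw.transpose())
    wide["UniqueIdentifier"] = table_id
    main_rows.append(wide)
    if table_id in raw_sub_tables:
        sub = promote_first_row(raw_sub_tables[table_id])
        sub["UniqueIdentifier"] = table_id
        sub_rows.append(sub)

# concat aligns columns by name, so a field missing from one table becomes NaN.
main_df = pd.concat(main_rows, ignore_index=True)
sub_df = pd.concat(sub_rows, ignore_index=True)
```

Since the main tables do not all have the same set of headers, aligning by column name during concatenation avoids having to pad the shorter tables by hand.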

Is there a way to read all the main tables into a list while preserving the HTML table structure of their content, so that I can then read the nested tables from each one? Or is there a better approach to solve this?

You may be better off doing the table parsing by hand with beautifulsoup if you have deeply nested datasets. Commented Sep 24, 2021 at 22:43

1 Answer

Based on the comment from @2e0byo, I was able to solve this with a combination of BeautifulSoup and pandas. Below is the solution:

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

with open("myFile.html", "r") as f:
    soup = BeautifulSoup(f, "html.parser")

# recursive=False keeps only tables that are direct children of the document,
# i.e. the main tables; this assumes the file has no <html>/<body> wrapper.
lst = soup.find_all("table", recursive=False)

cleanMainTables = []
cleanSubTables = []

for i, table in enumerate(lst):
    # read_html parses the main table and any table nested inside it;
    # the main table is expected first, the nested table (if any) second.
    tables = pd.read_html(StringIO(str(table)))

    tmp = tables[0].transpose()
    # Some other transformations, changing column headers etc.
    header = tmp.iloc[0]
    tmp = tmp.iloc[1:].copy()
    tmp.columns = header
    tmp["ID"] = i
    cleanMainTables.append(tmp)

    # if the sub-table exists
    if len(tables) > 1:
        tmpSubTable = tables[1]
        subHeader = tmpSubTable.iloc[0]
        tmpSubTable = tmpSubTable.iloc[1:].copy()
        tmpSubTable.columns = subHeader
        tmpSubTable["ID"] = i
        cleanSubTables.append(tmpSubTable)

df = pd.concat(cleanMainTables)
df2 = pd.concat(cleanSubTables)
df.to_excel("cleanMain.xlsx")
df2.to_excel("cleanSub.xlsx")
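
As a variation, the cell text can also be pulled out by hand with BeautifulSoup alone, which keeps the nested table's text out of the main table's rows. This is only a sketch under assumptions: the sample markup is hypothetical, own_rows is a helper name invented here, and the nested table is assumed to sit in a wrapper row of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with an <html>/<body> wrapper; only the first main
# table contains a nested table.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>Header1</td><td>Value1</td></tr>
  <tr><td>Header2</td><td>Value2</td></tr>
  <tr><td colspan="2">
    <table>
      <tr><td>ColA</td><td>ColB</td></tr>
      <tr><td>ValueA</td><td>ValueB</td></tr>
    </table>
  </td></tr>
</table>
<table>
  <tr><td>Header1</td><td>Value1</td></tr>
</table>
</body></html>
"""


def own_rows(table):
    """Return cell text for rows that belong to `table` itself."""
    rows = []
    for tr in table.find_all("tr"):
        if tr.find_parent("table") is not table:
            continue  # this row belongs to a nested table
        cells = tr.find_all(["td", "th"], recursive=False)
        if any(cell.find("table") is not None for cell in cells):
            continue  # wrapper row that only holds a nested table
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
top_level = [t for t in soup.find_all("table") if t.find_parent("table") is None]

main_records, sub_records = [], []
for table_id, table in enumerate(top_level):
    # Main table rows are name/value pairs; a missing value becomes "".
    record = {row[0]: (row[1] if len(row) > 1 else "") for row in own_rows(table) if row}
    record["ID"] = table_id
    main_records.append(record)
    for nested in table.find_all("table"):
        rows = own_rows(nested)
        if not rows:
            continue
        header, *body = rows  # first row of a nested table holds column names
        for values in body:
            sub_records.append({**dict(zip(header, values)), "ID": table_id})
```

Both lists of dictionaries can be passed to pd.DataFrame and exported from there. Selecting main tables by ancestry instead of recursive=False means the sketch also works when the tables are wrapped in <html> and <body>.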

1 Comment

Nice! Though I suspect beautifulsoup rather than soap. I rather like the latter idea, though
