I have a large number of HTML tables, none of which has an id or any other attribute, that I need to export into a friendlier format (CSV or Excel).
The HTML format is as follows:
+---------+--------+------+------+
| Col1 | Col2 | | |
+---------+--------+------+------+
| Header1 | Value1 | | |
| Header2 | Value2 | | |
| Header3 | | | |
|+---------+--------+------+------+
|| ColA | ColB |ColC | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+
+---------+--------+------+------+
| Col1 | Col2 | | |
+---------+--------+------+------+
| Header1 | Value1 | | |
| Header2 | Value2 | | |
| Header3 | Value3 | | |
| Header4 | | | |
|+---------+--------+------+------+
|| ColA | ColB |ColC | ColD |
|+---------+--------+------+------+
|| ValueA | ValueB |ValueC|ValueD|
|| ValueA | ValueB |ValueC|ValueD|
|+---------+--------+------+------+
+---------+--------+------+------+
.
.
.
Using Python and pandas, I can read all the tables with pd.read_html(). However, the nested tables need to be linked to their main tables, and nothing identifies that link other than the fact that they are nested inside the main table, so I see no way to recover it from the result. Also, not every main table has a nested table, so I cannot assume that all odd-numbered tables are main tables and all even-numbered tables are nested ones.
The approach I have thought of is to read the main tables into a list of data frames, transpose each one and assign it a unique identifier, then read the nested table inside each main table into its own data frame and give it the unique identifier of its main table. I can then merge all the main tables into one table and all the nested tables into another. The final output would be two tables like this:
+---------+---------+---------+---------+---------+------------------+
| Header1 | Header2 | Header3 | Header4 | Header5 | UniqueIdentifier |
+---------+---------+---------+---------+---------+------------------+
| Value1 | Value2 | Value3 | Value4 | Value5 | ID1 |
| Value1 | Value2 | Value3 | Value4 | Value5 | ID2 |
| Value1 | Value2 | Value3 | Value4 | Value5 | ID3 |
| Value1 | Value2 | Value3 | Value4 | Value5 | ID4 |
+---------+---------+---------+---------+---------+------------------+
+--------+--------+--------+--------+------------------+
| ColA | ColB | ColC | ColD | UniqueIdentifier |
+--------+--------+--------+--------+------------------+
| ValueA | ValueB | ValueC | ValueD | ID1 |
| ValueA | ValueB | ValueC | ValueD | ID1 |
| ValueA | ValueB | ValueC | ValueD | ID2 |
| ValueA | ValueB | ValueC | ValueD | ID2 |
+--------+--------+--------+--------+------------------+
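The linking step described above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the real data: it uses the standard library's html.parser instead of pd.read_html(), it assumes every table, tr, td and th tag is explicitly closed, and the names NestedTableParser, main_rows, nested_rows and the "ID1", "ID2", ... identifiers are made up for the example. The sample HTML is also invented to mimic the layout shown earlier.

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Collect table rows and tag each with the ID of its top-level table."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <table> nesting depth
        self.main_count = 0     # number of top-level tables seen so far
        self.main_rows = []     # (uid, cells) for rows of top-level tables
        self.nested_rows = []   # (uid, cells) for rows of nested tables
        self._rows = []         # stack of open rows: (depth, cells)
        self._cells = []        # stack of text buffers for open cells

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.main_count += 1  # a new main table starts here
        elif tag == "tr":
            self._rows.append((self.depth, []))
        elif tag in ("td", "th"):
            self._cells.append([])

    def handle_data(self, data):
        # Text always belongs to the innermost open cell.
        if self._cells:
            self._cells[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag in ("td", "th") and self._cells:
            text = "".join(self._cells.pop()).strip()
            if self._rows:
                self._rows[-1][1].append(text)
        elif tag == "tr" and self._rows:
            depth, cells = self._rows.pop()
            if any(cells):  # skip rows that only wrap a nested table
                uid = f"ID{self.main_count}"
                target = self.main_rows if depth == 1 else self.nested_rows
                target.append((uid, cells))


# Invented sample: the first main table has a nested table, the second does not.
html = """
<table>
  <tr><th>Col1</th><th>Col2</th></tr>
  <tr><td>Header1</td><td>Value1</td></tr>
  <tr><td>Header2</td><td>Value2</td></tr>
  <tr><td colspan="2">
    <table>
      <tr><th>ColA</th><th>ColB</th></tr>
      <tr><td>ValueA</td><td>ValueB</td></tr>
    </table>
  </td></tr>
</table>
<table>
  <tr><th>Col1</th><th>Col2</th></tr>
  <tr><td>Header1</td><td>Value1</td></tr>
</table>
"""

parser = NestedTableParser()
parser.feed(html)

for uid, cells in parser.main_rows:
    print("main  ", uid, cells)
for uid, cells in parser.nested_rows:
    print("nested", uid, cells)
```

Because the identifier is derived from the nesting depth while parsing, main tables without a nested table cause no misalignment. The two row lists could then presumably be loaded into pandas data frames and the main rows pivoted so that Header1, Header2, ... become columns, though that part is not shown here.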
Is there a way to read all the main tables into a list while preserving the HTML table structure of their content, so that I can then read the nested tables from each one? Or is there a better approach to solve this?