Extract Dataframe from nested html table structure?

Question

I wonder if there is a way to extract a DataFrame from a nested HTML structure, using something like the pandas read_html() method.

Here is the HTML from which I need to extract all the columns that come after the column "action". Data.html:

<table border="1"><tr><th>Central Repository</th><td><table border="1"><tr><th>Passadena-USA</th><td><table border="1"><tr><th>Fairfax Av.</th><td><table border="1"><tr><th>CMS</th><td><table border="1"><tr><th>action</th><th>address</th><th>machinie_id</th><th>portal</th><th>supplier</th><th>created_by</th><th>date</th><th>portal deficit</th><th>Load Value 1</th><th>Load Value 2</th><th>Load Value 3</th><th>Load Value 4</th><th>Load Value 5</th><th>Sub Load 1</th><th>Sub Load 2</th><th>Sub Load 3</th><th>Sub Load 4</th><th>Sub Load 5</th><th>Coordinates</th><th>Area Code</th><th>pending case id</th><th>project details</th><th>identification number APAC</th><th>site_id</th><th>state</th><th>status</th><th>timestamp</th></tr><tr><td>FP</td><td>1195 Fairfax Avenue </td><td>ZEBA 5841</td><td>NHE-9850</td><td>CMS</td><td>Administrator</td><td>2017/6/19</td><td>687965</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>Relay 4-12 Avery J</td><td>Tonal One B</td><td>2602700</td><td>Tertiary Node</td><td>0</td><td>Volume Sub < 1</td><td>passadena</td><td>PA</td><td>2017/06/19 17:35:56</td></tr></table></td></tr></table></td></tr></table></td></tr></table></td></tr></table>

Here is my Python Code:

import pandas as pd
# the file name must be passed as a string
df = pd.read_html('Data.html')
print(df[3])
# shouldn't index 3 return all the columns that come after "CMS"?

i.e. the columns action, address, machinie_id, portal, ... through timestamp.

RomanPerekhrest · Accepted Answer · 2018-06-24 15:38:55Z


shouldn't the index 3 return all the columns that come after "CMS"

Note that the pd.read_html function returns

dfs : list of DataFrames

so df[3] contains just one of those DataFrames.

To use the table-header cells (<th>action</th><th>address</th><th>machinie_id</th>....) as column names, set the header option to 1 (the row number).

header : int or list-like or None, optional
The row (or list of rows for a MultiIndex) to use to make the columns headers.

Test:

In [21]: df = pd.read_html('data.html', header=1)

In [22]: df[3].columns
Out[22]: 
Index(['action', 'address', 'machinie_id', 'portal', 'supplier', 'created_by',
       'date', 'portal deficit', 'Load Value 1', 'Load Value 2',
       'Load Value 3', 'Load Value 4', 'Load Value 5', 'Sub Load 1',
       'Sub Load 2', 'Sub Load 3', 'Sub Load 4', 'Sub Load 5', 'Coordinates',
       'Area Code', 'pending case id', 'project details',
       'identification number APAC', 'site_id', 'state', 'status', 'timestamp',
       'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30',
       'Unnamed: 31', 'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34',
       'Unnamed: 35', 'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38',
       'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42',
       'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46',
       'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50',
       'Unnamed: 51', 'Unnamed: 52', 'Unnamed: 53', 'Unnamed: 54',
       'Unnamed: 55'],
      dtype='object')

In [23]:
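If the position of the wanted table in the list is not known in advance (for example, when the nesting is deeper), one option is to pick the DataFrame by a column name instead of by index. The following is only a minimal sketch: first_table_with_column is a hypothetical helper, not part of pandas, and the small list of frames is a made-up stand-in for the list that pd.read_html('data.html', header=1) returns.

```python
import pandas as pd


def first_table_with_column(tables, column):
    """Return the first DataFrame in `tables` whose columns include `column`.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    for table in tables:
        if column in table.columns:
            return table
    raise ValueError(f"no table has a column named {column!r}")


# Made-up stand-in for the list returned by
# pd.read_html('data.html', header=1)
tables = [
    pd.DataFrame({"Central Repository": ["x"]}),
    pd.DataFrame({"action": ["FP"], "address": ["1195 Fairfax Avenue"]}),
]

wanted = first_table_with_column(tables, "action")
print(wanted.columns.tolist())
```

With a real page, the list from pd.read_html would be passed in place of tables.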


3 Comments

Carl Over a year ago

I have two questions. First, how would I extract the values from the aforementioned columns, i.e. only the header columns from "action" to "timestamp", discarding all the "Unnamed" ones? Second, would this work if I had a similar nested HTML structure in which the main columns are not in the 3rd DataFrame but deeper, say 6 or 7 levels down? Could I apply a regex to the entire HTML at once to extract the relevant columns and their values?

RomanPerekhrest Over a year ago

Use the .loc method: df[3].loc[:, 'action':'timestamp']. As for regex matching, consider the match option of the pd.read_html function.
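The .loc slice can be tried on a small hand-built frame. This is only a sketch: the frame below is made up to mimic the shape of df[3] (a few named columns followed by "Unnamed" ones), and the startswith filter is an alternative that was not suggested in the thread.

```python
import pandas as pd

# Made-up frame mimicking df[3]: named columns followed by "Unnamed" ones
df = pd.DataFrame(
    [["FP", "1195 Fairfax Avenue", "2017/06/19 17:35:56", None, None]],
    columns=["action", "address", "timestamp", "Unnamed: 27", "Unnamed: 28"],
)

# Label slices with .loc include both endpoints,
# so "timestamp" itself is kept
named = df.loc[:, "action":"timestamp"]

# Alternative: drop every column whose name starts with "Unnamed"
keep = ~df.columns.str.startswith("Unnamed")
named_alt = df.loc[:, keep]

print(named.columns.tolist())
print(named_alt.columns.tolist())
```

The label slice relies on the wanted columns being contiguous, while the filter only relies on the "Unnamed" prefix that pandas gives to header cells without text.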

Carl Over a year ago

Thanks, the solution worked like a charm and helped me a lot.
