
Is there a way to extract a DataFrame from a nested HTML structure, using something like the pandas read_html() function?

Here is the HTML from which I need to extract all the columns from "action" onward. Data.html:

<table border="1"><tr><th>Central Repository</th><td><table border="1"><tr><th>Passadena-USA</th><td><table border="1"><tr><th>Fairfax Av.</th><td><table border="1"><tr><th>CMS</th><td><table border="1"><tr><th>action</th><th>address</th><th>machinie_id</th><th>portal</th><th>supplier</th><th>created_by</th><th>date</th><th>portal deficit</th><th>Load Value 1</th><th>Load Value 2</th><th>Load Value 3</th><th>Load Value 4</th><th>Load Value 5</th><th>Sub Load 1</th><th>Sub Load 2</th><th>Sub Load 3</th><th>Sub Load 4</th><th>Sub Load 5</th><th>Coordinates</th><th>Area Code</th><th>pending case id</th><th>project details</th><th>identification number APAC</th><th>site_id</th><th>state</th><th>status</th><th>timestamp</th></tr><tr><td>FP</td><td>1195 Fairfax Avenue </td><td>ZEBA 5841</td><td>NHE-9850</td><td>CMS</td><td>Administrator</td><td>2017/6/19</td><td>687965</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>Relay 4-12 Avery J</td><td>Tonal One B</td><td>2602700</td><td>Tertiary Node</td><td>0</td><td>Volume Sub < 1</td><td>passadena</td><td>PA</td><td>2017/06/19 17:35:56</td></tr></table></td></tr></table></td></tr></table></td></tr></table></td></tr></table>

Here is my Python Code:

import pandas as pd
df = pd.read_html('Data.html')  # the file name must be passed as a string
print(df[3])
# shouldn't the index 3 return all the columns that come after "CMS"?

i.e. the columns action, address, machinie_id, portal, and so on through timestamp.

1 Answer

shouldn't the index 3 return all the columns that come after "CMS"

Note that the pd.read_html function returns

dfs : list of DataFrames

so df[3] is just one of those DataFrames: the fourth element of the list.
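
A minimal sketch of that return type, assuming a parser backend such as lxml (or bs4 with html5lib) is installed. The two-table snippet is made up for illustration and is much simpler than the nested Data.html from the question:

```python
from io import StringIO

import pandas as pd

# Hypothetical document with two flat (non-nested) tables.
html = """
<table><tr><th>site</th></tr><tr><td>Fairfax Av.</td></tr></table>
<table>
  <tr><th>action</th><th>address</th></tr>
  <tr><td>FP</td><td>1195 Fairfax Avenue</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Indexing the list picks out a single DataFrame.
second = tables[1]
print(second)
```

With nested tables, the position of the table you want in the list depends on how the parser walks the document, so it is worth printing each element before relying on a fixed index.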


To use the table-header cells (<th>action</th><th>address</th><th>machinie_id</th>....) as column names, set the header option to 1 (the row number).

header : int or list-like or None, optional
The row (or list of rows for a MultiIndex) to use to make the columns headers.
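
As a rough illustration of the header option, here is a sketch on a made-up table whose row 0 is a title row and whose row 1 holds the real column names. The table contents and the variable name with_header are assumptions for the example, and a parser backend such as lxml is assumed to be installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical table: row 0 is a title row, row 1 holds the column names.
html = """
<table>
  <tr><td>CMS</td><td>report</td></tr>
  <tr><td>action</td><td>address</td></tr>
  <tr><td>FP</td><td>1195 Fairfax Avenue</td></tr>
</table>
"""

# header=1 tells pandas to take the column names from row 1;
# the row above it is discarded and the rows below become data.
with_header = pd.read_html(StringIO(html), header=1)[0]
print(list(with_header.columns))
print(with_header)
```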

Test:

In [21]: df = pd.read_html('data.html', header=1)

In [22]: df[3].columns
Out[22]: 
Index(['action', 'address', 'machinie_id', 'portal', 'supplier', 'created_by',
       'date', 'portal deficit', 'Load Value 1', 'Load Value 2',
       'Load Value 3', 'Load Value 4', 'Load Value 5', 'Sub Load 1',
       'Sub Load 2', 'Sub Load 3', 'Sub Load 4', 'Sub Load 5', 'Coordinates',
       'Area Code', 'pending case id', 'project details',
       'identification number APAC', 'site_id', 'state', 'status', 'timestamp',
       'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30',
       'Unnamed: 31', 'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34',
       'Unnamed: 35', 'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38',
       'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42',
       'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46',
       'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50',
       'Unnamed: 51', 'Unnamed: 52', 'Unnamed: 53', 'Unnamed: 54',
       'Unnamed: 55'],
      dtype='object')


3 Comments

I have two questions. First, how can I extract the values from only the header columns "action" through "timestamp", discarding all the "Unnamed" ones? Second, would this work for a similar nested HTML structure where the main columns are not in the 3rd DataFrame but further down, say 6 or 7 levels deep? Could I run a regex over the entire HTML at once to extract the relevant columns and their values?
Use the .loc method: df[3].loc[:, 'action':'timestamp']. As for regex matching, consider the match option of the pd.read_html function.
Thanks, the solution worked like a charm and helped me a lot.
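
The label slice suggested in the comments can be sketched on a toy frame. The frame below is a hand-built stand-in for df[3] with only a few of the real columns, and the names named and also_named are made up for the example:

```python
import pandas as pd

# Toy stand-in for df[3]: named columns followed by empty "Unnamed" ones.
frame = pd.DataFrame(
    [["FP", "1195 Fairfax Avenue", "2017/06/19 17:35:56", None, None]],
    columns=["action", "address", "timestamp", "Unnamed: 3", "Unnamed: 4"],
)

# Label slices with .loc include both endpoints.
named = frame.loc[:, "action":"timestamp"]
print(list(named.columns))  # ['action', 'address', 'timestamp']

# Alternative: drop every column whose name starts with "Unnamed".
also_named = frame.loc[:, ~frame.columns.str.startswith("Unnamed")]
print(list(also_named.columns))  # ['action', 'address', 'timestamp']
```

The label slice relies on the wanted columns being contiguous; the name-based filter does not, which may matter if the column order changes between files.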
