How to read an HTML table in pandas and output a DataFrame, not a list

Question

I am reading an HTML table from an HTML file into pandas and want to get it as a DataFrame, not a list, so that I can perform general DataFrame operations.

I get the error below whenever I try anything other than printing the whole DataFrame.

print(dfdefault.shape())
AttributeError: 'list' object has no attribute 'shape'

To expand on @anky_91's suggestion: pd.read_html returns a list of DataFrames. If the HTML you parsed contains only one table, that list has only one element. That is why they suggested dfdefault[0]: it gets the first item in the list, which is a DataFrame.
– piRSquared, Commented May 1, 2019 at 14:52
@BrianCohan I am importing it with dfdefault = pd.read_html(file, header = 0, match='Client Inventory Details')
– Abhinav Kumar, Commented May 1, 2019 at 14:55

Cohan · Accepted Answer · 2019-05-01 15:06:24Z


The pandas read_html() function returns a list of DataFrames, where each DataFrame is a table found on the page. Using Stack Overflow's leagues page as an example, we can see that there are two tables on the right side of the page. As the output below shows, read_html() returns a list.

import pandas as pd

url = 'https://stackexchange.com/leagues/1/alltime/stackoverflow'
df_list = pd.read_html(url)
print(df_list)
# [  Rep Change*   Users <-- first table
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600,
#    Total Rep*     Users <-- second table
# 0    100,000+       697
# 1     50,000+      1963
# 2     25,000+      5082
# 3     10,000+     15477
# 4      5,000+     33541
# 5      3,000+     56962
# 6      2,000+     84551
# 7      1,000+    155430
# 8        500+    272683
# 9        200+    458600
# 10         1+  10381503]

print(len(df_list))
# 2

From here, you just need to specify which table you want to work with. If there's only one table, it's pretty easy to figure out which one to use.

df = df_list[0]
print(df)
#   Rep Change*   Users
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600
print(df.shape)
# (9, 2)
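
For the local-file case in the question, the same idea can be combined with the match argument the asker used. The following is only a minimal sketch: the HTML snippet, its column names, and its values are invented for illustration, and it assumes a parser backend such as lxml is installed. The string is wrapped in StringIO because recent pandas versions deprecate passing literal HTML directly. Even when match narrows the result to a single table, read_html() still returns a list, so the [0] is still needed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the second has the caption
# text that the asker passed to `match`.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table>
  <caption>Client Inventory Details</caption>
  <tr><th>Item</th><th>Quantity</th></tr>
  <tr><td>Widget</td><td>10</td></tr>
  <tr><td>Gadget</td><td>25</td></tr>
</table>
"""

# `match` keeps only tables containing the given text, but the return
# value is still a list of DataFrames.
tables = pd.read_html(StringIO(html), header=0, match="Client Inventory Details")
print(type(tables), len(tables))

# Index into the list to get an actual DataFrame.
dfdefault = tables[0]
print(dfdefault.shape)  # note: shape is an attribute, not a method
```

With a real file, the StringIO(html) argument would be replaced by the file path, as in the asker's own call.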


3 Comments

Abhinav Kumar Over a year ago

I am now able to read the HTML table into pandas, but only very small tables, not tables with some 10,000 records. Any suggestions?

Cohan Over a year ago

Is this a page that you can share with me? I can try to see what I can figure out. If read_html() doesn't work, the next thing I would try is using BeautifulSoup to parse it and add one row at a time to the DataFrame. I have not tried that, though, so I don't know whether it will solve your issue, or why your program isn't letting you read 10,000 records from the table. Another thing I have done in the past is check whether the table is generated from some JSON file and, if so, pull that instead of the rendered HTML.

Abhinav Kumar Over a year ago

Thanks for the help, I will try. Sorry, the file can't be shared, or I would have shared it already.
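
The BeautifulSoup fallback mentioned in the comments was not tried in this thread, so the following is only a rough sketch of what it might look like, not a confirmed fix for the 10,000-record case. The HTML snippet and its contents are invented for illustration, and the sketch assumes the bs4 package is installed. Rather than adding one row at a time to the DataFrame, it collects the rows in a plain list and builds the DataFrame once at the end, which avoids copying the frame on every row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the asker's file.
html = """
<table>
  <tr><th>Item</th><th>Quantity</th></tr>
  <tr><td>Widget</td><td>10</td></tr>
  <tr><td>Gadget</td><td>25</td></tr>
</table>
"""

# Use Python's built-in parser so no extra parser package is required.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Collect the text of every cell, row by row, skipping empty rows.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Treat the first row as the header and the rest as data.
header, *body = rows
df = pd.DataFrame(body, columns=header)
print(df.shape)
```

All values come back as strings with this approach, so numeric columns would need converting afterwards, for example with pd.to_numeric.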
