I am reading an HTML table from an HTML file into pandas, and I want to get it as a DataFrame rather than a list so that I can perform general DataFrame operations.

I get the error below whenever I try anything other than printing the whole object.

print(dfdefault.shape())
AttributeError: 'list' object has no attribute 'shape'
  • How are you importing the html file? Commented May 1, 2019 at 14:45
  • Use df = dfdefault[0] and then df.shape Commented May 1, 2019 at 14:48
  • To expand on @anky_91's suggestion: pd.read_html returns a list of DataFrames. If the HTML you parsed contains only one table, that list has only one element. That is why they suggested dfdefault[0]: it gets the first item in the list, which is a DataFrame. Commented May 1, 2019 at 14:52
  • @BrianCohan I am importing it with dfdefault = pd.read_html(file, header=0, match='Client Inventory Details') Commented May 1, 2019 at 14:55

1 Answer

The pandas read_html() function returns a list of DataFrames, one for each table found on the page. Using Stack Overflow's leagues page as an example, there are two tables on the right side of the page. As the output below shows, read_html() returns a list.

import pandas as pd

url = 'https://stackexchange.com/leagues/1/alltime/stackoverflow'
df_list = pd.read_html(url)
print(df_list)
# [  Rep Change*   Users <-- first table
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600,
#    Total Rep*     Users <-- second table
# 0    100,000+       697
# 1     50,000+      1963
# 2     25,000+      5082
# 3     10,000+     15477
# 4      5,000+     33541
# 5      3,000+     56962
# 6      2,000+     84551
# 7      1,000+    155430
# 8        500+    272683
# 9        200+    458600
# 10         1+  10381503]

print(len(df_list))
# 2

From here, you just need to select the table you want by its index in the list. If there is only one table, it is the element at index 0.

df = df_list[0]
print(df)
#   Rep Change*   Users
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600
print(df.shape)
# (9, 2)
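
Applied to the call from the question, the same idea can be sketched as follows. The inline HTML string is a hypothetical stand-in for the asker's file, and the column names are made up for illustration. Note that read_html needs an HTML parser such as lxml installed, and that shape is an attribute, so it is read without parentheses.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the asker's HTML file: two small tables.
html = """
<table>
  <tr><th>Client Inventory Details</th><th>Qty</th></tr>
  <tr><td>Widget</td><td>3</td></tr>
  <tr><td>Gadget</td><td>5</td></tr>
</table>
<table>
  <tr><th>Other</th><th>Value</th></tr>
  <tr><td>x</td><td>1</td></tr>
</table>
"""

# read_html returns a list even when `match` narrows it down to one table.
tables = pd.read_html(StringIO(html), header=0, match="Client Inventory Details")
print(type(tables).__name__, len(tables))

# Index into the list to get an actual DataFrame.
df = tables[0]
print(df.shape)  # shape is an attribute, not a method
# (2, 2)
```

If more than one table matches, the list has more than one element, and the index selects among them.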

3 Comments

I am able to read HTML tables into pandas now, but only very small ones, not tables with around 10,000 records. Any suggestions?
Is this a page that you can share with me? I can try to see what I can figure out. If read_html() doesn't work, the next thing I would try is parsing the page with BeautifulSoup and adding one row at a time to the DataFrame. I have not tried that, though, so I don't know whether it will solve your issue or what is preventing your program from reading 10,000 records from the table. Another thing I have done in the past is check whether the table is generated from a JSON file and, if so, pull that instead of the rendered HTML.
Thanks for the help, I will try that. Sorry, the file can't be shared, or I would have shared it already.
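
The row-by-row fallback mentioned in the comments can be sketched roughly as below. This is only an illustration under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so it has no extra dependency, the TableRows class and the sample HTML are hypothetical, and it assumes a single, non-nested table. It also collects all rows in a plain list and builds the DataFrame once at the end, rather than appending to the DataFrame one row at a time.

```python
from html.parser import HTMLParser

import pandas as pd


class TableRows(HTMLParser):
    """Hypothetical helper: collects the cell text of every <tr> as a list."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample data standing in for the real page.
html = (
    "<table>"
    "<tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>3</td></tr>"
    "<tr><td>Gadget</td><td>5</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(html)

# Treat the first row as the header and build the DataFrame in one step.
header, *body = parser.rows
df = pd.DataFrame(body, columns=header)
print(df.shape)
# (2, 2)
```

All values come back as strings with this approach, so numeric columns would still need converting, for example with pd.to_numeric.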
