
I am trying to scrape the ownership table from cnbc.com for a university project. I tried different solutions, but it looks like the table is not included in the HTML; it is retrieved separately whenever I open the URL in a web browser. I don't know how to fix this.

Any help?

This is my code:

from bs4 import BeautifulSoup
import requests
import urllib


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.string)

I made a few changes and this is the new code:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)

However, I only get the first row of the table, which is this one:

Filo (David)  70.7M $2,351,860,831

I wonder how I can iterate through the whole table?

  • Try using your browser's page inspector to monitor any asynchronous requests the page makes. Maybe you can capture the URL & parameters it uses to get that table's data. Commented Feb 3, 2016 at 17:03
  • If the page uses JavaScript to generate some data, then you need Selenium, which controls a browser, and the browser can run JavaScript (requests and BeautifulSoup don't run JavaScript). Alternatively, you can analyze the files the server sends to the browser, find the one with the expected data, and request its URL directly. You can use Developer Tools in Chrome or Firebug in Firefox to analyze it manually. Commented Feb 3, 2016 at 17:04
  • Right now you have all tds in one flat list; try print(len(tds)) and you will see 60, so the last cell is tds[59].text. First find_all("tr"), then use a for loop to search for td in every tr; see the new code in my answer. Commented Feb 3, 2016 at 19:09
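A minimal, self-contained sketch of the tr-then-td iteration suggested in the last comment; the HTML below is made up for illustration and only borrows the real table's id, not the actual CNBC markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup only -- the real page's table is much larger.
html = """
<table class="shareholders dotsBelow">
  <tbody id="tBody_institutions">
    <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
    <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find each <tr> first, then the <td> cells inside that row; calling
# find_all('td') directly on the tbody flattens every cell into one list.
for tr in soup.find('tbody', id='tBody_institutions').find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)
```

This prints one line per shareholder instead of only the first three cells of the flattened list.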

3 Answers


Using "Developer Tools" in Chrome I found that your page loads the file

http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O

which has the expected data:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.text)

Result (trimmed; the raw output also contains many empty lines because the HTML has many "\n"):

Name                  Shares Held  Position Value  Pct of Total Holdings since 2/3/16  Pct Owned of Shares Outstanding  Turnover Rating
Filo (David)          70.7M        $2,351,860,831  +9%  7.5%  Low
The Vanguard ...      49.2M        $1,422,524,414  +6%  5.2%  Low
State Street ...      34.4M        $993,071,914    +5%  3.6%  Low
BlackRock ...         32.3M        $935,173,655    +4%  3.4%  Low
Fidelity ...          24.7M        $714,307,904    +3%  2.6%  Low
Goldman Sachs & ...   18.6M        $538,561,672    +2%  2.0%  Low
Mason Capital ...     16.4M        $472,832,995    +2%  1.7%  High
Capital Research ...  12.6M        $365,108,090    +2%  1.3%  Low
TIAA-CREF             10.9M        $315,255,311    +1%  1.2%  Low
T. Rowe Price ...     10.8M        $310,803,286    +1%  1.1%  Low

Name                  Shares Held  Position Value  Pct of Total Holdings since 2/3/16  Pct Owned of Shares Outstanding  Investment Style
Vanguard Total ...    15.6M        $518,104,623    +2%  1.7%  Index
Vanguard 500 ...      10.6M        $352,795,106    +1%  1.1%  Index
Vanguard ...           9.4M        $312,902,098    +1%  1.0%  Index
SPDR S&P 500 ETF       8.8M        $292,985,112    +1%  0.9%  Index
PowerShares QQQ ...    7.6M        $252,776,000    +1%  0.8%  Index
Statens ...            6.7M        $338,173,390    +1%  0.7%  Core Value
First Trust DJ ...     5.6M        $186,778,215    +1%  0.6%  Index
Janus Twenty Fund      5.2M        $150,966,054    +1%  0.6%  Growth
CREF Stock Account     5.0M        $195,517,452    +1%  0.5%  Core Growth
Vanguard Growth ...    4.8M        $159,879,157    +1%  0.5%  Index

EDIT: better version

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    trs = tbody.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        print(tds[0].text, tds[1].text, tds[2].text)

and the result:

Filo (David)  70.7M $2,351,860,831
The Vanguard ...  49.2M $1,422,524,414
State Street ...  34.4M $993,071,914
BlackRock ...  32.3M $935,173,655
Fidelity ...  24.7M $714,307,904
Goldman Sachs & ...  18.6M $538,561,672
Mason Capital ...  16.4M $472,832,995
Capital Research ...  12.6M $365,108,090
TIAA-CREF  10.9M $315,255,311
T. Rowe Price ...  10.8M $310,803,286
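As a side note, pandas can parse HTML tables directly via pandas.read_html, which returns one DataFrame per <table> it finds. A sketch on made-up markup (assuming pandas and an HTML parser such as lxml are installed; the original endpoint is long dead, so a literal HTML string stands in for the response):

```python
from io import StringIO
import pandas as pd

# Illustrative HTML standing in for the page response.
html = """
<table>
  <tr><th>Name</th><th>Shares Held</th><th>Position Value</th></tr>
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(html))[0]
print(df)
```

This skips the manual tr/td loop entirely, which can be handy when the goal is a table for analysis rather than raw text.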

9 Comments

Hi Furas, thanks for the answer. I did not know about the developer tools; I will keep them in mind. I am using your script and it has been running for five minutes with still no output. What could be happening? I am sorry to ask these silly questions, but I am very new to this sort of thing.
On my computer it runs in 2-3 seconds. Do you run it in a console/terminal/cmd.exe so you can see any error messages?
@user1463152 if you have a new working version, then add it to your question (with the header "EDIT:"), not to my answer :)
Sorry, I did not mean to. Perhaps it is better if I write here in the comments.
Better to add it to your question (below the current text); code in a comment is unreadable.

I'm not sure why you are using requests. In addition, the page you reference has no elements with the class "shareholders".

If you remove those two issues, the following code will print out all tables in the HTML:

from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'html.parser')

for row in soup.find_all('table'):
    print(row)

4 Comments

Hi, thanks for the answer. This is what I get: AttributeError: module 'urllib' has no attribute 'urlopen'
That's odd, unless you are using a really really old version of python. What does python --version return?
I am using python 3.5.
Gotcha. I changed it to be python3 compatible.

If you want to use requests, then do not mix it with urllib; change your code to look like the following, because there is no class 'shareholders dotsBelow':

from bs4 import BeautifulSoup
import requests


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(response, 'html.parser')

for row in soup.find_all('table'):
    print(row)

EDIT:

Your changed code can treat the names as a list:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    for zahl, td in enumerate(tds):
        if td.text in ['Filo (David)', 'The Vanguard ...', 'State Street ...', 'T. Rowe Price ...', 'BlackRock ...', 'Fidelity ...', 'Goldman Sachs & ...', 'Mason Capital ...', 'Capital Research ...', 'TIAA-CREF']:
            print(td.text, tds[zahl + 1].text, tds[zahl + 2].text)
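If you would rather not hard-code the names, the flat td list can instead be sliced into fixed-width rows. A self-contained sketch on dummy markup (three cells per row here to keep it short; the real CNBC table has six cells per row):

```python
from bs4 import BeautifulSoup

# Dummy markup: two rows of three cells each.
html = """
<tbody id="tBody_institutions">
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>TIAA-CREF</td><td>10.9M</td><td>$315,255,311</td></tr>
</tbody>
"""

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')  # flat list of every cell

COLS = 3  # cells per row in this dummy table
rows = [tds[i:i + COLS] for i in range(0, len(tds), COLS)]
for row in rows:
    print(row[0].text, row[1].text, row[2].text)
```

This recovers every row without knowing any of the names in advance, though it does assume every row has the same number of cells.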

4 Comments

Hi, thanks for the answer. I used your solution, but it doesn't retrieve the table I want, just a part of the HTML that I do not need.
Ah ok, if you tell me what data you need, I can adjust the code or show you how to get the desired data.
I am trying to extract the tables from that webpage.
Can you maybe provide a screenshot or the exact data structure you want to get the information out of? Because I thought you had a general problem with scraping.
