
I am trying to scrape the ownership table from cnbc.com for a university project. I tried different solutions, but it looks like the table is not included in the HTML; it is retrieved separately whenever I open the URL in a web browser. I don't know how to fix this.

Any help?

This is my code:

from bs4 import BeautifulSoup
import requests
import urllib


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.string)

I made a few changes and this is the new code:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)

However, I only get the first row of the table, which is this one:

Filo (David)  70.7M $2,351,860,831

I wonder how I can iterate through the whole table?

  • Try using your browser's page inspector to monitor any asynchronous requests the page makes. Maybe you can capture the URL & parameters it uses to get that table's data. Commented Feb 3, 2016 at 17:03
  • If the page uses JavaScript to generate some data, then you need Selenium, which controls a browser, and the browser can run JavaScript (requests and BeautifulSoup don't run JavaScript). Alternatively, you can analyze the files the server sends to the browser, find the one with the expected data, and request its URL directly. You can use Developer Tools in Chrome or Firebug in Firefox to analyze it manually. Commented Feb 3, 2016 at 17:04
  • Right now you have all tds in one flat list; try print(len(tds)) and you will see 60, so the last cell is tds[59].text. First find_all("tr"), then use a for loop to search for td in every tr; see the new code in my answer. Commented Feb 3, 2016 at 19:09
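A minimal, self-contained sketch of the tr-then-td iteration suggested in the last comment; the HTML below is made up for illustration and only borrows the real table's id, not the actual CNBC markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup only -- the real page's table is much larger.
html = """
<table class="shareholders dotsBelow">
  <tbody id="tBody_institutions">
    <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
    <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find each <tr> first, then the <td> cells inside that row; calling
# find_all('td') directly on the tbody flattens every cell into one list.
for tr in soup.find('tbody', id='tBody_institutions').find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)
```

This prints one line per shareholder instead of only the first three cells of the flattened list.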

3 Answers


Using "Developer Tools" in Chrome I found that your page loads the file

http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O

which has the expected data:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.text)

Result (trimmed; the raw output also contains many empty lines because the HTML has many "\n"):

Name                  Shares Held  Position Value  Pct of Total Holdings since 2/3/16  Pct Owned of Shares Outstanding  Turnover Rating
Filo (David)          70.7M        $2,351,860,831  +9%  7.5%  Low
The Vanguard ...      49.2M        $1,422,524,414  +6%  5.2%  Low
State Street ...      34.4M        $993,071,914    +5%  3.6%  Low
BlackRock ...         32.3M        $935,173,655    +4%  3.4%  Low
Fidelity ...          24.7M        $714,307,904    +3%  2.6%  Low
Goldman Sachs & ...   18.6M        $538,561,672    +2%  2.0%  Low
Mason Capital ...     16.4M        $472,832,995    +2%  1.7%  High
Capital Research ...  12.6M        $365,108,090    +2%  1.3%  Low
TIAA-CREF             10.9M        $315,255,311    +1%  1.2%  Low
T. Rowe Price ...     10.8M        $310,803,286    +1%  1.1%  Low

Name                  Shares Held  Position Value  Pct of Total Holdings since 2/3/16  Pct Owned of Shares Outstanding  Investment Style
Vanguard Total ...    15.6M        $518,104,623    +2%  1.7%  Index
Vanguard 500 ...      10.6M        $352,795,106    +1%  1.1%  Index
Vanguard ...           9.4M        $312,902,098    +1%  1.0%  Index
SPDR S&P 500 ETF       8.8M        $292,985,112    +1%  0.9%  Index
PowerShares QQQ ...    7.6M        $252,776,000    +1%  0.8%  Index
Statens ...            6.7M        $338,173,390    +1%  0.7%  Core Value
First Trust DJ ...     5.6M        $186,778,215    +1%  0.6%  Index
Janus Twenty Fund      5.2M        $150,966,054    +1%  0.6%  Growth
CREF Stock Account     5.0M        $195,517,452    +1%  0.5%  Core Growth
Vanguard Growth ...    4.8M        $159,879,157    +1%  0.5%  Index

EDIT: better version

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    trs = tbody.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        print(tds[0].text, tds[1].text, tds[2].text)

and the result:

Filo (David)  70.7M $2,351,860,831
The Vanguard ...  49.2M $1,422,524,414
State Street ...  34.4M $993,071,914
BlackRock ...  32.3M $935,173,655
Fidelity ...  24.7M $714,307,904
Goldman Sachs & ...  18.6M $538,561,672
Mason Capital ...  16.4M $472,832,995
Capital Research ...  12.6M $365,108,090
TIAA-CREF  10.9M $315,255,311
T. Rowe Price ...  10.8M $310,803,286
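As a side note, pandas can parse HTML tables directly via pandas.read_html, which returns one DataFrame per <table> it finds. A sketch on made-up markup (assuming pandas and an HTML parser such as lxml are installed; the original endpoint is long dead, so a literal HTML string stands in for the response):

```python
from io import StringIO
import pandas as pd

# Illustrative HTML standing in for the page response.
html = """
<table>
  <tr><th>Name</th><th>Shares Held</th><th>Position Value</th></tr>
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(html))[0]
print(df)
```

This skips the manual tr/td loop entirely, which can be handy when the goal is a table for analysis rather than raw text.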

9 Comments

Hi Furas, thanks for the answer. I did not know about the developer tools; I will keep them in mind. I am using your script and it has been running for five minutes with still no output. What could be happening? I am sorry to ask these silly questions, but I am very new to this sort of thing.
On my computer it runs in 2-3 seconds. Do you run it in a console/terminal/cmd.exe so you can see any error messages?
@user1463152 if you have a new working version, then add it to your question (with the header "EDIT:"), not to my answer :)
Sorry, I did not mean to. Perhaps it is better if I write here in the comments.
Better to add it to your question (below the current text); code in a comment is unreadable.

I'm not sure why you are using requests. In addition, the page you reference has no elements with the class "shareholders".

If you remove those two issues, the following code will print out all tables in the HTML:

from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'html.parser')

for row in soup.find_all('table'):
    print(row)

4 Comments

Hi, thanks for the answer. This is what I get: AttributeError: module 'urllib' has no attribute 'urlopen'
That's odd, unless you are using a really really old version of python. What does python --version return?
I am using python 3.5.
Gotcha. I changed it to be python3 compatible.

If you want to use requests, then do not mix it with urllib; change your code to look like the following, because there is no class 'shareholders dotsBelow':

from bs4 import BeautifulSoup
import requests


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(response, 'html.parser')

for row in soup.find_all('table'):
    print(row)

EDIT:

Your changed code can treat the names as a list:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    for zahl, td in enumerate(tds):
        if td.text in ['Filo (David)', 'The Vanguard ...', 'State Street ...', 'T. Rowe Price ...', 'BlackRock ...', 'Fidelity ...', 'Goldman Sachs & ...', 'Mason Capital ...', 'Capital Research ...', 'TIAA-CREF']:
            print(td.text, tds[zahl + 1].text, tds[zahl + 2].text)
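If you would rather not hard-code the names, the flat td list can instead be sliced into fixed-width rows. A self-contained sketch on dummy markup (three cells per row here to keep it short; the real CNBC table has six cells per row):

```python
from bs4 import BeautifulSoup

# Dummy markup: two rows of three cells each.
html = """
<tbody id="tBody_institutions">
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>TIAA-CREF</td><td>10.9M</td><td>$315,255,311</td></tr>
</tbody>
"""

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')  # flat list of every cell

COLS = 3  # cells per row in this dummy table
rows = [tds[i:i + COLS] for i in range(0, len(tds), COLS)]
for row in rows:
    print(row[0].text, row[1].text, row[2].text)
```

This recovers every row without knowing any of the names in advance, though it does assume every row has the same number of cells.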

4 Comments

Hi, thanks for the answer. I used your solution, but it doesn't retrieve the table I want, just a part of the HTML that I do not need.
Ah ok, if you tell me what data you need, I can adjust the code or show you how to get the desired data.
I am trying to extract the tables from that webpage.
Can you maybe provide a screenshot or the exact data structure you want to get the information out of? Because I thought you had a general problem with scraping.
