I am trying to scrape some data from a website and am having a hard time. I can get the player names, but that's about it at this point; I've been trying different things and coming up short. Here is the sample markup I'm trying to go through. Note that there are two tables (one for each team), and the class on each player row alternates between "odd" and "even". An example HTML file is below, followed by my Python script. I labeled the parts I want. I am using Python 2.7.

`<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
      <thead class="nbaGIClippers">
         <tr>
            <th colspan="17">Los Angeles Clippers (1-0)</th> <!-- I want team name  -->
         </tr>
      </thead>
      <tbody><tr colspan="17">
         <td colspan="17" class="nbaGIBoxCat"><span>field goals</span><span>rebounds</span></td>
      </tr>
      <tr>
     <td class="nbaGITeamHdrStatsNoBord" colspan="1">&nbsp;</td>
     <td class="nbaGITeamHdrStats">pos</td>
     <td class="nbaGITeamHdrStats">min</td>
     <td class="nbaGITeamHdrStats">fgm-a</td>
     <td class="nbaGITeamHdrStats">3pm-a</td>
     <td class="nbaGITeamHdrStats">ftm-a</td>
     <td class="nbaGITeamHdrStats">+/-</td>
     <td class="nbaGITeamHdrStats">off</td>
     <td class="nbaGITeamHdrStats">def</td>
     <td class="nbaGITeamHdrStats">tot</td>
     <td class="nbaGITeamHdrStats">ast</td>
     <td class="nbaGITeamHdrStats">pf</td>
     <td class="nbaGITeamHdrStats">st</td>
     <td class="nbaGITeamHdrStats">to</td>
     <td class="nbaGITeamHdrStats">bs</td>
     <td class="nbaGITeamHdrStats">ba</td>
     <td class="nbaGITeamHdrStats">pts</td>
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td> <!-- I want player name  -->
     <td class="nbaGIPosition">F</td> <!-- I want position name  -->
     <td>14:16</td> <!-- I want this  -->
     <td>1-4</td>  <!-- I want this  -->
     <td>1-2</td>  <!-- I want this  -->
     <td>2-2</td>  <!-- I want this  -->
     <td>+12</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
  </tr>

  <tr class="even">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">F</td>  <!-- I want this  -->
     <td>26:19</td>  <!-- I want this  -->
     <td>5-14</td>  <!-- I want this  -->
     <td>0-1</td>  <!-- I want this  -->
     <td>1-1</td>  <!-- I want this  -->
     <td>+14</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">C</td>  <!-- I want this  -->
     <td>26:27</td>  <!-- I want this  -->
     <td>6-7</td>  <!-- I want this  -->
     <td>0-0</td>  <!-- I want this  -->
     <td>3-5</td>  <!-- I want this  -->
     <td>+19</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
     <td>12</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>15</td>  <!-- I want this  -->
  </tr>
   <!-- And so on; the class keeps alternating between odd and even  -->
    <!-- Also note there are two tables, one for each team  -->
   <!-- this is the table id>>> <table id="nbaGITeamStats" cellpadding="0" cellspacing="0"> -->`

This was long, but I wanted to give an example of the classes alternating. Here is my Python script. I plan to use a dictionary to save the data once I actually scrape it successfully.

import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page)
   for tr in soup.find_all('table id="nbaGITeamStats'):
    tds = tr.find_all('td')
    print tds

3 Answers

2

Here is my solution. Note that I have a slightly different version of BeautifulSoup (not the one that comes from bs4), but the logic should not be too far off. I'm still on Python 2.7 (on Windows in my case).

You will likely need to handle some edge cases for player rows that don't look like the ones you show above, but I think you'll be able to handle that part :-)

import urllib
import urllib2
# from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page)

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
       team_name = table.thead.tr.th.text
       # odd/even class rows (tr)
       rows = [ x for x in table.findAll('tr') if x.get('class',None) in ['odd','even'] ]
       for player in rows:
           # search the row cols based on 'id'
           player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

           # search the row cols based on 'class'
           player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text

           # search for all td where the class is not defined
           player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

           print player_name, player_position, player_numbers

With bs4 (BeautifulSoup4, as I learned) some modifications had to be made. You still have to handle some things, but this extracts most of the data you want:

import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page, "html.parser")

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
       team_name = table.thead.tr.th.text
       # odd/even class rows (tr)
       rows = table.find_all(attrs={'class':'odd'})
       rows.extend(table.find_all(attrs={'class':'even'}))

       for player in rows:
           # search the row cols based on 'id'
           player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

           # search the row cols based on 'class'
           player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text

           # search for all td where the class is not defined
           player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

           print player_name, player_position, player_numbers
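
One thing to watch with the bs4 version above: `rows.extend(...)` lists every odd row first and then every even row, so the players no longer come out in page order. Also, bs4 returns the `class` attribute as a list (for example `['odd']`), so comparing it to a plain string as in the first snippet will not match. Below is a minimal sketch of one way to keep page order. It assumes Python 3 and bs4, and it runs against a trimmed-down inline copy of the question's markup rather than the live page; `SAMPLE_HTML` and `player_rows` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample markup from the question (illustration only).
SAMPLE_HTML = """
<table id="nbaGITeamStats">
  <thead><tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr></thead>
  <tbody>
    <tr><td class="nbaGITeamHdrStats">pos</td><td class="nbaGITeamHdrStats">min</td></tr>
    <tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td><td class="nbaGIPosition">F</td><td>14:16</td></tr>
    <tr class="even"><td id="nbaGIBoxNme" class="b"><a href="#">B. Griffin</a></td><td class="nbaGIPosition">F</td><td>26:19</td></tr>
    <tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">D. Jordan</a></td><td class="nbaGIPosition">C</td><td>26:27</td></tr>
  </tbody>
</table>
"""


def player_rows(table):
    """Return the odd/even rows of a table in the order they appear."""
    wanted = {"odd", "even"}
    # bs4 gives the class attribute as a list of strings; rows without a
    # class (such as the header row) fall back to an empty list here.
    return [tr for tr in table.find_all("tr")
            if wanted & set(tr.get("class", []))]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="nbaGITeamStats")
names = [row.find("td", id="nbaGIBoxNme").get_text(strip=True)
         for row in player_rows(table)]
print(names)
```

The header row has no `class` attribute, so it is skipped without any special casing.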

10 Comments

It seems as if this is trying to pinpoint the data the way I want, but I don't think it works with my version of BeautifulSoup. I'll try to tweak it a bit, though. Thanks for the reply.
If it helps, I got BeautifulSoup installed via pip install BeautifulSoup. I am on Windows 10, Python 2.7.
For some reason this doesn't print anything out. I used the second part you supplied, but I get nothing printed.
I can print the tables and it shows the data. I can then print team_name, but when I get down to rows it shows an empty list. And if I put team_name at the bottom, where player_name and everything else is, for some reason it prints nothing.
That's odd. I copied the code verbatim and it runs fine for me, up until it breaks (I mentioned you'll need to fix some stuff), but it does print this much: pastebin.com/15HjtH5Q
2

The correct way to write it is to pass the tag name and the id separately:

for tr in soup.find_all('table', id='nbaGITeamStats'):

That works fine for me (Python 3.4):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
>>> 
>>> for game in gamesForDay:
...    url =  "http://www.nba.com/"+game
...    page = requests.get(url).content
...    soup = BeautifulSoup(page, 'html.parser')
...    for tr in soup.find_all('table', id='nbaGITeamStats'):
...        tds = tr.find_all('td')
...        print(tds)

To access the content inside a td tag, use .text, like this:

for td in tds:
   print(td.text)

2 Comments

Thank you, this works for the tds. I'm trying to figure out how to get exactly one td, like the one in <td>14:16</td>. Is there a way to pinpoint it by the position of the td?
Yep, you can access 14:16 by calling .text on the needed td. Just count which one you need, or add some conditions to get it.
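
To make "count which one you need" concrete, here is a small sketch, assuming Python 3. The cell texts are typed in by hand from the P. Pierce row in the question as a stand-in for `[td.text for td in row.find_all('td')]`, and the column labels are copied from the header row of the sample table; `COLUMNS` and `label_cells` are hypothetical names for this illustration.

```python
# Column labels copied from the header row of the sample table in the question.
COLUMNS = ["pos", "min", "fgm-a", "3pm-a", "ftm-a", "+/-", "off", "def",
           "tot", "ast", "pf", "st", "to", "bs", "ba", "pts"]


def label_cells(cell_texts):
    """Pair every cell after the player name with its column label."""
    name, values = cell_texts[0], cell_texts[1:]
    stats = dict(zip(COLUMNS, values))
    stats["name"] = name
    return stats


# Stand-in for [td.text for td in row.find_all('td')] on the P. Pierce row.
pierce_cells = ["P. Pierce", "F", "14:16", "1-4", "1-2", "2-2", "+12",
                "1", "0", "1", "1", "3", "2", "0", "0", "0", "5"]

# By position: the minutes are the third cell (index 2).
print(pierce_cells[2])

# By label: zip the cells with the header labels into a dictionary.
pierce = label_cells(pierce_cells)
print(pierce["min"], pierce["pts"])
```

Indexing by position is the quickest route, while the dictionary keeps working if you later forget which index holds which stat.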
1

So here is what I did to get everything to come up. Of course I'll have to clean up the code from here. This was with great help from sal.

import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page, "html.parser")

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rowsodd = table.find_all(attrs={'class':'odd'})
        rowseven = table.find_all(attrs={'class':'even'})

        for player in rowsodd:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

            print player_name, player_numbers
        for player in rowseven:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers

Everything now shows up. I will have to clean it up a bit more, but the data is a lot cleaner. I had actually never used BeautifulSoup before, as you can tell from the question. I needed two row lists (odd and even); perhaps someone knows a better way, but this was the easiest way for me to get the data I was looking for. I'm always looking to improve, though. I hope someone else learns from this.
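
One possible way to fold the two loops into one and keep the position when it is there is sketched below. It assumes Python 3 and a bs4 version whose `select()` accepts comma-separated CSS selectors. The inline markup is a cut-down stand-in for the real page, and the second row ("Bench Player", with no position cell) is invented to imitate a player who has not been given a position yet; `SAMPLE_HTML`, `scrape_table` and `box_score` are made-up names.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the page; the second row is invented for illustration.
SAMPLE_HTML = """
<table id="nbaGITeamStats">
  <thead><tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr></thead>
  <tbody>
    <tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td><td class="nbaGIPosition">F</td><td>14:16</td><td>1-4</td></tr>
    <tr class="even"><td id="nbaGIBoxNme" class="b"><a href="#">Bench Player</a></td><td>05:00</td><td>0-1</td></tr>
  </tbody>
</table>
"""


def scrape_table(table):
    """Return (team name, list of player dicts) for one stats table."""
    team = table.thead.tr.th.get_text(strip=True)
    players = []
    # One CSS selector covers both row classes and keeps page order.
    for row in table.select("tr.odd, tr.even"):
        position_cell = row.find("td", class_="nbaGIPosition")
        players.append({
            "name": row.find("td", id="nbaGIBoxNme").get_text(strip=True),
            # find() returns None when the cell is missing, so guard for it.
            "position": position_cell.get_text(strip=True) if position_cell else None,
            # The stat cells are the ones without a class attribute.
            "numbers": [td.get_text(strip=True)
                        for td in row.find_all("td") if not td.has_attr("class")],
        })
    return team, players


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box_score = {}
for table in soup.find_all("table", id="nbaGITeamStats"):
    team, players = scrape_table(table)
    box_score[team] = players
print(box_score)
```

Keying the dictionary by team name is only one choice; a list of per-player dicts with a "team" field would work just as well.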
