Web Scraping table data in Python

Question

I am trying to scrape a table of data from a web page. All the tutorials I have found online are too specific and don't explain what each argument or element is, so I can't figure out how to adapt them to my example. Any advice on where to find good tutorials for scraping this kind of data would be appreciated. This is what I have tried:

query = urllib.urlencode({'q': company})
page = requests.get('http://www.hoovers.com/company-information/company-search.html?term=company')
tree = html.fromstring(page.text)

table =tree.xpath('//[@id="shell"]/div/div/div[2]/div[5]/div[1]/div/div[1]')

#Can't get xpath correct
#This will create a list of companies:
companies = tree.xpath('//...') 
#This will create a list of locations
locations = tree.xpath('//....')

I have also tried:

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find("table", { "class" : "clear data-table sortable-header dashed-table-tr alternate-rows" })

f = open('output.csv', 'w')
for row in table.findAll('tr'):
    f.write(','.join(''.join([str(i).replace(',','') for i in row.findAll('td',text=True) if i[0]!='&']).split('\n')[1;-1])+'\n')
f.close()

But I am getting an invalid syntax error on the second-to-last line.

cfraschetti · Accepted Answer · 2015-06-15 17:01:35Z

3

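Yes, use Beautiful Soup. Here is a quick example that gets the company names:

# Python 2; the bs4 import assumes Beautiful Soup 4 is installed
import urllib2
from bs4 import BeautifulSoup

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
# urlopen returns a file-like object with no .text attribute, so read() it
soup = BeautifulSoup(page.read())
trs = soup.find("div", attrs={"class": "clear data-table sortable-header dashed-table-tr alternate-rows"}).find("table").findAll("tr")
for tr in trs:
    tds = tr.findAll("td")
    # Skip header rows, which contain no <td> cells
    if len(tds) < 1:
        continue
    name = tds[0].text
    print name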
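
For reference, the syntax error in the question comes from the slice `[1;-1]`, which needs a colon (`[1:-1]`). A minimal Python 3 sketch of the same idea, writing every row to CSV with the standard `csv` module instead of joining strings by hand, might look like the following. It assumes Beautiful Soup 4 (`bs4`) is installed. The inline `SAMPLE_HTML`, its values, and the helper names `table_rows` and `rows_to_csv` are made up for illustration and do not reflect the real Hoovers markup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="data-table">
  <tr><th>Company</th><th>Location</th></tr>
  <tr><td>Acme, Inc.</td><td>Austin, TX</td></tr>
  <tr><td>Globex</td><td>Springfield</td></tr>
</table>
"""


def table_rows(html):
    """Return one list of cell strings per <tr> that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows have <th> cells only, so they are skipped
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; csv.writer quotes cells containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because `csv.writer` quotes any field that contains a comma, there is no need to strip commas out of the cell text. To write a file instead of a string, pass a file object opened with `newline=''` to `csv.writer`.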

edited Jun 15, 2015 at 17:01

answered Jun 15, 2015 at 14:33

cfraschetti



4 Comments

russell_i Over a year ago

Thanks! Very helpful, but I am trying to fetch the page source this way rather than read in an HTML page, as the aim is to build this into a function: hoovers = 'hoovers.com/company-information/…' req = urllib2.Request(hoovers) page = urllib2.urlopen(req) soup = BeautifulSoup(page), but I get an AttributeError saying table has no attribute find when I run the third line of your solution!

cfraschetti Over a year ago

The BeautifulSoup constructor accepts a stream or a string, so you should be able to pass page.text or a streamed version.

russell_i Over a year ago

Sorry, I don't understand. What do you mean by passing page.text or a streamed version?

cfraschetti Over a year ago

In your top example you are calling page.text which returns a string of the contents of the web request. This should be sufficient to call the BeautifulSoup constructor. So, I think what you want is soup = BeautifulSoup(page.text). I am not familiar with how you are using urllib so I can only speak up to the page.text portion of your example. If you can provide more of your code I can try it out.
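
To illustrate the point about strings and streams: the BeautifulSoup constructor takes either markup as a string or an open file-like object. That means both the `page.text` string from `requests` and the file-like response returned by `urlopen` (which has no `.text` attribute) can be passed in. A small Python 3 sketch, assuming Beautiful Soup 4 is installed and using an in-memory `io.BytesIO` as a stand-in for the network response:

```python
import io

from bs4 import BeautifulSoup

markup = "<table><tr><td>Acme</td></tr></table>"  # hypothetical sample markup

# 1. Pass a string, as you would with requests: BeautifulSoup(page.text, ...)
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. Pass a file-like object, as you would with the object urlopen() returns.
#    io.BytesIO stands in for the HTTP response here.
stream = io.BytesIO(markup.encode("utf-8"))
soup_from_stream = BeautifulSoup(stream, "html.parser")

cell_a = soup_from_string.find("td").get_text()
cell_b = soup_from_stream.find("td").get_text()
print(cell_a, cell_b)
```

Both calls should yield the same parse tree, so the choice mostly depends on which HTTP library fetched the page.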
