I am trying to scrape a table of data from a web page. All the tutorials I have found online are too specific and don't explain what each argument/element is, so I can't figure out how to adapt them to my example. Any advice on where to find good tutorials for scraping this kind of data would be appreciated.

query = urllib.urlencode({'q': company})
page = requests.get('http://www.hoovers.com/company-information/company-search.html?term=company')
tree = html.fromstring(page.text)

table = tree.xpath('//*[@id="shell"]/div/div/div[2]/div[5]/div[1]/div/div[1]')

#Can't get xpath correct
#This will create a list of companies:
companies = tree.xpath('//...') 
#This will create a list of locations
locations = tree.xpath('//....')

I have also tried:

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find("table", { "class" : "clear data-table sortable-header dashed-table-tr alternate-rows" })

f = open('output.csv', 'w')
for row in table.findAll('tr'):
    f.write(','.join(''.join([str(i).replace(',','') for i in row.findAll('td',text=True) if i[0]!='&']).split('\n')[1;-1])+'\n')
f.close()    

But I am getting an invalid syntax error on the second-to-last line.

1 Answer

Yes, use Beautiful Soup. Here is a quick example to get the names:

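import urllib2
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 is installed

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
# urlopen returns a file-like object with no .text attribute, so read it explicitly
soup = BeautifulSoup(page.read())
trs = soup.find("div", attrs={"class": "clear data-table sortable-header dashed-table-tr alternate-rows"}).find("table").findAll("tr")
for tr in trs:
    tds = tr.findAll("td")
    # skip rows without td cells, such as a header row
    if len(tds) < 1:
        continue
    name = tds[0].text
    print name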
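
As for the invalid syntax error in the question: the slice `[1;-1]` uses a semicolon where Python expects a colon (`[1:-1]`). Rather than joining strings by hand, the rows could also be written with the standard `csv` module, which quotes cells that contain commas instead of requiring them to be stripped. The following is only a sketch in Python 3: `write_rows` is a hypothetical helper, and `rows` is made-up sample data standing in for the text of each row's `td` cells.

```python
import csv
import io


def write_rows(rows, out):
    """Write each row (a list of cell strings) to the file-like object `out` as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Strip surrounding whitespace; csv.writer quotes cells that contain commas.
        writer.writerow([cell.strip() for cell in row])


# Made-up sample data standing in for the scraped table cells.
rows = [
    ["Example Co, Inc.", " Austin, TX "],
    ["Sample Ltd", "London"],
]

buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open('output.csv', 'w', newline='')` as `out`.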

4 Comments

Thanks! Very helpful, but I am trying to fetch the page source this way rather than read in an HTML page, as the aim is to build this into a function: hoovers = 'hoovers.com/company-information/…' req = urllib2.Request(hoovers) page = urllib2.urlopen(req) soup = BeautifulSoup(page), but I get an AttributeError saying that table has no attribute find when I run the third line of your solution!
The BeautifulSoup constructor accepts a stream or a string, so you should be able to pass page.text or a streamed version.
Sorry, I don't understand what you mean by "pass page.text or a streamed version".
In your top example you are calling page.text which returns a string of the contents of the web request. This should be sufficient to call the BeautifulSoup constructor. So, I think what you want is soup = BeautifulSoup(page.text). I am not familiar with how you are using urllib so I can only speak up to the page.text portion of your example. If you can provide more of your code I can try it out.
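
The confusion in this thread seems to come from the two kinds of response object in the question: a `requests` response exposes the body as a `.text` attribute, while `urlopen` returns a file-like object that has `.read()` but no `.text`. A rough sketch of handling both follows, in Python 3. `read_markup` and `FakeRequestsResponse` are hypothetical names invented for this illustration, and `io.BytesIO` is used as a stand-in for the file-like object that `urlopen` returns.

```python
import io


def read_markup(page):
    """Return the page body from either kind of response object."""
    if hasattr(page, "text"):
        # requests-style response: the decoded body is the .text attribute.
        return page.text
    # urllib-style response: a file-like object, so read it explicitly.
    return page.read()


class FakeRequestsResponse(object):
    """Hypothetical stand-in for a requests response; only .text is modelled."""

    def __init__(self, text):
        self.text = text


html_doc = "<table><tr><td>Example Co</td></tr></table>"

from_requests = read_markup(FakeRequestsResponse(html_doc))
from_urllib = read_markup(io.BytesIO(html_doc.encode("utf-8")))
```

Either result (a string in the first case, bytes in the second) can then be passed to the BeautifulSoup constructor.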
