
I want to scrape all the href contents from elements with the class "news" (the URL is given in the code below). I tried this code, but it is not working:

Code:

from bs4 import BeautifulSoup
from selenium import webdriver

Base_url = "http://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/"

driver = webdriver.Chrome()
driver.set_window_position(-10000,-10000)
driver.get(Base_url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for div in soup.find_all('div', class_='news'):  
    a = div.findAll('a')   
    print(a['href'])

Thank you

  • I think your issue is that the page doesn't have any divs with news class. It has articles with news class. Commented Feb 10, 2018 at 7:08
  • @jayant do you know any method to scrape all those hrefs? I want all those href contents (latest news). Commented Feb 10, 2018 at 7:16

2 Answers


The content you want is located inside the frame:

<iframe width="100%" frameborder="0" src="http://hindubusiness.cmlinks.com/Companydetails.aspx?&cocode=INE117A01022" id="compInfo" height="600px">...</iframe>
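
A side note on why the original loop found nothing: the outer page's HTML contains only this `<iframe>` tag, while the news links live in the separate document the frame loads. A minimal stdlib sketch (the `OUTER_HTML` string below is a hypothetical trimmed copy of the outer page) that pulls the frame's `src` out of the outer markup:

```python
from html.parser import HTMLParser

# Hypothetical trimmed copy of the outer page's HTML
OUTER_HTML = ('<div><iframe id="compInfo" '
              'src="http://hindubusiness.cmlinks.com/Companydetails.aspx?&cocode=INE117A01022">'
              '</iframe></div>')

class IframeSrcFinder(HTMLParser):
    """Record the src attribute of the iframe with id='compInfo'."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            attrs = dict(attrs)
            if attrs.get('id') == 'compInfo':
                self.src = attrs.get('src')

parser = IframeSrcFinder()
parser.feed(OUTER_HTML)
print(parser.src)
```

Searching the outer soup for `.news` can never match, because that markup simply is not in `driver.page_source` until you switch into the frame.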

So, first you'll have to switch to that frame. You can do this by adding these lines:

driver.switch_to.default_content()
driver.switch_to.frame('compInfo')

Complete code (making it headless):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Base_url = "http://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/"

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)  # 'chrome_options=' is deprecated in newer Selenium
driver.get(Base_url)
driver.switch_to.frame('compInfo')
soup = BeautifulSoup(driver.page_source, 'lxml')
for link in soup.select('.news a'):  
    print(link['href'])

Output:

/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17040010444&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17038039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019038003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019010085&opt=9
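
Note that these hrefs are relative to the frame's own origin (cmlinks.com), not to thehindubusinessline.com. The standard library's `urllib.parse.urljoin` can resolve them against the iframe's `src`; a small sketch using the first path from the output above:

```python
from urllib.parse import urljoin

# The iframe's src from the answer above; href is the first line of the output.
IFRAME_SRC = "http://hindubusiness.cmlinks.com/Companydetails.aspx?&cocode=INE117A01022"
href = "/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17040010444&opt=9"

full_url = urljoin(IFRAME_SRC, href)
print(full_url)
# http://hindubusiness.cmlinks.com/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17040010444&opt=9
```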

9 Comments

  • Can we hide that Chrome pop-up window?
  • Yes, we can. Search how to add arguments with chrome_options, and use --headless.
  • Or even better, use PhantomJS.
  • If choosing a headless browser, I think Chrome is better than PhantomJS, as the former has fewer issues than the latter when it comes to initiating clicks. However, the script above can now run headlessly. @python
  • The issue you are facing is path-related, so there are two options: 1. create a new post describing the barrier you are facing and drop a link here, or 2. stick to the way you were doing it first.

Something like this will work:

for div in soup.find_all('article', 'news'):
    anchors = div.findAll('a')
    links = [a['href'] for a in anchors]
    print(links)
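
For what it's worth, the error in the original snippet comes from indexing a list with a string: `findAll` returns a list of tags, so `a['href']` raises a `TypeError`. A tiny plain-Python illustration (the dicts below are hypothetical stand-ins for bs4 `Tag` objects, which also support `['href']`):

```python
# Hypothetical stand-ins for the Tag objects findAll('a') returns
anchors = [{"href": "/first"}, {"href": "/second"}]

# anchors["href"] would raise TypeError: list indices must be integers or slices
links = [a["href"] for a in anchors]  # iterate over the list instead
print(links)  # ['/first', '/second']
```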

1 Comment

I want the hrefs from the company news section.
