Python Scraping JavaScript using Selenium and Beautiful Soup

Question

I'm trying to scrape a JavaScript-enabled page using Beautiful Soup and Selenium. I have the following code so far, but it doesn't seem to pick up the JavaScript-generated content and returns an empty result. In this case I'm trying to scrape the Facebook comments at the bottom of the page. (Inspect Element shows their class as postText.)
Thanks for the help!

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup.BeautifulSoup(html_source)  
comments = soup("div", {"class":"postText"})  
print comments

You may want to try setting a wait on the page - you are likely exiting before the page has time to fully load (remember, it is just like a browser and experiences latency). In your case, you could likely solve it by just waiting for a certain period of time, but the more elegant solution(s) can be found at seleniumhq.org/docs/04_webdriver_advanced.jsp#implicit-waits
– RocketDonkey, Commented Jan 25, 2013 at 20:30
I'm not too sure if the wait was the issue, as I removed browser.quit() and ran the program. There was no luck.
– Jay Setti, Commented Jan 25, 2013 at 21:35
The problem is actually the line before - it is loading page_source before there is any source to be loaded :)
– RocketDonkey, Commented Jan 25, 2013 at 21:50

Community · Accepted Answer · 2014-10-29 09:58:24Z

10

There are some mistakes in your code that are fixed below: BeautifulSoup is now imported from bs4, and the parser is passed explicitly. However, the class "postText" does not appear in the page's original source code, so it must be generated somewhere else. My revised version of your code was tested and works on multiple websites.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source  # HTML of the page as currently rendered
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
# class "postText" is not defined in the source code
comments = soup.find_all('div', class_='postText')
print(comments)
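
One of the comments on the question suggests that page_source is being read before the scripts have produced any content. A minimal sketch of that waiting idea follows. It is not the answerer's code: the helper name `wait_for_source`, its parameters, and the `FakeDriver` stand-in are all hypothetical, and the loop is hand-rolled so the example runs without a browser.

```python
import time


def wait_for_source(driver, marker, timeout=10.0, poll=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` seconds pass.

    `driver` is any object with a `page_source` attribute, such as a
    Selenium WebDriver. `marker` is a substring expected in the rendered
    HTML, for example 'postText'. Both names are illustrative.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # re-read on every pass, the DOM may have changed
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a browser whose content only shows up on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<html><body></body></html>"
        return '<html><body><div class="postText">Nice post</div></body></html>'


fake = FakeDriver()
html = wait_for_source(fake, "postText", timeout=2.0, poll=0.01)
```

With a real browser, the `webdriver.Firefox()` instance would be passed as `driver` after `browser.get(...)`, `browser.quit()` would be called only once the HTML has been returned, and the string would then go to `BeautifulSoup(html, 'html.parser')`. Selenium ships its own `WebDriverWait` helper for this purpose, which is probably the better choice in practice. Note that if the comments are rendered inside an embedded frame, they would not appear in the top-level page_source no matter how long the wait.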

answered Mar 22, 2014 at 5:04 by user3186527
edited Oct 29, 2014 at 9:58 by CommunityBot


Thanks for this. It really helped me save a lot of time.
– Max, Commented over a year ago
