
I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium. I have the following code so far, but it somehow still doesn't pick up the JavaScript-generated content (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom of the page. (Inspect Element shows their class as postText.)
Thanks for the help!

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup.BeautifulSoup(html_source)  
comments = soup("div", {"class":"postText"})  
print comments
  • You may want to try setting a wait on the page - you are likely exiting before the page has time to fully load (remember, it is just like a browser and experiences latency). In your case, you could likely solve it by just waiting for a certain period of time, but the more elegant solution(s) can be found at seleniumhq.org/docs/04_webdriver_advanced.jsp#implicit-waits Commented Jan 25, 2013 at 20:30
  • I'm not too sure if the wait was the issue, as I removed browser.quit() and ran the program. There was no luck. Commented Jan 25, 2013 at 21:35
  • The problem is actually the line before - it is loading page_source before there is any source to be loaded :) Commented Jan 25, 2013 at 21:50

1 Answer

Your code has a few mistakes, including the BeautifulSoup import and the missing parser argument; they are fixed below. Note, however, that the class "postText" is not defined in the page's original source code, so it must exist elsewhere. I tested the revised version of your code and it works on multiple websites.

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')  
# The class "postText" is not defined in the page's source code
comments = soup.find_all('div', {'class': 'postText'})  
print(comments)
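
The comments under the question suggest the real issue is timing: page_source is read before the JavaScript has rendered anything. Here is a minimal sketch of waiting for the element first, assuming the comments eventually appear in the main document as div.postText (I have not verified that for this page). The helper names page_source_when_present and fetch_comments are made up for illustration, and the hand-rolled polling loop stands in for Selenium's own WebDriverWait only so the waiting logic can be tried without a real browser.

```python
import time


def page_source_when_present(browser, css_selector, timeout=10.0, poll=0.5):
    """Poll until css_selector matches at least one element, then return the page source.

    Hypothetical helper: `browser` only needs find_elements(by, value) and
    page_source, which a Selenium WebDriver provides.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
        if browser.find_elements("css selector", css_selector):
            return browser.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError("no element matched %r within %s s" % (css_selector, timeout))
        time.sleep(poll)


def fetch_comments(url, class_name="postText", timeout=10.0):
    """Load url in Firefox, wait for div.<class_name>, and return the matching tags."""
    # Imported here so the polling helper above can be used without these packages
    from selenium import webdriver
    from bs4 import BeautifulSoup

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        html_source = page_source_when_present(browser, "div." + class_name, timeout)
    finally:
        # Quit only after the source has been captured (or the wait has failed)
        browser.quit()

    soup = BeautifulSoup(html_source, "html.parser")
    return soup.find_all("div", class_=class_name)
```

Calling fetch_comments('http://techcrunch.com/2012/05/15/facebook-lightbox/') would then either return the matching div tags or raise TimeoutError if nothing with that class shows up in time, which at least tells you whether the element is ever present in the main document.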

1 Comment

Thanks for this. It really helped me save a lot of time.
