
I'm trying to build a web crawler to get the trending stocks from the TSX page. I currently get all the trending links; now I'm trying to scrape the information on the individual pages. Based on my code, when I try to output "quote_wrapper" in getStockDetails() it returns an empty list. I suspect it's because the JavaScript has not been rendered on the page yet? Not sure if that's a thing. Anyway, I tried to output all the HTML on the page to debug and I don't see it there either. I read that the only way to "render" the JavaScript is to use Selenium and call browser.execute_script("return document.documentElement.outerHTML"). It worked for the index page, so I tried to use it on the other pages too. I also made a comment about it in the code. Thanks for your help, if you can.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup

import time
import random
import requests


def getTrendingQuotes(source_code):
    # grabs all the trending quotes for that day
    links = []
    page_soup = soup(source_code, "lxml")
    trendingQuotes = page_soup.findAll("div", {"id": "trendingQuotes"})
    all_trendingQuotes = trendingQuotes[0].findAll('a')
    for link in all_trendingQuotes:
        url = link.get('href')
        name = link.text
        # print(name)
        links.append(url)
    return links


def getStockDetails(url, browser):
    print(url)
    source_code = browser.execute_script(
        "return document.documentElement.outerHTML")

    #What is the correct syntax here?
    #I'm trying to get the innerHTML of whole page in selenium driver
    #It seems I can only access the JavaScript for the entire page this way

    # source_code = browser.execute_script(
    #    "return" + url +".documentElement.outerHTML")

    page_soup = soup(source_code, "html.parser")
    # print(page_soup)
    quote_wrapper = page_soup.findAll("div", {"class": "quoteWrapper"})
    print(quote_wrapper)


def trendingBot(browser):

    while True:
        source_code = browser.execute_script(
            "return document.documentElement.outerHTML")
        trending = getTrendingQuotes(source_code)
        for trend in trending:
            browser.get(trend)
            getStockDetails(trend, browser)
        break
        # print(trend)


def Main():

    url = 'https://www.tmxmoney.com/en/index.html'
    browser = webdriver.Chrome(
        r"C:\Users\austi\OneDrive\Desktop\chromeDriver\chromedriver_win32\chromedriver.exe")
    browser.get(url)

    print("[+] Success! Bot Starting!")
    trendingBot(browser)
    browser.quit()


if __name__ == "__main__":
    Main()
  • Is the question about rendering HTML or rendering JavaScript? It's not clear what you're looking for. Commented Dec 9, 2018 at 7:58
  • Rendering JavaScript. I'm trying to access the div element "quoteWrapper" in getStockDetails. However, it returns an empty list. Commented Dec 9, 2018 at 8:02
  • Your class looks wrong. Shouldn't it be quote-wrapper? Commented Dec 9, 2018 at 8:28
  • No. .get takes you to each page, and then your find method returns the info. Commented Dec 9, 2018 at 8:42
  • @QHarr thank you so much for clarifying that for me. Commented Dec 9, 2018 at 8:44

1 Answer


Please do not mix BeautifulSoup and Selenium; it's unnecessary. To scrape a page whose content is generated by JavaScript, you need to wait until the element has been generated, so use WebDriverWait. You can also get the rendered page source with browser.page_source, but it isn't needed here.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait 

def getTrendingQuotes(browser):
    # wait until trending links appear, not really needed only for example
    all_trendingQuotes = WebDriverWait(browser, 10).until(
        lambda d: d.find_elements_by_css_selector('#trendingQuotes a')
    ) 
    return [link.get_attribute('href') for link in all_trendingQuotes]

def getStockDetails(url, browser):
    print(url)
    browser.get(url)
    quote_wrapper = browser.find_element_by_css_selector('div.quote-wrapper')
    print(quote_wrapper.text)
    #print(quote_wrapper.get_attribute('outerHTML'))

def trendingBot(url, browser):
    browser.get(url)
    trending = getTrendingQuotes(browser)
    for trend in trending:
        getStockDetails(trend, browser)

def Main():
    url = 'https://www.tmxmoney.com/en/index.html'
    browser = webdriver.Chrome(
        r"C:\Users\austi\OneDrive\Desktop\chromeDriver\chromedriver_win32\chromedriver.exe")
    print("[+] Success! Bot Starting!")
    trendingBot(url, browser)
    browser.quit()

if __name__ == "__main__":
    Main()

4 Comments

The answer was great, but suggesting time.sleep() would have killed it.
thanks, it's just an easier method without much typing :D
@ewwink could you explain further what you mean by not mixing Selenium and BeautifulSoup? And this might be silly, but I'm not sure how to parse the text of the individual elements inside the quote-wrapper; for example, storing the "quote-name" or "quote-price" in a variable.
What BeautifulSoup can do, Selenium can do as well, and even better, because it uses a real browser as its HTML parser. You can read how to locate elements; to select a child inside quote-wrapper, use quote_wrapper.find_element_by_class_name("quote-name").
