
I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:

https://www.erstebank.hr/hr/tecajna-lista

https://www.otpbanka.hr/tecajna-lista

https://www.sberbank.hr/tecajna-lista/

For all 3 websites the result is the HTML code of the table, but without any text.

My code is below:

import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime

from selenium import webdriver

PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'

driver = webdriver.Chrome(PATH)

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

driver.implicitly_wait(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')

print(table)

driver.close()

Please help: what am I missing?

Thank you

  • I ran your code and got this output `[<table class="ebc-table"> <thead> <tr> <th> </th> <th>Val.</th> <th class="fade ng-hide" ng-show="vm.expanded">Šifra</th> <th>Jed.</th> <th align="center">Kupovni za efektivu</th> <th align="center">Kupovni za devize</th> <th align="center">Srednji tečaj</th> <th align="center">Prodajni za devize</th> <th align="center">Prodajni za efektivu</th> <th align="center">Srednji tečaj HNB-a</th> </tr> </thead> Commented Sep 29, 2021 at 11:18
  • There is more output, but I can't paste it all in the comment box; SO does not allow it. Commented Sep 29, 2021 at 11:18
  • The issue here seems to be due to a cookie request dialogue. See my answer below. Commented Sep 29, 2021 at 11:42

3 Answers


The website takes time to load the data into the table.

Either apply time.sleep:

import time

driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...

Or apply an explicit wait so that the rows are loaded into the table:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))

# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up. 
soup = BeautifulSoup(driver.page_source, 'html5lib')

table = soup.find_all('table')

print(table)
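
As a possible follow-up (not part of the original answer): once the rows are present, you could also hand the rendered HTML to pandas, which parses every <table> into a DataFrame. This assumes pandas is installed; a minimal sketch:

from io import StringIO

import pandas as pd

# read_html parses every <table> in the HTML and returns a list of DataFrames
tables = pd.read_html(StringIO(driver.page_source))
print(tables[0].head())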

BeautifulSoup will not find the table because it doesn't exist from its reference point. Here, you tell Selenium to pause its element matching if it notices that an element is not present yet:

# This only works for the Selenium element matcher
driver.implicitly_wait(10)

Then, right after that, you take the current HTML state (the table still does not exist) and hand it to BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it only has the snapshot of the HTML you just gave it:

# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')

# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')

# BS4 finds no tables as, when the page first loads, there are none.

To fix this, you can ask Selenium to fetch the HTML table itself. Because Selenium uses the implicitly_wait you specified earlier, it will wait until the element exists, and only then let the rest of the code proceed. At that point, when BS4 receives the HTML code, the table will be there.

driver.implicitly_wait(10)

# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')
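
Note that recent Selenium 4 releases removed the find_element_by_* helper methods; if you are on a newer version, the equivalent call uses By locators:

from selenium.webdriver.common.by import By

# Same XPath as above, written with the Selenium 4 locator API
driver.find_element(By.XPATH, "/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")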

However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.

The data is asynchronously loaded from an endpoint, shown in the code below (you can use the Chrome DevTools Network tab to find it yourself). You can pair this with the json module to turn the response into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource-intensive (Selenium has to open a whole browser window).

from requests import get
from json import loads

# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text

# Turn to dictionary
data_dictionary = loads(data_as_text)
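
Side note: requests can also decode JSON directly via Response.json(), so the json module isn't strictly needed; an equivalent sketch:

from requests import get

# .json() decodes the body for you, equivalent to json.loads(response.text)
data_dictionary = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()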

5 Comments

  • Thank you very much for the help and the quick response!
  • No problem! If my answer helped you, please upvote it and click the check mark! :)
  • Hello, I tried to use requests as you suggested for the other 2 web pages. On otpbanka.hr/tecajna-lista the Request URL is otpbanka.hr/otp/ajax/exchange, but the Method is POST. In DevTools, in the Response tab, I can see the JSON for the content; is there a way I can read it in Python? On sberbank.hr/tecajna-lista the Request URL is sberbank.hr/umbraco/api/ExchangeRates/… and the dateString value is probably the current date as a UNIX timestamp. Is there a way to create this dateString value in Python? (A sketch addressing both follows these comments.)
  • Open a new question for that, and tag me in the comments so I can answer. Comments aren't usually used for open-ended discussion :(
  • @sitni By the way, I'd recommend using requests in general. It's much faster than Selenium since you don't have to load a whole browser window.
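
For reference, a hedged sketch of both points from the comment above. The exact form fields the otpbanka endpoint expects (if any) must be copied from the DevTools Network tab, and whether sberbank wants the timestamp in seconds or milliseconds should be checked against the captured request:

import time

import requests

# POST to the otpbanka endpoint; any required form fields (copied from
# the DevTools request payload) would go into the `data` dictionary
response = requests.post('https://www.otpbanka.hr/otp/ajax/exchange', data={})
print(response.json())

# Current date/time as a UNIX timestamp for the sberbank dateString
date_string = int(time.time())            # seconds since the epoch
date_string_ms = int(time.time() * 1000)  # milliseconds, if the API expects that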

You can use this as the foundation for further work:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

TDCLASS = 'ng-binding'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
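
If you want just the cell text rather than the full <td> markup, BeautifulSoup's get_text() strips the tags:

    for td in soup.find_all('td', class_=TDCLASS):
        print(td.get_text(strip=True))  # text content only, whitespace trimmed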

1 Comment

Thank you very much for the help and the quick response!
