
I am trying to scrape tables using Selenium and BeautifulSoup from these 3 websites:

https://www.erstebank.hr/hr/tecajna-lista

https://www.otpbanka.hr/tecajna-lista

https://www.sberbank.hr/tecajna-lista/

For all 3 websites the result is the HTML code of the table, but without any text.

My code is below:

import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime

from selenium import webdriver

PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'

driver = webdriver.Chrome(PATH)

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

driver.implicitly_wait(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')

print(table)

driver.close()

Please help: what am I missing?

Thank you

  • I ran your code and got this output `[<table class="ebc-table"> <thead> <tr> <th> </th> <th>Val.</th> <th class="fade ng-hide" ng-show="vm.expanded">Šifra</th> <th>Jed.</th> <th align="center">Kupovni za efektivu</th> <th align="center">Kupovni za devize</th> <th align="center">Srednji tečaj</th> <th align="center">Prodajni za devize</th> <th align="center">Prodajni za efektivu</th> <th align="center">Srednji tečaj HNB-a</th> </tr> </thead> Commented Sep 29, 2021 at 11:18
  • There is more output, but I can't paste it all in the comment box; SO does not allow it. Commented Sep 29, 2021 at 11:18
  • The issue here seems to be due to a cookie request dialogue. See my answer below. Commented Sep 29, 2021 at 11:42

3 Answers


The website takes time to load the data into the table.

Either apply time.sleep:

import time

driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...

Or apply an explicit wait so that the rows are loaded into the table:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))

# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up. 
soup = BeautifulSoup(driver.page_source, 'html5lib')

table = soup.find_all('table')

print(table)
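
As a possible follow-up (not part of the original answer): once the rows are present, you could also hand the rendered HTML to pandas, which parses every <table> into a DataFrame. This assumes pandas is installed; a minimal sketch:

from io import StringIO

import pandas as pd

# read_html parses every <table> in the HTML and returns a list of DataFrames
tables = pd.read_html(StringIO(driver.page_source))
print(tables[0].head())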

BeautifulSoup will not find the table because it doesn't exist from its reference point. Here, you tell Selenium to pause its element matching if it notices that an element is not present yet:

# This only works for the Selenium element matcher
driver.implicitly_wait(10)

Then, right after that, you take the current HTML state (the table still does not exist) and hand it to BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it only has the snapshot of the HTML you just gave it:

# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')

# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')

# BS4 finds no tables as, when the page first loads, there are none.

To fix this, you can ask Selenium to fetch the HTML table itself. Because Selenium uses the implicitly_wait you specified earlier, it will wait until the element exists, and only then let the rest of the code proceed. At that point, when BS4 receives the HTML code, the table will be there.

driver.implicitly_wait(10)

# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')
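
Note that recent Selenium 4 releases removed the find_element_by_* helper methods; if you are on a newer version, the equivalent call uses By locators:

from selenium.webdriver.common.by import By

# Same XPath as above, written with the Selenium 4 locator API
driver.find_element(By.XPATH, "/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")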

However, this is a bit overkill. Yes, you can use Selenium to parse the HTML, but you could also just use the requests module (which, from your code, I see you already have imported) to get the table data directly.

The data is asynchronously loaded from an endpoint, shown in the code below (you can use the Chrome DevTools Network tab to find it yourself). You can pair this with the json module to turn the response into a nicely formatted dictionary. Not only is this method faster, but it is also much less resource-intensive (Selenium has to open a whole browser window).

from requests import get
from json import loads

# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text

# Turn to dictionary
data_dictionary = loads(data_as_text)
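
Side note: requests can also decode JSON directly via Response.json(), so the json module isn't strictly needed; an equivalent sketch:

from requests import get

# .json() decodes the body for you, equivalent to json.loads(response.text)
data_dictionary = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()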

5 Comments

  • Thank you very much for the help and the quick response!
  • No problem! If my answer helped you, please upvote it and click the check mark! :)
  • Hello, I tried to use requests as you suggested for the other 2 web pages. On otpbanka.hr/tecajna-lista the Request URL is otpbanka.hr/otp/ajax/exchange, but the Method is POST. In DevTools, in the Response tab, I can see the JSON for the content; is there a way I can read it in Python? On sberbank.hr/tecajna-lista the Request URL is sberbank.hr/umbraco/api/ExchangeRates/… and the dateString value is probably the current date as a UNIX timestamp. Is there a way to create this dateString value in Python? (A sketch addressing both follows these comments.)
  • Open a new question for that, and tag me in the comments so I can answer. Comments aren't usually used for open-ended discussion :(
  • @sitni By the way, I'd recommend using requests in general. It's much faster than Selenium since you don't have to load a whole browser window.
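
For reference, a hedged sketch of both points from the comment above. The exact form fields the otpbanka endpoint expects (if any) must be copied from the DevTools Network tab, and whether sberbank wants the timestamp in seconds or milliseconds should be checked against the captured request:

import time

import requests

# POST to the otpbanka endpoint; any required form fields (copied from
# the DevTools request payload) would go into the `data` dictionary
response = requests.post('https://www.otpbanka.hr/otp/ajax/exchange', data={})
print(response.json())

# Current date/time as a UNIX timestamp for the sberbank dateString
date_string = int(time.time())            # seconds since the epoch
date_string_ms = int(time.time() * 1000)  # milliseconds, if the API expects that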

You can use this as the foundation for further work:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

TDCLASS = 'ng-binding'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
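
If you want just the cell text rather than the full <td> markup, BeautifulSoup's get_text() strips the tags:

    for td in soup.find_all('td', class_=TDCLASS):
        print(td.get_text(strip=True))  # text content only, whitespace trimmed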

1 Comment

Thank you very much for the help and the quick response!
