
I'm trying to get all the events, plus additional metadata for those events, from this webpage: https://alando-palais.de/events

My problem is that the result (HTML) doesn't contain the information I'm looking for. I guess it is loaded afterwards by some PHP script, via this URL: 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

Any idea how to wait until the page is completely loaded, or what kind of method I have to use to get the event information?

This is my script right now :-) :

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])

Expected output would be something like the fields in this fragment of the fully rendered result:

<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">  <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top"><div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">   <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
    </div>
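Once the rendered HTML is in hand, the fields in a grid item like the fragment above can be picked out by class name. A minimal sketch with BeautifulSoup, using the class names from the fragment (the real page may nest things differently):

```python
from bs4 import BeautifulSoup

# The fragment from the expected output above, trimmed to the parts used here.
fragment = '''
<div class="vc_gitem-zone vc_gitem-is-link">
  <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends"
     title="Penthouse Club Special: Maiwai &#038; Friends"
     class="vc_gitem-link vc-zone-link"></a>
  <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg"
       class="vc_gitem-zone-img" alt="">
  <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019 </div>
</div>
'''

soup = BeautifulSoup(fragment, 'html.parser')
link = soup.select_one('a.vc_gitem-link')
event = {
    'title': link['title'],                                          # entity &#038; decodes to &
    'url': link['href'],
    'image': soup.select_one('img.vc_gitem-zone-img')['src'],
    'date': soup.select_one('.eventdatum').get_text(strip=True),
}
print(event)
```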
  • what's the expected output? One example perhaps as I'm unsure what you mean by metadata. Commented Mar 8, 2019 at 20:55
  • I've added some results, I'd like to extract. Commented Mar 8, 2019 at 21:01
  • if php requests are fired off by javascript, the results will not be available when the base page is loaded, you would have to render it to have the data calls made.. maybe use selenium to render the results and then get the final page out when it is done. Commented Mar 8, 2019 at 21:13
  • Yeah, I thought about using selenium or pyqt to emulate a "real" browser. Could you provide a few lines to get me started with selenium? Commented Mar 8, 2019 at 21:15
  • this has some info stackoverflow.com/questions/29404856/… Commented Mar 8, 2019 at 21:15

2 Answers


You could mimic the XHR POST made by the page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'
}

res = requests.post(url, data = data)
soup = BeautifulSoup(res.content, 'lxml')
dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = soup.select('.vc_gitem-link')[::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]
titles = []
links = []
for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])
print(results)
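If that request comes back empty, the likely culprit is the hard-coded `_vcnonce`, which WordPress rotates; scraping a fresh nonce out of the events page before posting should fix it. A minimal sketch with a regex (the `data-vc-public-nonce` attribute name is an assumption; inspect the grid element in the real page source to confirm which attribute carries the nonce):

```python
import re

# Hypothetical excerpt of the grid markup on the events page; the real
# attribute name may differ -- check the page source in your browser.
page_html = '<div class="vc_grid-container" data-vc-public-nonce="cc8cc954a4">'

match = re.search(r'data-vc-public-nonce="([0-9a-f]+)"', page_html)
nonce = match.group(1) if match else None
print(nonce)  # cc8cc954a4
```

With `requests`, you would fetch the events page first, run the regex over `response.text`, and put the result into `data['_vcnonce']` before posting.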

Or with selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

dates = [item.text.strip() for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))) if len(item.text)]
textInfo = driver.find_elements(By.CSS_SELECTOR, '.vc_gitem-link')[::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements(By.CSS_SELECTOR, 'a + img')][::2]
titles = []
links = []

for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])

print(results)

driver.quit()

6 Comments

Thanks a lot. The first script returns: Empty DataFrame Columns: [title, date, link, imageLink] Index: []. The other one runs into errors. I guess I have to set up selenium first.
Thanks. I did the setup and changed the code above to use Firefox in lieu of Chrome. The script runs and the result looks like this: 0 Da wo der Pfeffi wächst ... alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... alando-palais.de/wp/wp-content/uploads... 2 Über 40 Party ... alando-palais.de/wp/wp-content/uploads... I have to check how that works, and why the dates are "..." and so on. Thanks a lot, the script is a great start.
I will have a look. Are you saying dates are all empty?
The selenium script returns: title ... imageLink 0 Da wo der Pfeffi wächst ... alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... alando-palais.de/wp/wp-content/uploads... 2 ... Über 40 Party ... alando-palais.de/wp/wp- 9 Uni Royal ... alando-palais.de/wp/wp-content/uploads... [10 rows x 4 columns]
That looks quite good. The BS4 script (first one of your post) doesn't return results
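As for the `...` in those printouts: that is almost certainly pandas truncating wide columns for display, not missing data; raising the display limits shows the full values. A quick sketch:

```python
import pandas as pd

# A row shaped like the scraper's output, with a deliberately long URL.
df = pd.DataFrame({'title': ['Penthouse Club Special: Maiwai & Friends'],
                   'imageLink': ['https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg']})

pd.set_option('display.max_colwidth', None)  # show full cell contents instead of '...'
pd.set_option('display.width', None)         # don't wrap to the terminal width
print(df)
```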

I'd rather recommend Selenium, to bypass any server-side restrictions.

Edited

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

2 Comments

What would a small starting script look like?
Thanks, this script shows exactly the same info as my starting script. I'll have to dig deeper into these techniques.
