
I'm trying to get all the events, plus additional metadata for those events, from this webpage: https://alando-palais.de/events

My problem is that the result (HTML) doesn't contain the information I'm looking for. I guess it is loaded afterwards by some PHP script, via this URL: 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

Any idea how to wait until the page is completely loaded, or what kind of method I have to use to get the event information?

This is my script right now :-) :

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])

Expected output would be something like the fields in this fragment of the fully rendered result:

<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">  <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top"><div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">   <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
    </div>
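Once the rendered HTML is in hand, the fields in a grid item like the fragment above can be picked out by class name. A minimal sketch with BeautifulSoup, using the class names from the fragment (the real page may nest things differently):

```python
from bs4 import BeautifulSoup

# The fragment from the expected output above, trimmed to the parts used here.
fragment = '''
<div class="vc_gitem-zone vc_gitem-is-link">
  <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends"
     title="Penthouse Club Special: Maiwai &#038; Friends"
     class="vc_gitem-link vc-zone-link"></a>
  <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg"
       class="vc_gitem-zone-img" alt="">
  <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019 </div>
</div>
'''

soup = BeautifulSoup(fragment, 'html.parser')
link = soup.select_one('a.vc_gitem-link')
event = {
    'title': link['title'],                                          # entity &#038; decodes to &
    'url': link['href'],
    'image': soup.select_one('img.vc_gitem-zone-img')['src'],
    'date': soup.select_one('.eventdatum').get_text(strip=True),
}
print(event)
```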
  • what's the expected output? One example perhaps as I'm unsure what you mean by metadata. Commented Mar 8, 2019 at 20:55
  • I've added some results, I'd like to extract. Commented Mar 8, 2019 at 21:01
  • if php requests are fired off by javascript, the results will not be available when the base page is loaded, you would have to render it to have the data calls made.. maybe use selenium to render the results and then get the final page out when it is done. Commented Mar 8, 2019 at 21:13
  • Yeah, I thought about using selenium or pyqt to emulate a "real" browser. Could you provide a few lines to get me started with selenium? Commented Mar 8, 2019 at 21:15
  • this has some info stackoverflow.com/questions/29404856/… Commented Mar 8, 2019 at 21:15

2 Answers


You could mimic the XHR POST made by the page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'
}

res = requests.post(url, data = data)
soup = BeautifulSoup(res.content, 'lxml')
dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = soup.select('.vc_gitem-link')[::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]
titles = []
links = []
for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])
print(results)
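If that request comes back empty, the likely culprit is the hard-coded `_vcnonce`, which WordPress rotates; scraping a fresh nonce out of the events page before posting should fix it. A minimal sketch with a regex (the `data-vc-public-nonce` attribute name is an assumption; inspect the grid element in the real page source to confirm which attribute carries the nonce):

```python
import re

# Hypothetical excerpt of the grid markup on the events page; the real
# attribute name may differ -- check the page source in your browser.
page_html = '<div class="vc_grid-container" data-vc-public-nonce="cc8cc954a4">'

match = re.search(r'data-vc-public-nonce="([0-9a-f]+)"', page_html)
nonce = match.group(1) if match else None
print(nonce)  # cc8cc954a4
```

With `requests`, you would fetch the events page first, run the regex over `response.text`, and put the result into `data['_vcnonce']` before posting.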

Or with selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

dates = [item.text.strip() for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))) if len(item.text)]
textInfo = driver.find_elements(By.CSS_SELECTOR, '.vc_gitem-link')[::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements(By.CSS_SELECTOR, 'a + img')][::2]
titles = []
links = []

for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))
results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)),columns = ['title', 'date', 'link', 'imageLink'])

print(results)

driver.quit()

6 Comments

Thanks a lot. The first script returns: Empty DataFrame Columns: [title, date, link, imageLink] Index: []. The other one runs into errors. I guess I have to set up selenium first.
Thanks. I did the setup and changed the code above to use Firefox in lieu of Chrome. The script runs and the result looks like this: 0 Da wo der Pfeffi wächst ... alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... alando-palais.de/wp/wp-content/uploads... 2 Über 40 Party ... alando-palais.de/wp/wp-content/uploads... I have to check how that works, and why the dates are "..." and so on. Thanks a lot, the script is a great start.
I will have a look. Are you saying dates are all empty?
The selenium script returns: title ... imageLink 0 Da wo der Pfeffi wächst ... alando-palais.de/wp/wp-content/uploads... 1 Vodka Vriday ... alando-palais.de/wp/wp-content/uploads... 2 ... Über 40 Party ... alando-palais.de/wp/wp- 9 Uni Royal ... alando-palais.de/wp/wp-content/uploads... [10 rows x 4 columns]
That looks quite good. The BS4 script (first one of your post) doesn't return results
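As for the `...` in those printouts: that is almost certainly pandas truncating wide columns for display, not missing data; raising the display limits shows the full values. A quick sketch:

```python
import pandas as pd

# A row shaped like the scraper's output, with a deliberately long URL.
df = pd.DataFrame({'title': ['Penthouse Club Special: Maiwai & Friends'],
                   'imageLink': ['https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg']})

pd.set_option('display.max_colwidth', None)  # show full cell contents instead of '...'
pd.set_option('display.width', None)         # don't wrap to the terminal width
print(df)
```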

I'd rather recommend Selenium, to bypass any server-side restrictions.

Edited

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

2 Comments

What would a small starting script look like?
Thanks, this script shows exactly the same info as my starting script. I'll have to dig deeper into these techniques.
