Web-scraping using python3

Question

I'm trying to scrape some product information from Amazon, but I don't know why my code doesn't work. Every time I run these lines, I get None as output. I'm using Visual Studio.

import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL?pf_rd_r=F2MMPNCJR5AQ4KP5C82P&pf_rd_p=ff59f7ef-650d-5e5a-9ee5-6fd80bb0e21d&pd_rd_r=12e6add2-54cd-44b1-bfa4-81c70ad68010&pd_rd_w=Lo5MD&pd_rd_wg=t2rFz&ref_=pd_gw_ri")
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title)
print(price)

Did you try checking the value of soup? Does it contain these elements?
– Rayan Ral, Commented Jun 9, 2020 at 19:12

0m3r · Accepted Answer · 2020-06-10 03:59:31Z

1

Andrej Kesely gave you the answer while I was typing, but to understand why this happens, just add this print line after the soup = ... line:

soup = BeautifulSoup(page.content,'html.parser')
print(soup.find_all("title"))
title = soup.find(id='productTitle')

This will print:

[<title dir="ltr">Amazon CAPTCHA</title>]

Amazon isn't "showing" the real page to your code; it is asking for a CAPTCHA instead.
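That check can be wrapped in a small helper. This is a minimal sketch, assuming the block page can be recognised by the word "CAPTCHA" in its <title>, as in the output above. The name is_captcha_page is hypothetical, and Amazon may change that page at any time:

```python
from bs4 import BeautifulSoup


def is_captcha_page(html):
    """Return True if the page's <title> mentions a CAPTCHA.

    Assumption: the block page is recognisable by its title, as in the
    output above. This is a heuristic, not an official Amazon signal.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return title is not None and "captcha" in title.get_text().lower()


# Offline demonstration using the title printed above; with a live
# response you would pass page.content instead.
blocked = '<html><head><title dir="ltr">Amazon CAPTCHA</title></head></html>'
print(is_captcha_page(blocked))  # True
```

Running this before looking up productTitle makes it clear whether a None result comes from a blocked request or from a wrong element id.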

edited Jun 10, 2020 at 3:59

0m3r

answered Jun 9, 2020 at 19:16

Jomarumu

Andrej Kesely · Accepted Answer · 2020-06-09 19:10:30Z

0

There are two problems:

1.) Send a User-Agent HTTP header. Without it, Amazon returns a CAPTCHA page.

2.) Use html5lib or lxml as the parser. html.parser has problems parsing this page.

import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html5lib') # or 'lxml'

title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title.get_text(strip=True))
print(price.get_text(strip=True))

Prints:

Xiaomi Mi Band 4 Activity Tracker,Monitor attività,Monitor frequenza cardiaca Monitoraggio Fitness, Bracciale Smartwatch con Schermo AMOLED a Colori 0,95, con iOS e Android (Versione Globale)
30,96 €
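Note that soup.find returns None when no element matches, so the two get_text calls above raise AttributeError if Amazon serves a CAPTCHA page or changes its markup. A hedged sketch of a guard follows; text_by_id is a hypothetical helper, and the sample HTML is a stand-in for a real response, not Amazon's actual markup:

```python
from bs4 import BeautifulSoup


def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None."""
    element = soup.find(id=element_id)
    if element is None:
        return None
    return element.get_text(strip=True)


# Stand-in HTML (an assumption for illustration, not Amazon's real markup).
sample = '<span id="productTitle">  Xiaomi Mi Band 4  </span>'
soup = BeautifulSoup(sample, "html.parser")

print(text_by_id(soup, "productTitle"))         # Xiaomi Mi Band 4
print(text_by_id(soup, "priceblock_ourprice"))  # None
```

Returning None instead of raising lets the caller decide whether a missing price is an error or just an unavailable product.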

answered Jun 9, 2020 at 19:10

Andrej Kesely

1 Comment

Turf Over a year ago

Thanks, man. I had to install lxml using sudo pip3 install lxml, then it worked.
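If it is unclear whether lxml or html5lib is installed, the parser can be chosen at run time. This is a rough sketch using only the standard library; pick_parser is a hypothetical name, and it relies on the fact that "lxml" and "html5lib" are both the importable package names and the parser names BeautifulSoup accepts:

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in 'html.parser'."""
    for name in preferred:
        # find_spec returns None when the package cannot be imported.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


print(pick_parser())
```

The result can be passed as the second argument to BeautifulSoup, for example BeautifulSoup(page.content, pick_parser()).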

Shreyas Sreenivas · Accepted Answer · 2020-06-09 19:07:55Z

0

Most modern websites, including Amazon, load page content dynamically using JavaScript. When you send a request with requests.get, you get only the initial HTML of the page, without the dynamically loaded content. You could use a library like Selenium to render such pages and then pass the page source to Beautiful Soup.
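A minimal sketch of that approach, assuming the selenium package and a working Firefox driver setup are available. Both function names are hypothetical, and whether this gets past Amazon's bot detection is not guaranteed:

```python
from bs4 import BeautifulSoup


def render_and_parse(driver, url):
    """Load url in a browser driver and parse the rendered HTML."""
    driver.get(url)
    return BeautifulSoup(driver.page_source, "html.parser")


def scrape_with_firefox(url):
    """Hypothetical wrapper; needs the selenium package and Firefox installed."""
    # Imported here so this module still loads when selenium is missing.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return render_and_parse(driver, url)
    finally:
        # Always close the browser, even if loading or parsing fails.
        driver.quit()
```

Calling scrape_with_firefox(url) returns a soup object that can be queried with soup.find(id='productTitle') just like in the question.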

answered Jun 9, 2020 at 19:07

Shreyas Sreenivas
