BeautifulSoup: Parse JavaScript dynamic content

Question

I am developing a Python web scraper with BeautifulSoup that parses "product listings" from this website and extracts some information for each listing (e.g., price, vendor, etc.). I can extract most of this information, except for one field (the product quantity), which seems to be hidden from the raw HTML. Looking at the webpage in my browser, what I see is (unid = units):

product_name       1 unid      $10.00

but the html for that doesn't show any integer value that I can extract. It shows this html text:

<div class="e-col5 e-col5-offmktplace ">
  <div class="kWlJn zYaQqZ gQvJw">&nbsp;</div> 
  <div class="imgnum-unid"> unid</div>
</div>

My question is how do I get this hidden content of e-col5 which stores the product quantity?

import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons")
soup = BeautifulSoup(page.content, 'html.parser')
vendor = soup.find_all('div', class_="estoque-linha", mp="2")
print(vendor[1].find(class_='e-col1').find('img')['title'])
print(vendor[1].find(class_='e-col2').find_all(class_='ed-simb')[1].string)
print(vendor[1].find(class_='e-col5'))

EDIT: "Hidden content" here means content that is dynamically updated by JavaScript.

It appears that the supposedly hidden content is actually dynamically updated with JavaScript.
– Luke, Commented Dec 25, 2018 at 18:54
What is the proper way to parse this type of content, @LukaszSalitra?
– delirium, Commented Dec 25, 2018 at 19:02
@delirium In the general case it's hard. In your specific case, you may want to look into the JavaScript to see what it's doing and basically re-implement it in your parser.
– rvs, Commented Dec 25, 2018 at 19:15

ewwink · Accepted Answer · 2018-12-25 20:30:52Z

2

The unid (quantity) is stored in a JavaScript array in the page source:

vetFiltro[0]=["e3724364",0,1,....];

The 1 (the third element) is the unid. You can extract it with a regex:

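# e-col5: the row id is "line_<vendor id>", e.g. line_e3724364 => e3724364
unitID = vendor[1].get('id').replace('line_', '')
# capture the third field of the vetFiltro entry that starts with this id
regEx = r'"%s",\d,(\d+)' % re.escape(unitID)
unit = re.search(regEx, page.text).group(1)
print(unit + ' unids')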
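
As a self-contained illustration of the same idea, the lookup can be wrapped in a small function that returns None when the id is not found instead of raising. This is only a sketch: the function name quantity_for_row and the sample string are made up for the example, and it assumes the quantity is always the third field of the array entry, as in the single line shown above.

```python
import re


def quantity_for_row(page_text, row_id):
    """Return the third field of the array entry starting with row_id, or None."""
    # "<id>", <digits>, <quantity>  (optional whitespace is tolerated)
    pattern = r'"%s",\s*\d+,\s*(\d+)' % re.escape(row_id)
    match = re.search(pattern, page_text)
    return int(match.group(1)) if match else None


# Hypothetical page fragment; only the first three fields mirror the answer above.
sample = 'vetFiltro[0]=["e3724364",0,1,5];vetFiltro[1]=["e1111111",0,12,5];'

print(quantity_for_row(sample, "e3724364"))
print(quantity_for_row(sample, "not-there"))
```

With the real page, page.text would be passed in place of sample.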

answered Dec 25, 2018 at 20:30

ewwink


4 Comments

delirium Over a year ago

Thanks for the help! How did you find that out? Can I process any other JavaScript fields like that (e.g., price)?

ewwink Over a year ago

Unfortunately, I can't find a way to get the price.

delirium Over a year ago

Thanks anyway. Could you still comment on how you found out about vetFiltro?

ewwink Over a year ago

Every vendor row has an ID like line_e3724364. I removed the line_ prefix and searched the page source for what was left, which is how I found it. And you're welcome.

Fabian · Accepted Answer · 2018-12-25 20:28:41Z

1

If you take a closer look, the unid is just an image in a div, and a CSS class shifts that image to the correct number.

For example unid 1:

.jLsXy {
    background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}

is the image containing numbers.

.gBpKxZ {
background-position: -424px -23px;
}

is the class for number 1

So find the CSS that matches each number and build a lookup table from it. That is the easy way, but not the best way.
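
A rough sketch of such a lookup table is below. It assumes that the background-position values stay tied to the same digits even if the class names change, which this answer does not confirm. The names RULE, POSITION_TO_DIGIT, class_positions and digit_for_classes are invented for the example, and only the "-424px -23px" entry comes from the CSS above; the remaining digits would have to be filled in by hand.

```python
import re

# Hypothetical table: background-position -> digit.
# Only this one entry is taken from the CSS shown in the answer.
POSITION_TO_DIGIT = {"-424px -23px": "1"}

# Matches rules of the form ".name { background-position: X Y; }"
RULE = re.compile(
    r'\.([A-Za-z0-9_-]+)\s*\{\s*background-position:\s*([^;}]+?)\s*;?\s*\}'
)


def class_positions(css_text):
    """Map each class name to its background-position string."""
    return {name: pos for name, pos in RULE.findall(css_text)}


def digit_for_classes(class_names, css_text):
    """Return the digit for the first class with a known position, or None."""
    positions = class_positions(css_text)
    for name in class_names:
        digit = POSITION_TO_DIGIT.get(positions.get(name))
        if digit is not None:
            return digit
    return None


css = """
.jLsXy {
    background-image: url(numbers.jpg);
}
.gBpKxZ {
background-position: -424px -23px;
}
"""
print(digit_for_classes(["kWlJn", "zYaQqZ", "gBpKxZ"], css))
```

Because the map is rebuilt from the CSS of each response, it does not rely on class names staying the same between page loads.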

Edit: It seems the position (class) changes each time the page is reloaded, so it is harder to match the number with the image :( The number 1 could be taken from many places.

Edit 2: I was using Chrome DevTools. If you inspect the unid, you will find the CSS for each class as well. After checking the image URL, it was clear.

edited Dec 25, 2018 at 20:28

answered Dec 25, 2018 at 20:02

Fabian


2 Comments

delirium Over a year ago

Thanks for your help :) ! How did you discover that the number is an image?

Fabian Over a year ago

@delirium check second edit :) if you need more explanation just ask me :)

0x48piraj · Accepted Answer · 2018-12-25 21:04:37Z

@ewwink found a way to extract the unid but was unable to extract the prices. This answer is an attempt to extract the prices.

Target div snippet:

<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
                            <div class="e-col4 e-col4-offmktplace">
                                <img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>

                            </div>
                        <div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz">&nbsp;</div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>

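If we look closely, the prices appear as plain text after "R$" inside each row, so we can pull them out with a regular expression:

for item in soup.find_all('div', {"id": re.compile('^line')}):
    print(re.findall(r"R\$ (.*?)</div>", str(item), re.DOTALL))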

Output [truncated]:

['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]

This extracts large chunks that contain the prices, but it also returns nothing for several items (the empty lists above).
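
To turn chunks like these into numbers, a tighter pattern can capture only the amount. The sketch below is not part of the original approach: parse_prices and PRICE are made-up names, and it assumes Brazilian formatting (comma as the decimal separator, optional dot for thousands). The sample row is copied from the target div snippet above.

```python
import re
from decimal import Decimal

# "R$" followed by an amount such as 0,85 or 1.234,56
PRICE = re.compile(r'R\$\s*(\d{1,3}(?:\.\d{3})*,\d{2})')


def parse_prices(fragment):
    """Return every R$ amount found in an HTML fragment as a Decimal."""
    return [
        Decimal(amount.replace('.', '').replace(',', '.'))
        for amount in PRICE.findall(fragment)
    ]


row = ('<div class="e-col3"><font color="gray" class="mob-preco-desconto">'
       '<s>R$ 1,00</s></font><br>R$ 0,85</div>')
print(parse_prices(row))
```

For this row the first value is the struck-through original price and the second is the discounted one. Rows whose price is rendered as an image would still yield an empty list.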

To get all the data, we can combine Selenium with an OCR API. The following snippet captures an element of interest as an image:

from io import BytesIO

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
# element = fox.find_element(By.ID, 'line_e3724364')
# find_element returns a single element; find_elements would return a list with no .location
element = fox.find_element(By.TAG_NAME, 's')
location = element.location
size = element.size
png = fox.get_screenshot_as_png()  # screenshot of the page as PNG bytes
fox.quit()

im = Image.open(BytesIO(png))  # open the screenshot in memory with PIL

left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']

im = im.crop((left, top, right, bottom))  # crop to the element's bounding box
im.save('screenshot.png')  # save the cropped image

Adapted from https://stackoverflow.com/a/15870708.
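
One thing that may need care with this kind of crop is that element coordinates are reported in CSS pixels, while on high-DPI displays the screenshot can be larger by the device pixel ratio. The helper below is only a sketch of how the crop box could be scaled. crop_box and its scale parameter are names made up here, and whether scaling is needed at all depends on the browser and display.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) for PIL's Image.crop.

    location and size are the dicts returned by a Selenium element;
    scale is the assumed ratio of screenshot pixels to CSS pixels.
    """
    left = int(location['x'] * scale)
    top = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    bottom = int((location['y'] + size['height']) * scale)
    return (left, top, right, bottom)


print(crop_box({'x': 10, 'y': 20}, {'width': 30, 'height': 5}))
print(crop_box({'x': 10, 'y': 20}, {'width': 30, 'height': 5}, scale=2.0))
```

The result can be passed straight to im.crop(...) in place of the four separate variables.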

We can loop over the elements, as we did above with re.findall(), and save one image per element. Once we have all the images, we can use OCR Space to extract the text. Here's a quick snippet:

import requests


def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()

e = ocr_space_file(filename='1.png')

print(e) # prints JSON

JSON response from ocr.space for 1.png:

{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}

This gives us "ParsedText": "RS 0',85 \r\n", which looks like a slightly garbled reading of the price R$ 0,85.
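
To pull the text out of that response in code, something along these lines should work. It is a minimal sketch: parsed_text and normalize_price are hypothetical helpers, the field names are the ones visible in the response above, and the cleanup rule is a guess based on this single sample, so it may well mangle other OCR results.

```python
import json


def parsed_text(response_json):
    """Return the joined ParsedText of an OCR response, or None on error."""
    data = json.loads(response_json)
    if data.get("IsErroredOnProcessing"):
        return None
    results = data.get("ParsedResults") or []
    return "".join(r.get("ParsedText", "") for r in results).strip()


def normalize_price(text):
    """Heuristic cleanup (assumption): 'RS' was meant as 'R$', apostrophes are noise."""
    return text.replace("RS", "R$").replace("'", "")


# Trimmed copy of the response shown above.
sample = r'''{"ParsedResults":[{"ParsedText":"RS 0',85 \r\n","ErrorMessage":""}],"OCRExitCode":1,"IsErroredOnProcessing":false}'''

raw = parsed_text(sample)
print(normalize_price(raw))
```

The string returned by ocr_space_file() can be passed to parsed_text() directly.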
