I am developing a Python web scraper with BeautifulSoup that parses product listings from this website and extracts some information for each listing (e.g., price, vendor). I can extract most of this information, except for one field (the product quantity), which seems to be hidden from the raw HTML. Looking at the webpage in my browser, what I see is (unid = units):

product_name       1 unid      $10.00 

but the HTML for that row doesn't contain any integer value that I can extract. It shows this markup instead:

<div class="e-col5 e-col5-offmktplace ">
  <div class="kWlJn zYaQqZ gQvJw">&nbsp;</div> 
  <div class="imgnum-unid"> unid</div>
</div>

My question is: how do I get the hidden content of e-col5, which stores the product quantity?

import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons")
soup = BeautifulSoup(page.content, 'html.parser')
vendor = soup.find_all('div', class_="estoque-linha", mp="2")
print(vendor[1].find(class_='e-col1').find('img')['title'])
print(vendor[1].find(class_='e-col2').find_all(class_='ed-simb')[1].string)
print(vendor[1].find(class_='e-col5'))

EDIT: "Hidden content" here means content that is dynamically updated by JavaScript.

  • It appears that the supposedly hidden content is actually dynamically updated with JavaScript. Commented Dec 25, 2018 at 18:54
  • What is the proper way to parse this type of content @LukaszSalitra? Commented Dec 25, 2018 at 19:02
  • @delirium In the general case it's hard. In your specific case, you may want to look into the JavaScript to see what it's doing and basically re-implement it in your parser. Commented Dec 25, 2018 at 19:15

3 Answers

The quantity (unid) is stored in a JavaScript array in the page source:

vetFiltro[0]=["e3724364",0,1,....];

The 1 in the third position is the quantity; you can extract it with a regex:

# e-col5
unitID = vendor[1].get('id').replace('line_', '') # line_e3724364 => e3724364
regEx = r'"%s",\d,(\d+)' % unitID
unit = re.search(regEx, page.text).group(1)
print(unit + ' unids')
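The same lookup can be wrapped in a small function that escapes the ID and returns None when nothing matches. This is only a sketch: the function name extract_quantity and the sample string are made up for illustration, and the entry layout (ID, one integer, then the quantity) is assumed from the single vetFiltro line shown above.

```python
import re


def extract_quantity(page_text, line_id):
    """Return the quantity stored for line_id in a vetFiltro entry, or None."""
    # Assumed entry layout: ["<id>", <integer>, <quantity>, ...]
    pattern = r'"%s",\d+,(\d+)' % re.escape(line_id)
    match = re.search(pattern, page_text)
    return int(match.group(1)) if match else None


# Made-up page fragment and ID, for illustration only
sample = 'vetFiltro[0]=["e0000001",0,3,9];'
print(extract_quantity(sample, "e0000001"))
print(extract_quantity(sample, "e9999999"))
```

With the question's variables, the call would be extract_quantity(page.text, vendor[1].get('id').replace('line_', '')).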

4 Comments

Thanks for the help! How did you find that out? Can I process any other JavaScript fields like that (e.g., price)?
Unfortunately, I can't find a way to get the price.
Thanks anyway. Could you still comment on how you found out about vetFiltro?
Every vendor has an ID like line_e3724364; with the line_ prefix removed, I found it in the page source. And you're welcome.

If you take a closer look, the unid is just an image: a div uses a picture of digits as its background, and a CSS class shifts that background to the correct number.

For example, for unid 1:

.jLsXy {
    background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}

is the image containing the numbers, and

.gBpKxZ {
    background-position: -424px -23px;
}

is the class that selects the number 1.

So you can match each CSS class to the number it displays and build a lookup table. That is the easy way, but not the best way.
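A lookup table of this kind could be sketched as follows. The names class_offsets, digit_for_class and OFFSET_TO_DIGIT are hypothetical, and the table holds only the one entry given above (-424px -23px for the number 1); the remaining digits would have to be read off the image by hand. It also assumes each rule contains nothing but a background-position declaration.

```python
import re

# Matches rules such as ".gBpKxZ { background-position: -424px -23px; }"
RULE = re.compile(
    r"\.([A-Za-z][\w-]*)\s*\{\s*background-position:\s*"
    r"(-?\d+)px\s+(-?\d+)px\s*;?\s*\}"
)


def class_offsets(css_text):
    """Map each class name to its (x, y) background-position in pixels."""
    return {name: (int(x), int(y)) for name, x, y in RULE.findall(css_text)}


# Hypothetical table: only the entry for 1 comes from the example above
OFFSET_TO_DIGIT = {(-424, -23): 1}


def digit_for_class(css_text, class_name):
    """Return the digit a class displays, or None if it is not in the table."""
    offset = class_offsets(css_text).get(class_name)
    return OFFSET_TO_DIGIT.get(offset)


css = ".gBpKxZ {\nbackground-position: -424px -23px;\n}"
print(digit_for_class(css, "gBpKxZ"))
```

Keying the table on the pixel offset rather than on the class name only helps if the offsets stay the same between page loads, which is not confirmed here.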

Edit: It seems the position (class) changes each time the page is reloaded, so it is harder to match a number to the image :( The number 1 could be taken from many places.

Edit 2: I was using Chrome DevTools. If you inspect the unid element, you will find the CSS for each class as well. After checking the image URL, it was clear.

2 Comments

Thanks for your help :) ! How did you discover that the number is an image?
@delirium check second edit :) if you need more explanation just ask me :)

@ewwink found a way to extract the unid but was unable to extract the prices. In this answer I try to extract the prices.

Target div snippet:

<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
                            <div class="e-col4 e-col4-offmktplace">
                                <img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>

                            </div>
                        <div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz">&nbsp;</div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>

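If we look closely, we can see that each price follows "R$" inside the listing div, so a regex can capture it:

for item in soup.find_all('div', {"id": re.compile('^line')}):
    # capture everything between "R$ " and the next closing </div>
    print(re.findall(r"R\$ (.*?)</div>", str(item), re.DOTALL))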

Output [truncated]:

['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]

This extracts large chunks that contain the prices, but it also skips several items (the empty lists above).
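For the listings where the price is present as text, a tighter pattern returns the amounts themselves instead of chunks of markup. This is a minimal sketch: parse_prices is a hypothetical helper, and it assumes the Brazilian number format seen in the snippet (comma as the decimal separator, optional dots for thousands).

```python
import re
from decimal import Decimal

# "R$" followed by an amount such as 0,85 or 1.234,56
PRICE = re.compile(r"R\$\s*(\d[\d.]*,\d{2})")


def parse_prices(html_fragment):
    """Return every R$ amount in the fragment as a Decimal, in document order."""
    return [
        Decimal(raw.replace(".", "").replace(",", "."))
        for raw in PRICE.findall(html_fragment)
    ]


# The e-col3 cell from the target div snippet above
cell = ('<div class="e-col3"><font color="gray" class="mob-preco-desconto">'
        '<s>R$ 1,00</s></font><br>R$ 0,85</div>')
print(parse_prices(cell))
```

For this cell the first amount is the struck-through price and the second is the discounted one. Listings for which the regex finds no price text still come back empty.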

To get all the data, we can combine Selenium with an OCR API. The following snippet captures an element of interest:

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
from io import BytesIO

fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
# element = fox.find_element(By.ID, 'line_e3724364')
element = fox.find_element(By.TAG_NAME, 's') # first <s> element; find_elements() returns a list, which has no .location
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # returns a screenshot of the current window as PNG bytes
fox.quit()

im = Image.open(BytesIO(png)) # uses PIL library to open image in memory

left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']


im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image

Adapted from https://stackoverflow.com/a/15870708.
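When several elements are cropped from one screenshot, the box arithmetic can be factored out and each crop named after the listing it came from. The helper names crop_box and crop_jobs are hypothetical, and the sketch assumes that screenshot pixels line up one-to-one with the coordinates Selenium reports, which may not hold on high-DPI displays.

```python
def crop_box(location, size):
    """Return (left, top, right, bottom), the tuple PIL's Image.crop() expects."""
    left = location['x']
    top = location['y']
    return (left, top, left + size['width'], top + size['height'])


def crop_jobs(elements):
    """Turn (name, location, size) triples into (filename, box) pairs."""
    return [("%s.png" % name, crop_box(loc, size)) for name, loc, size in elements]


# Made-up coordinates; in practice they come from element.location and element.size
jobs = crop_jobs([("line_e3724364", {'x': 10, 'y': 20}, {'width': 30, 'height': 5})])
print(jobs)

# With the screenshot opened as `im` (see the snippet above):
# for filename, box in jobs:
#     im.crop(box).save(filename)
```

Using the row's id as the file name keeps each image tied to its listing.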

We can iterate as we did above with re.findall() to save all the images. Once we have all the images, we can use OCR.space to extract the text. Here's a quick snippet:

import requests


def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()

e = ocr_space_file(filename='1.png')

print(e) # prints JSON

1.png: [image not preserved]

JSON response from ocr.space :

{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}

The recognized text is in "ParsedText": "RS 0',85 \r\n". Note the OCR noise: "RS" in place of "R$" and a stray apostrophe.
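The response can be decoded with the json module and the recognized text cleaned up before use. This is a sketch only: parsed_text and price_from_ocr are hypothetical helpers, the sample is a trimmed copy of the response above, and the cleanup rule (keep digits and the comma, drop everything else) is a heuristic that suits this one example rather than OCR output in general.

```python
import json
import re
from decimal import Decimal


def parsed_text(response_json):
    """Return ParsedText of the first result, or None if OCR reported an error."""
    data = json.loads(response_json)
    if data.get("IsErroredOnProcessing") or not data.get("ParsedResults"):
        return None
    return data["ParsedResults"][0]["ParsedText"]


def price_from_ocr(text):
    """Heuristic: keep digits and the decimal comma, drop everything else."""
    cleaned = re.sub(r"[^\d,]", "", text)
    if not re.fullmatch(r"\d+,\d{2}", cleaned):
        return None  # does not look like a price after cleanup
    return Decimal(cleaned.replace(",", "."))


# Trimmed copy of the response shown above
sample = r"""{"ParsedResults":[{"ParsedText":"RS 0',85 \r\n","ErrorMessage":""}],"OCRExitCode":1,"IsErroredOnProcessing":false}"""
text = parsed_text(sample)
print(repr(text))
print(price_from_ocr(text))
```

Returning None for anything that does not look like a price after cleanup makes bad OCR results easy to spot and retry.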

1 Comment

Nice work! How do I associate images with each listing?
