
I am trying to scrape an interactive graph from a website. The script below works and gathers the text from the chart, but it is quite slow: for the tooltip text to appear, the cursor has to hover over the graph at certain positions. Does anyone have any suggestions on how to make it more efficient?

Right now, with each offset movement, the cursor replays all the previous movements before it progresses to the next one (for example, it goes 0 to 5, then instead of progressing from 5 to 10 it goes back to 0, then 5, then 10). Does anyone know how to bypass that?

# set the pace at which the cursor will move and the limit it will move to
# (which should be the current date or the x-axis limit of the image)

# set the limit to be current date
limit = full_length
pace = 5
count = 0

while count <= limit:
    value = driver.find_element_by_class_name('highcharts-tooltip').text
    date_price = value.split("\n")
    date = date_price[0]
    price = date_price[1].split(": ")
    price = price[1]
    # take values at current point and add to dictionary
    dp = {'date': date,
         'price': price }
    archived_prices.append(dp)
    # move to the next date 
    action.move_by_offset(pace, 0).perform()
    # set up a counter to figure out when we will reach the limit
    count = count + pace
    print(count)
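Independent of the hover speed, the tooltip parsing inside the loop can be pulled into a small helper, which makes the splitting logic easier to test on its own. This is a sketch, assuming the tooltip text looks like a date on the first line and `Price: <value>` on the second (adjust the separators if the real tooltip differs):

```python
def parse_tooltip(text):
    """Split a Highcharts tooltip string into a date/price record.

    Assumes the first line is the date and the second looks like
    'Price: <value>' -- adjust if the real tooltip differs.
    """
    date_line, price_line = text.split("\n", 1)
    price = price_line.split(": ", 1)[1]
    return {"date": date_line, "price": price}

print(parse_tooltip("Apr 1, 2021\nPrice: 358"))  # {'date': 'Apr 1, 2021', 'price': '358'}
```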

NEW CODE:

# adding new column with complete url for api call
full_urls = []

for value in dataframe['urlKeys']:
    full = 'https://stockx.com/api/products/'+value+'?includes=market,360&currency=EUR&country=IT'
    full_urls.append(full)
    
dataframe['urlFull'] = full_urls
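As a side note, pandas can build that column without an explicit loop: string concatenation broadcasts across a whole Series. A sketch with a toy frame standing in for the real scraped data (assuming `urlKeys` holds plain strings):

```python
import pandas as pd

# toy frame standing in for the real scraped data
dataframe = pd.DataFrame({"urlKeys": ["abc-123", "def-456"]})

# string concatenation broadcasts across the whole column
dataframe["urlFull"] = (
    "https://stockx.com/api/products/"
    + dataframe["urlKeys"]
    + "?includes=market,360&currency=EUR&country=IT"
)
```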


def get_shoe_info(url_list):
    
    for url in url_list:

        headers = {
            "accept-encoding": "gzip, deflate, br",
            "sec-fetch-mode": "cors",
            "sec=fetch-site": "same-origin",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
            "x-requested-with": "XMLHttpRequest"
        }

        response = requests.get(url, headers=headers)
        response.raise_for_status()
        
        product = response.json()["Product"]

        for p in product:
            id_num = p["id"]
            brand = p["brand"]
            colorway = p["colorway"]
            release_date = p["releaseDate"]
            retail_price = p["retailPrice"]
            shoe_name = p["shoe"]
            volatility = p["market"]["volatility"]
            change_percentage = p["market"]["changePercentage"]
            gender = p["gender"]
            print(f'ID: {id_num}\n'
              f'Brand: {brand}\n'
              f'Colorway: {colorway}\n'
              f'Release Date: {release_date}\n'
              f'Retail Price: {retail_price}\n'
              f'Shoe Name: {shoe_name}\n'
              f'Volatility: {volatility}\n'
              f'Change Percentage: {change_percentage}\n'
              f'Gender: {gender}')
            #print(p)
        return 0
    

if __name__ == "__main__":
    import sys
    sys.exit(get_shoe_info(full_urls))

I still have some difficulty understanding how to pass in variables, so I am not sure I did this correctly. The first part takes the URL keys for all the shoes and creates a URL list to iterate through. Then I try to pass the list into the get_shoe_info function. The error I get is "TypeError: string indices must be integers". When I investigated by adding print(p) to look at the path, I was only getting strings of the keys. I'm not sure what to do to get the values that I want.

I have added everything to (my github) in case you need to see anything else.

  • Could you share the URL of the page? There's a chance you may not have to use Selenium at all. Commented May 1, 2021 at 13:22
  • @PaulM. I am trying to scrape the interactive chart for past prices at the bottom stockx.com/adidas-yeezy-boost-700-bright-blue Commented May 1, 2021 at 13:35

1 Answer


Selenium is overkill for this. When I visited the page in my browser, I logged my network traffic, and I saw that my browser made several XHR (XmlHttpRequest) HTTP GET requests to REST APIs. One of them has the endpoint api/products/.../chart, which returns JSON containing all the chart information you're trying to scrape. All you need to do is imitate that HTTP GET request. No Selenium required. I simply copied all request headers and query-string parameters into some dictionaries (headers and params). I trimmed the request headers down to the bare minimum for what's required before the API complains and says my request is ill-formed.

I changed the accept-encoding header to accept only those encoding formats which the requests library supports natively, because by default the API wants to return Brotli-encoded JSON.

I'm currently located in Germany, which is why the query-string parameters say "currency": "EUR" and "country": "DE". As a result, the response will contain price information in euros, but you should be able to change these key-value pairs to suit your needs (I'm guessing "USD" and "US" should work). Also, it's important to note that the response contains a list of "pairs", one for each XY-coordinate / data point on the graph. The X-component is time (expressed as a unix timestamp measured in milliseconds), the Y-component is the price in euros (again, because of my query-string parameters).
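To make the timestamp handling concrete: each X-value is milliseconds since the Unix epoch, so it needs to be divided by 1000 before conversion (the generator below does the same with `int(timestamp) // 1000`). A minimal demonstration with one value from the response:

```python
from datetime import datetime

# one X-value from the "pairs": milliseconds since the Unix epoch
ts_ms = 1617284187000

# drop the milliseconds, then convert to a UTC datetime
date = datetime.utcfromtimestamp(ts_ms // 1000)
print(date)  # 2021-04-01 13:36:27
```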

Below, I define a generator get_price_history which makes a request to the API, and then yields all the data points. Each unix timestamp is first converted to a datetime.datetime object:

def get_price_history():

    import requests
    from datetime import datetime

    url = "https://stockx.com/api/products/f27be8fd-2e05-4caa-a70e-fe787aa6283e/chart"

    params = {
        "start_date": "all",
        "end_date": "2021-05-01",
        "intervals": "100",
        "format": "highstock",
        "currency": "EUR",
        "country": "DE"
    }

    headers = {
        "accept-encoding": "gzip, deflate",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
        "x-requested-with": "XMLHttpRequest"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    for timestamp, price in response.json()["series"][0]["data"]:
        date = datetime.utcfromtimestamp(int(timestamp) // 1000)
        yield date, price


def main():

    for date, price in get_price_history():
        print(f"[{date}]: €{price}")
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

[2021-04-01 13:36:27]: €358
[2021-04-01 20:48:41]: €358
[2021-04-02 04:00:55]: €358
[2021-04-02 11:13:10]: €401
[2021-04-02 18:25:24]: €401
[2021-04-03 01:37:38]: €401
[2021-04-03 08:49:53]: €410
[2021-04-03 16:02:07]: €442
[2021-04-03 23:14:22]: €422
[2021-04-04 06:26:36]: €422
[2021-04-04 13:38:50]: €422
[2021-04-04 20:51:05]: €414
[2021-04-05 04:03:19]: €414
[2021-04-05 11:15:33]: €462
[2021-04-05 18:27:48]: €500
...

If you want to know more about how I logged my network traffic and found the API URL, request headers and query-string parameters, take a look at this other answer I posted, where I go more in depth.


EDIT - Thanks for sharing your code. The problem is in your get_shoe_info function, where you do:

product = response.json()["Product"]

for p in product:
    id_num = p["id"]
    brand = p["brand"]
    ...

The issue is that product is a dictionary, so when you do for p in product:, you are iterating over the keys of that dictionary. The keys are of course strings, and so each p will be one string, and p["id"] will raise a TypeError. Effectively, what you've written is equivalent to "id"["id"], "brand"["brand"], etc. The for-loop is unnecessary - so the solution is to simply remove the for-loop. What you are calling p should actually be the product dictionary:

product = response.json()["Product"]

id_num = product["id"]
brand = product["brand"]
colorway = product["colorway"]
release_date = product["releaseDate"]
retail_price = product["retailPrice"]
shoe_name = product["shoe"]
volatility = product["market"]["volatility"]
change_percentage = product["market"]["changePercentage"]
gender = product["gender"]
print(f'ID: {id_num}\n'
  f'Brand: {brand}\n'
  f'Colorway: {colorway}\n'
  f'Release Date: {release_date}\n'
  f'Retail Price: {retail_price}\n'
  f'Shoe Name: {shoe_name}\n'
  f'Volatility: {volatility}\n'
  f'Change Percentage: {change_percentage}\n'
  f'Gender: {gender}')

Output (for a single URL):

ID: cfa4ef16-7dec-4ef5-9318-3793c2c8546d
Brand: Louis Vuitton
Colorway: Red
Release Date: 2009-07-01
Retail Price: 870
Shoe Name: Louis Vuitton Don
Volatility: 0
Change Percentage: 0
Gender: men
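To see the failure in isolation, here is a self-contained toy example (the small dict stands in for the API's "Product" object): iterating a dict yields its keys, which are strings, so indexing one of them with another string raises exactly the TypeError you saw.

```python
product = {"id": "abc-123", "brand": "Nike"}

# iterating a dict yields its KEYS, which are strings
for p in product:
    print(p)          # "id", then "brand"

# so p["id"] inside the loop is effectively "id"["id"]:
try:
    "id"["id"]
except TypeError as exc:
    print(exc)        # string indices must be integers ...
```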

7 Comments

AWESOME. Thank you so much for the explanation as well. You have saved me a ton of time.
@GabbyV. Glad I could help. Let me know if you have any follow-up questions.
Hi I have another question.. So I have scraped the url keys from the website for each pair of shoes. I am now trying to adjust what you wrote to work with a list of urls to iterate through rather than just the one url. I was trying to pass the list into the function, but I am getting an error saying "IndexError: string index out of range". Is this the way I should be going about this? If you have time I'd be happy to share the code with you.
@GabbyV. Sure, I think the easiest would be for you to edit your original post with the new, up-to-date code. I'd be happy to take a look.
Sounds good! I've updated it and I attached my github with the notebooks incase you want to see more
