I'm new to Python and I'm trying to use BeautifulSoup to extract some data from a variable defined in a script.

data = soup.find_all('script', type='text/javascript')
print(data[0])

<script type="text/javascript">
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
</script>

Do you know an easy way to extract the 'productid' and 'productname' from the myvar variable?

3 Answers

1

There are two ways: easy and wrong, or not quite as easy but correct.

I'm not going to recommend the easy way to you. The correct way is to use a JavaScript parser. For modern JavaScript, esprima is a good choice. There is an interactive online demo, and it is also available as a Python module.

import esprima

# script body as extracted from beautifulsoup
script_text = """
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
"""

tokens = esprima.tokenize(script_text)

In this simple script there is not a lot going on. The list of raw tokens would be enough to get to the values you want. It looks like this:

[
    {
        "type": "Keyword",
        "value": "var"
    },
    {
        "type": "Identifier",
        "value": "myvar"
    },
    {
        "type": "Punctuator",
        "value": "="
    },
    {
        "type": "Punctuator",
        "value": "{"
    },
    {
        "type": "Identifier",
        "value": "productid"
    },
    {
        "type": "Punctuator",
        "value": ":"
    },
    {
        "type": "String",
        "value": "\"101\""
    },
    {
        "type": "Punctuator",
        "value": ","
    },
    {
        "type": "Identifier",
        "value": "productname"
    },
    {
        "type": "Punctuator",
        "value": ":"
    },
    {
        "type": "String",
        "value": "\"Abc\""
    },
    {
        "type": "Punctuator",
        "value": ","
    },
    {
        "type": "Punctuator",
        "value": "}"
    },
    {
        "type": "Punctuator",
        "value": ";"
    }
]

Iterate the list and pick the values you need.

token_iterator = iter(tokens)

for token in token_iterator:
    # tokens returned by the Python package expose their fields as attributes
    if token.type == "Identifier" and token.value == "productname":
        # skip the ":" token; the one after it holds the associated value
        next(token_iterator)
        value_token = next(token_iterator)
        # raw source text, surrounding quotes included: '"Abc"'
        productname = value_token.value
        break
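
To collect every key/value pair rather than a single one, the same idea can be applied to the whole list. The following is only a sketch: it works on plain dicts hand-copied from the token list shown above (so it runs without esprima), the helper name `collect_string_properties` is made up for illustration, and it assumes each JavaScript string literal is also a valid Python string literal.

```python
import ast

# Hand-copied from the token list shown above, as plain dicts.
TOKENS = [
    {"type": "Keyword", "value": "var"},
    {"type": "Identifier", "value": "myvar"},
    {"type": "Punctuator", "value": "="},
    {"type": "Punctuator", "value": "{"},
    {"type": "Identifier", "value": "productid"},
    {"type": "Punctuator", "value": ":"},
    {"type": "String", "value": "\"101\""},
    {"type": "Punctuator", "value": ","},
    {"type": "Identifier", "value": "productname"},
    {"type": "Punctuator", "value": ":"},
    {"type": "String", "value": "\"Abc\""},
    {"type": "Punctuator", "value": ","},
    {"type": "Punctuator", "value": "}"},
    {"type": "Punctuator", "value": ";"},
]


def collect_string_properties(tokens):
    """Return {key: value} for every Identifier, ':', String triple (hypothetical helper)."""
    found = {}
    # Slide a window of three consecutive tokens over the list.
    for key, colon, value in zip(tokens, tokens[1:], tokens[2:]):
        if (
            key["type"] == "Identifier"
            and colon["type"] == "Punctuator"
            and colon["value"] == ":"
            and value["type"] == "String"
        ):
            # Token values keep their surrounding quotes; literal_eval strips them.
            # Assumption: the JS string literal is also a valid Python literal.
            found[key["value"]] = ast.literal_eval(value["value"])
    return found


properties = collect_string_properties(TOKENS)
print(properties)
```

Because a flat token list carries no context, this would also pick up an `identifier : "string"` sequence from a nested object or a ternary expression, so it is only safe for scripts as simple as this one.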

For more complex situations, parsing the script into a tree and walking the tree might become necessary.

tree = esprima.parse(script_text)

The tree is more complex (you can view it on the interactive page), but in exchange it carries all the context information that is missing from the plain token list. You would then use the visitor pattern to walk this tree to a specific place. The Python package includes an example of how to use the visitor pattern, if you're interested.
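
As a rough illustration of walking the tree, here is a sketch that uses a plain recursive walk instead of the package's visitor class. The `TREE` dict is a hand-written stand-in for the parsed script in the ESTree shape (nested dicts and lists), so the example runs without esprima. The helper names `iter_nodes` and `object_literal_of` are hypothetical.

```python
# Hand-written stand-in for the parsed script, in ESTree shape.
TREE = {
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "kind": "var",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {"type": "Identifier", "name": "myvar"},
                    "init": {
                        "type": "ObjectExpression",
                        "properties": [
                            {
                                "type": "Property",
                                "key": {"type": "Identifier", "name": "productid"},
                                "value": {"type": "Literal", "value": "101"},
                            },
                            {
                                "type": "Property",
                                "key": {"type": "Identifier", "name": "productname"},
                                "value": {"type": "Literal", "value": "Abc"},
                            },
                        ],
                    },
                }
            ],
        }
    ],
}


def iter_nodes(node):
    """Yield every dict node in the tree, depth first."""
    if isinstance(node, dict):
        yield node
        for child in node.values():
            yield from iter_nodes(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_nodes(child)


def object_literal_of(tree, variable_name):
    """Return the literal properties assigned to variable_name, or None (hypothetical helper)."""
    for node in iter_nodes(tree):
        if node.get("type") != "VariableDeclarator":
            continue
        if node["id"].get("name") != variable_name:
            continue
        init = node.get("init") or {}
        if init.get("type") != "ObjectExpression":
            continue
        # Keep only simple `identifier: literal` properties.
        return {
            prop["key"]["name"]: prop["value"]["value"]
            for prop in init["properties"]
            if prop.get("type") == "Property"
            and prop["key"]["type"] == "Identifier"
            and prop["value"]["type"] == "Literal"
        }
    return None


myvar = object_literal_of(TREE, "myvar")
print(myvar)
```

Unlike the token scan, this only looks at the object assigned to the named variable, which is the kind of context the token list cannot provide.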

0

For a simple way, I would use a regular expression:

import re

.....
data = soup.find_all('script', type='text/javascript')
productid = re.search(r'productid:\s*"(.*?)"', data[0].text).group(1)
print(productid)
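
A lazy `(.*?)` stops at the first double quote it meets, including an escaped one inside the value. As a sketch only, the pattern below accepts backslash escapes inside the quotes. The helper name `find_value` and the sample product name are made up for illustration, and the returned text is raw (escape sequences are not decoded).

```python
import re

# Sample script text with an escaped quote inside the product name (made-up data).
script_text = r'''
  var myvar = {
    productid: "101",
    productname: "5\" disk",
  };
'''

# A quoted value is a run of ordinary characters or backslash escapes.
PATTERN = r'\b{key}\s*:\s*"((?:[^"\\]|\\.)*)"'


def find_value(text, key):
    """Return the raw text between the quotes for key, or None (hypothetical helper)."""
    match = re.search(PATTERN.format(key=re.escape(key)), text)
    return match.group(1) if match else None


# The lazy pattern cuts the value short at the escaped quote.
lazy = re.search(r'productname:\s*"(.*?)"', script_text).group(1)

print(repr(lazy))
print(repr(find_value(script_text, "productname")))
```

This is still pattern matching rather than parsing: it assumes double-quoted values and would also match the key name inside a comment or another string.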

1 Comment

This will of course fail when there is an embedded quote in the product name.
0

Parse

from bs4 import BeautifulSoup

script_data='''
<script type="text/javascript">
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
</script>
'''
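# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(script_data, 'html.parser')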

soup.script.string holds the contents of the script tag as a string. You can call split() on that string and pick the values by position:

soup.script.string.split()
Output:
['var',
 'myvar',
 '=',
 '{',
 'productid:',
 '"101",',
 'productname:',
 '"Abc",',
 '};']

product_id:

soup.script.string.split()[5].split('"')[1]
Output:
'101'

product_name:

soup.script.string.split()[7].split('"')[1]
Output:
'Abc'
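
Positional indexes depend on where the whitespace falls, so a value containing a space shifts or truncates the result. The sketch below shows that effect and a line-based variant that splits on the first colon instead. It is not a parser: it assumes one property per line and double-quoted values, and the helper name `read_properties` and the sample product name are made up for illustration.

```python
# Made-up sample where the product name contains a space.
script_text = '''
  var myvar = {
    productid: "101",
    productname: "Abc Def",
  };
'''

# Whitespace splitting breaks the name into two pieces.
parts = script_text.split()
truncated = parts[7].split('"')[1]


def read_properties(text):
    """Return {key: value} for lines shaped like key: "value", (hypothetical helper)."""
    props = {}
    for line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, rest = line.strip().partition(":")
        if not sep:
            continue
        value = rest.strip().rstrip(",").strip()
        # Accept only values wrapped in double quotes.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            props[key.strip()] = value[1:-1]
    return props


properties = read_properties(script_text)
print(truncated)
print(properties)
```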

3 Comments

Falls apart when the product name contains a space.
Yes it does, but the question is specifically asking for an easy way to filter vars productid and productname.
Okay, but it falls apart when there is a space in the product name (that's not even a far-fetched scenario). How does "easy" trump "correct"?
