Extract Text from Javascript using Python

Question

I've been looking at examples of how to do this but can't quite figure it out. I'm using BeautifulSoup to scrape some data. I can use it to find the data I want, but that data is contained in the following block of code, and I'm trying to extract the timestamps from it. I suspect regular expressions would work here, but I can't get one to work. Any suggestions?

    <script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>

furas · Accepted Answer · 2016-10-05 00:39:59Z


You can't use BS to extract this data - BS parses only HTML/XML, not JavaScript, so the most it can do is locate the <script> element.

Once you have the script's text, you have to use regular expressions or standard string functions.

EDIT:

text = '''<script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>'''

import re

# Capture the contents of every single-quoted string in the text
re.findall(r"'([^']*)'", text)

result:

['2009-02-23 10 AM',
 '2009-02-08 10 AM',
 '2009-02-09 10 AM',
 '2009-02-22 10 AM',
 '2009-02-21 10 AM',
 '2009-02-20 10 AM']
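The pattern above returns every single-quoted string in the text, so quoted strings elsewhere in the script would come back as well. Below is a minimal sketch of a narrower pattern, assuming every timestamp has the shape shown in the question (`YYYY-MM-DD H AM/PM`). The name `TIMESTAMP` and the extra quoted string in `sample` are illustrative and not part of the original page.

```python
import re

# Assumed timestamp shape: 'YYYY-MM-DD H AM' or 'YYYY-MM-DD HH PM', in single quotes.
TIMESTAMP = re.compile(r"'(\d{4}-\d{2}-\d{2} \d{1,2} [AP]M)'")

# Shortened sample; the quoted string in options1 is made up to show the filtering.
sample = """line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {title: 'not a timestamp'};"""

timestamps = TIMESTAMP.findall(sample)
print(timestamps)
```

If the real data uses a different date format, the pattern would need to be adjusted to match it.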


3 Comments

n1c9 Over a year ago

yep, only thing bs4 will help you do is target this type of data - once you've found it, though, you have to parse it with a regex - I find using a regex + ast.literal_eval works well in some instances.

karty Over a year ago

The code provided definitely works - thank you. In my specific example, since I am using bs4 to target the data in the first place, the resulting data is a bs4 element, which doesn't work with regex. So I convert the result to a string, but some of the code after the timestamps shows up. How can I limit it to what is between 'line1' and 'options1' in the original code? Everything I have tried returns an empty result, e.g. re.findall("'(?<=x)([^']*)(?>=y)'", text), where x='line1' and y='options1'

furas Over a year ago

If the text contains \n, you can use text.split('\n')[2].strip() to get line1 = [...];. Then you can use slicing [8:-1] to remove line1 = at the start and ; at the end - the final expression is text.split('\n')[2].strip()[8:-1]
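Another way to limit the search to what sits between 'line1' and 'options1' is to slice the string at those two markers before running the regex. The following is only a sketch: `text_between` is a hypothetical helper, and `script_text` stands in for the string obtained after converting the bs4 element (shortened here, with a made-up quoted string after `options1`).

```python
import re

def text_between(text, start, end):
    """Return the part of text after the first `start` and before the next `end`."""
    begin = text.index(start) + len(start)  # raises ValueError if start is missing
    stop = text.index(end, begin)           # look for end only after start
    return text[begin:stop]

script_text = """$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {
title: 'etc other text'
  }
});"""

# Only the slice between the two markers is searched for quoted strings.
segment = text_between(script_text, "line1", "options1")
timestamps = re.findall(r"'([^']*)'", segment)
print(timestamps)
```

Unlike indexing a fixed line number, this does not depend on where the line breaks fall, but it does assume both marker names appear in the script in that order.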

alecxe · Accepted Answer · 2016-10-05 13:37:43Z

Another alternative to using regular expressions to parse JavaScript code is to use a JavaScript parser such as slimit. Working code:

import json

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {};
});
</script>"""

soup = BeautifulSoup(data, "html.parser")
parser = Parser()
tree = parser.parse(soup.script.get_text())

for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == 'line1':
        values = json.loads(node.right.to_ecma().replace("'", '"').strip())
        print(values)
        break

Prints a Python list:

[[u'2009-02-23 10 AM', 5203], [u'2009-02-08 10 AM', 3898], [u'2009-02-09 10 AM', 4923], [u'2009-02-22 10 AM', 3682], [u'2009-02-21 10 AM', 3238], [u'2009-02-20 10 AM', 4648]]
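For a literal this simple, a similar nested list can probably be obtained with the standard library alone, along the lines of the regex plus ast.literal_eval idea mentioned in the comments. This is a sketch under one loud assumption: the JavaScript array must also be a valid Python literal (quoted strings and numbers only, no `true`, `null`, or unquoted keys). The sample is shortened to two rows.

```python
import ast
import re

script_text = """$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {};
});"""

# Capture the array assigned to line1, up to the closing bracket followed by ';'.
match = re.search(r"line1\s*=\s*(\[.*?\]);", script_text, re.DOTALL)
if match is None:
    raise ValueError("no line1 assignment found")

# Safe only because the captured text is assumed to be a plain literal.
rows = ast.literal_eval(match.group(1))
timestamps = [row[0] for row in rows]
print(rows)
print(timestamps)
```

A real JavaScript parser remains the more robust choice when the script contains anything beyond plain literals.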
