
I've been looking at examples of how to do this but can't quite figure it out. I'm using BeautifulSoup to scrape some data. I can find the data I want, but it is contained in the following block of code, and I'm trying to extract the timestamp information from it. I have a feeling regular expressions would work here, but I can't get them to. Any suggestions?

    <script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>

2 Answers


You can't use BS to get this data - BS works only with HTML/XML, not JavaScript.

You have to use regular expressions or standard string functions.
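For the string-function route, a minimal sketch (not part of the original answer) could look like the following. It assumes the assignment is written exactly as `line1 = ` and that no `;` appears inside the array. Here `text` is a trimmed, two-row copy of the script from the question.

```python
# Trimmed copy of the script text from the question (two rows only).
text = """$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {
etc other text
  }
});"""

# Keep what follows "line1 = ", then cut at the first ";" that ends the statement.
_, found, rest = text.partition("line1 = ")
array_src = rest.partition(";")[0] if found else ""

# Splitting on "'" leaves the quoted strings at the odd indexes.
timestamps = array_src.split("'")[1::2]
print(timestamps)
```

If the marker is not found, `timestamps` ends up as an empty list rather than raising.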


EDIT:

text = '''<script class="code" type="text/javascript">
    $(document).ready(function(){
    line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
    options1 = {
    etc other text
      }
    });
    </script>'''

import re

re.findall("'([^']*)'", text)

result:

['2009-02-23 10 AM',
 '2009-02-08 10 AM',
 '2009-02-09 10 AM',
 '2009-02-22 10 AM',
 '2009-02-21 10 AM',
 '2009-02-20 10 AM']
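
Note that `'([^']*)'` matches every single-quoted string in the text, so quoted strings elsewhere in the script would be returned as well. A stricter pattern that only accepts strings shaped like the timestamps is one possible refinement. The sketch below assumes the `YYYY-MM-DD H AM/PM` shape seen in the question; the `title: 'Hits per day'` option is hypothetical and only there to show the difference.

```python
import re

# Two rows from the question, plus a hypothetical quoted option value.
text = """line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = { title: 'Hits per day' };"""

# Generic pattern: grabs every single-quoted string, including the title.
everything = re.findall(r"'([^']*)'", text)
print(everything)

# Stricter pattern: only strings shaped like 'YYYY-MM-DD H AM' or '... PM'.
timestamp_re = re.compile(r"'(\d{4}-\d{2}-\d{2} \d{1,2} [AP]M)'")
timestamps = timestamp_re.findall(text)
print(timestamps)
```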

3 Comments

yep, only thing bs4 will help you do is target this type of data - once you've found it, though, you have to parse it with a regex - I find using a regex + ast.literal_eval works well in some instances.
The code provided definitely works, thank you. In my specific example, since I am using bs4 to target the data in the first place, the result is a bs4 element, which doesn't work with regex. So I convert the result to a string, but some of the code after the timestamps shows up. How can I limit it to what is between 'line1' and 'options1' in the original code? Everything I try yields an empty result, e.g. re.findall("'(?<=x)([^']*)(?>=y)'", text), where x='line1' and y='options1'
If the text contains \n, you can use text.split('\n')[2].strip() to get line1 = [...];. Then slice with [8:-1] to remove line1 = and the ; at the end. Altogether: text.split('\n')[2].strip()[8:-1]
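
Another way to limit the match to the `line1` assignment, combining the regex + `ast.literal_eval` idea from the first comment, might look like the sketch below. It is an illustration rather than the answerer's code: `script_text` stands in for the string obtained from the bs4 element, and the approach assumes the array contains only quoted strings and numbers, which Python can read as a literal. As a side note, `(?>=y)` in the attempted pattern is not lookahead syntax; a lookahead is written `(?=...)`.

```python
import ast
import re

# Stand-in for the string taken from the bs4 element (trimmed to two rows).
script_text = """
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898]];
options1 = {
etc other text
  }
});
"""

# Capture the array literal between "line1 =" and the ";" that closes it.
match = re.search(r"line1\s*=\s*(\[.*?\])\s*;", script_text, re.DOTALL)

# The captured text is also a valid Python literal, so parse it safely.
rows = ast.literal_eval(match.group(1)) if match else []

# Each row is [timestamp, count]; keep only the timestamps.
timestamps = [stamp for stamp, _count in rows]
print(timestamps)
```

Because the result is a real list of `[timestamp, count]` pairs, the counts stay available if they are needed later.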

Another alternative to using regular expressions to parse JavaScript code is to use a JavaScript parser like slimit. Working code:

import json

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """<script class="code" type="text/javascript">
$(document).ready(function(){
line1 = [['2009-02-23 10 AM', 5203], ['2009-02-08 10 AM', 3898], ['2009-02-09 10 AM', 4923], ['2009-02-22 10 AM', 3682], ['2009-02-21 10 AM', 3238], ['2009-02-20 10 AM', 4648]];
options1 = {};
});
</script>"""

soup = BeautifulSoup(data, "html.parser")
parser = Parser()
tree = parser.parse(soup.script.get_text())

for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == 'line1':
        values = json.loads(node.right.to_ecma().replace("'", '"').strip())
        print(values)
        break

Prints a Python list:

[[u'2009-02-23 10 AM', 5203], [u'2009-02-08 10 AM', 3898], [u'2009-02-09 10 AM', 4923], [u'2009-02-22 10 AM', 3682], [u'2009-02-21 10 AM', 3238], [u'2009-02-20 10 AM', 4648]]
