4

I am scraping a webpage that stores a lot of relevant information in a JavaScript variable:

response = requests.get('')
r = response.text

Inside r there is a JavaScript variable holding the data I want.

This is what the server returns:

<!DOCTYPE html>
<html>
<head>
....

<script>
 var candidate_details_input_string = '{ ...}'
</script>
....
</head>
</html>

candidate_details_input_string contains a lot of data, so I use .split() to isolate the list I want:

x = r.split('candidate_completed_list\\":')[1].split(']')[0]+']'

However, this gives me the still-escaped JavaScript string, and I'm working in Python. It looks something like this:

x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'

This is a JavaScript string that I would normally pass to JSON.parse(), but I can't because I'm scraping it in Python.

Is there any way to turn this into a Python object I can work with? My fallback is to do it by hand: strip out all of the backslashes and switch the ' into ".

5
  • Can you share the URL? There are various ways to extract JavaScript variables from text. Commented Jul 19, 2019 at 17:48
  • It's not a publicly accessible URL, unfortunately :( Commented Jul 19, 2019 at 17:51
  • Updated with the <script> tag. Commented Jul 19, 2019 at 17:59
  • It's actually {..}, sorry! Commented Jul 19, 2019 at 18:11
  • Can you post a sample of what's inside the '{...}' brackets? Commented Jul 19, 2019 at 18:12

3 Answers

1

You can parse your x variable with json.loads(), which here returns a list of dictionaries. Strip the backslashes first and all is well:

import json

x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'

data = json.loads(x.replace('\\',''))

print(data)
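
Note that x.replace('\\', '') removes every backslash, including any that belong to the data itself (an escaped quote or a backslash inside a value, for instance). A minimal sketch of an alternative, assuming the fragment is valid JSON that was escaped exactly once and contains no \' sequences: wrap it in double quotes so json.loads() undoes the escaping, then parse the result. The helper name decode_escaped_json is made up for illustration, and x is shortened to a single entry.

```python
import json

# Shortened version of the fragment from the question: JSON whose
# double quotes are escaped as \" (one backslash per quote).
x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"}]'


def decode_escaped_json(fragment):
    """Undo one level of string escaping, then parse the result as JSON.

    Hypothetical helper; assumes the fragment has no unescaped double
    quotes and no \\' sequences (valid in JavaScript, invalid in JSON).
    """
    # Wrapping the fragment in double quotes makes it a JSON string
    # literal, so the first loads() resolves \" and \\ escapes.
    unescaped = json.loads('"' + fragment + '"')
    # The second loads() parses the actual JSON document.
    return json.loads(unescaped)


forms = decode_escaped_json(x)
print(forms[0]["i_form_name"])
```

Unlike the blanket replace, this keeps backslashes that are part of a value, at the cost of failing loudly if the fragment is not escaped the way the sketch assumes.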

1

You can use ast.literal_eval in this case:

data = '''<!DOCTYPE html>
<html>
<head>
....

<script>
 var candidate_details_input_string = '{"i_form_name":"Applicant_Information_Form"}';
</script>
....
</head>
</html>'''

import re
from ast import literal_eval

s = re.findall(r'var candidate_details_input_string\s*=\s*\'(.*?\})\s*\'\s*;', data, flags=re.DOTALL)[0]
data = literal_eval(s)
print(data)

Prints:

{'i_form_name': 'Applicant_Information_Form'}
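
If the real page escapes the inner quotes as \" (the x shown in the question suggests this, but it is an assumption about the page), literal_eval raises a SyntaxError, because a backslash outside a string literal is read as a line continuation. A hedged sketch that reproduces the failure and decodes the captured text with json instead; the html sample below is hypothetical:

```python
import json
import re
from ast import literal_eval

# Hypothetical page: same shape as the sample above, but with the
# inner quotes escaped as \" (assumed, not confirmed by the asker).
html = r'''<script>
 var candidate_details_input_string = '{\"i_form_name\":\"Applicant_Information_Form\"}';
</script>'''

pattern = r"var candidate_details_input_string\s*=\s*'(.*?)'\s*;"
match = re.search(pattern, html, flags=re.DOTALL)
if match is None:
    raise ValueError("candidate_details_input_string not found")

js_body = match.group(1)  # still contains the \" escapes

# literal_eval sees a backslash outside any Python string and gives up.
try:
    literal_eval(js_body)
except SyntaxError as exc:
    print("literal_eval failed:", exc)

# Treat the captured text as the body of a JSON string literal to undo
# the escaping, then parse the resulting JSON document.
json_text = json.loads('"' + js_body + '"')
details = json.loads(json_text)
print(details)
```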

3 Comments

I'm getting an error on literal_eval(s): SyntaxError: unexpected character after line continuation character
@MorganAllen It would help if you posted what's inside the string, so the regex can be adjusted appropriately.
Let me see if I can strip out the confidential information.
0

You're getting JSON back from requests. Try using Python's built-in json library; you shouldn't have to do any manual parsing yourself.

import json
import requests

response = requests.get('')
r = json.loads(response.text)

2 Comments

I'm getting a string of HTML back that has some JavaScript inside it. I get this error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
@Morgan Isolate the JSON string like you've already been doing (or use an HTML parser to get to the value), then pass it to json.loads().
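
The HTML-parser route mentioned in the last comment could look roughly like this, using only the standard library. This is a sketch under assumptions: the page string is a made-up stand-in for response.text, the nested shape under candidate_completed_list is guessed from the question, ScriptCollector is a hypothetical helper, and the script is assumed to contain a single single-quoted literal with no \' inside it.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Made-up stand-in for response.text (structure assumed, not confirmed).
page = r'''<!DOCTYPE html>
<html><head>
<script>
 var candidate_details_input_string = '{\"candidate_completed_list\":[{\"i_form_name\":\"Applicant_Information_Form\"}]}';
</script>
</head></html>'''

collector = ScriptCollector()
collector.feed(page)
collector.close()

marker = "var candidate_details_input_string"
script = next(s for s in collector.scripts if marker in s)

# Take the text between the first and last single quote in the script.
literal = script[script.index("'") + 1 : script.rindex("'")]

# Undo the \" escaping, then parse the JSON itself.
details = json.loads(json.loads('"' + literal + '"'))
print(details["candidate_completed_list"])
```

Going through a parser avoids matching the variable name in unrelated parts of the page, though the quote slicing is still fragile if the script holds more than one string literal.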
