I am looking to parse and save the contents of a JSON file that is embedded in HTML code. However, when I isolate the relevant string and try to load it with the json package, I receive the error JSONDecodeError: Extra data, and I am unsure what is causing it.

It was suggested that the relevant string might actually contain multiple dictionaries and that this could be the problem, but I'm not clear on how to proceed if so. My code is provided below. Any suggestions are much appreciated!

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import csv
import json
import re

def left(s, amount):
    return s[:amount]

def right(s, amount):
    return s[-amount:]

def mid(s, offset, amount):
    return s[offset:offset+amount]

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))
soup = BeautifulSoup(s, "lxml")
tables = soup.find_all("script")
# Find the line of the script that contains the TimeLine.init call
for table in tables:
    if "TimeLine.init" in str(table):
        for tbl in str(table).splitlines():
            if "TimeLine.init" in tbl:
                s = tbl.strip()
# This is the call that raises JSONDecodeError: Extra data
j = json.loads(s)
  • If it contains multiple dictionaries, you need to parse them one at a time. It is hard to break the string into pieces corresponding to them without doing some clever parsing. Does it fail every time? What exactly is the string it fails on? What does s look like before json.loads is called? Commented Nov 20, 2016 at 13:05
  • s is quite long - approx 50k characters, so can't post it fully. Will add extract though Commented Nov 20, 2016 at 13:11
  • Although this is not directly related to your technical problem, it is rarely a good idea to parse website content for internal data, mostly because you are not allowed to do so, but also because it might change. Commented Nov 20, 2016 at 13:18
  • @dahrens, how so? Point 7 of those terms allows him to do so for personal, non-commercial use. That the content may change is the nature of the web. Commented Nov 20, 2016 at 13:40
  • "any suggestions" is too broad. Commented Nov 20, 2016 at 13:45

2 Answers

1

You could use the json module's own exception reporting to help with parsing: the error message gives the location where loads() failed, for example:

Extra data: line 1 column 1977 (char 1976)
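As a small illustration of that report, the exception object also exposes the same information as attributes, so the message does not have to be parsed. This is only a sketch: the sample string is made up and merely stands in for the much longer text from the page.

```python
import json

# Hypothetical input: two JSON values separated by a comma,
# similar in shape to the arguments of a JavaScript function call
sample = '[1, 2], {"a": 3}'

try:
    json.loads(sample)
except json.JSONDecodeError as e:
    # e.msg is the short description; e.pos is the 0-based index
    # in the string at which decoding stopped
    error_msg, error_pos = e.msg, e.pos
    print(error_msg, error_pos)
```

Here the decoder reads the first array and stops at the comma that follows it, which is why the rest is reported as extra data.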

The following script first locates all the JavaScript <script> tags and looks for the function call inside each. It then finds the outer start and end of the JSON text. It attempts to decode this text, notes the failing offset, skips that character and tries again. When the final block is reached, it decodes successfully. The script then calls loads() on each valid block, storing the results in json_decoded:

from bs4 import BeautifulSoup
from urllib.request import HTTPError, Request, urlopen
import csv
import json
import re   # needed for re.search() on the exception text below

url = "url"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    s = urlopen(req, timeout=200).read()
except HTTPError as e:
    print(str(e))

json_decoded = []
soup = BeautifulSoup(s, "lxml")

for script in soup.find_all("script", attrs={"type" : "text/javascript"}):
    text = script.text
    search = 'FieldView.TimeLine.init('
    field_start = text.find(search)

    if field_start != -1:
        # Find the start and end of the JSON in the function
        json_offsets = []
        json_start = field_start + len(search)
        json_end = text.rfind('}', 0, text.find(');', json_start)) + 1

        # Extract JSON
        json_text = text[json_start : json_end]

        # Attempt to decode, and record the offsets of where the decode fails
        offset = 0

        while True:
            try:
                dat = json.loads(json_text[offset:])
                break
            except json.decoder.JSONDecodeError as e:
                # Extract failed location from the exception report
                failed_at = int(re.search(r'char\s*(\d+)', str(e)).group(1))
                offset = offset + failed_at + 1
                json_offsets.append(offset)

        # Extract each valid block and decode it to a list
        cur_offset = 0

        for offset in json_offsets:
            json_block = json_text[cur_offset : offset - 1]
            json_decoded.append(json.loads(json_block))
            cur_offset = offset

print(json_decoded)

This results in json_decoded holding two JSON entries.
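A different way to get at the same blocks, not what the script above does, might be json.JSONDecoder.raw_decode(), which decodes one value and returns the index where it stopped, so no exception text needs to be inspected. The sketch below assumes the extracted text is a comma-separated sequence of valid JSON values; decode_json_sequence and the sample string are hypothetical names and data used only for illustration.

```python
import json

def decode_json_sequence(text):
    """Decode comma-separated JSON values one at a time (hypothetical helper)."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # Skip whitespace and the commas that separate the values
        if text[pos] in " \t\r\n,":
            pos += 1
            continue
        # raw_decode returns the decoded value and the index just past it
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# Made-up sample shaped like the arguments of the init() call
sample = '[1, 2], [{"x": "a,b"}], {"k": true}, true, 4, "29:58", 1798'
decoded = decode_json_sequence(sample)
print(decoded)
```

Unlike the offset-based loop, this keeps the last value as well, and commas inside strings are left alone because the decoder consumes each value whole.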

1

You're trying to parse a string that looks like this:

FieldView.TimeLine.init( <first parameter - a json array>, <second parameter - a json array>, <third parameter, a json object>, true, 4, "29:58", 1798);

The angle brackets, < and >, only serve to group here; they have no special meaning and are not actually present.

You won't be able to parse that properly, because it is not valid JSON. Instead, strip the function call and add, for example, square brackets so that the function's parameters are wrapped into a JSON array.

json.loads("[{:s}]".format(str(dat[4]).strip()[24:-2]))
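The same idea can be tried on its own with a made-up line, locating the call by its prefix rather than by fixed slice positions. This is a sketch only: the sample line is invented, and it assumes every argument of the call is itself valid JSON.

```python
import json

# Hypothetical one-line script content standing in for the real page
line = 'FieldView.TimeLine.init([1, 2], [3], {"a": 4}, true, 4, "29:58", 1798);'

prefix = 'FieldView.TimeLine.init('
start = line.index(prefix) + len(prefix)   # first character after the "("
end = line.rindex(');')                    # position of the closing ");"

# Wrapping the arguments in square brackets turns them into one JSON array
args = json.loads('[' + line[start:end] + ']')
print(args)
```

The prefix is 24 characters long and the closing ");" is 2 characters, which corresponds to the [24:-2] slice when the stripped line starts with the call.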

2 Comments

btw this works, but I don't really understand how. Care to explain? Especially the "[{:s}]" part?
The {:s} inside the quote-delimited string is a placeholder: format will fill it with the string given in its arguments. See e.g. the official documentation for str.format. The square brackets surrounding that placeholder are simply there to add those characters to the string, so that the string looks like a JSON array containing the function's parameters. The [24:-2] part takes a substring of the string so that you end up with only the parameters inside the JavaScript function call.
