
Each line is valid JSON, but I need the file as a whole to be valid JSON.

I have some data which is aggregated from a web service and dumped to a file, so it's JSON-esque but not valid JSON, which means it can't be processed in the simple, intuitive way that valid JSON files can. This constitutes a major pain in the neck. It looks (more or less) like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

I've been trying to reinterpret it as valid JSON; my latest attempt looks like this:

with open('toy.json') as inpt:
    lines = []
    for line in inpt:
        if line.startswith('{'):  # block starts
            lines.append(line) 

However, as you can likely deduce from the fact that I'm posing this question, that doesn't work. Any ideas about how I might tackle this problem?

EDIT:

Tried this:

import json

with open('toy_two.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]

print(lines['record'])

but got the following error:

Traceback (most recent call last):
  File "json-ifier.py", line 38, in <module>
    print(lines['record'])
TypeError: list indices must be integers, not str

Ideally I'd like to interact with it as I can with normal JSON, i.e. data['value']

EDIT II:

import json

with open('transactions000000000029.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]

# collect the 'hash' field from every record, then print them
records = [item['hash'] for item in lines]
for item in records:
    print(item)
  • Is each line valid JSON? e.g. does lines = [json.loads(line) for line in inpt] do the job? Commented Sep 16, 2017 at 17:00
  • lines.append(json.loads(line))? Commented Sep 16, 2017 at 17:00
  • yes, but I don't want to process each line; I want to process the file as a whole. The real one has millions of records Commented Sep 16, 2017 at 17:04
  • In what way does [json.loads(line) for line in inpt] not constitute "processing the file as a whole"? Commented Sep 16, 2017 at 17:08
  • I'm quite confused now. If this file were valid JSON, it would be a list, right? What type do you want to interpret it as? Commented Sep 16, 2017 at 17:10

2 Answers


This looks like NDJSON (newline-delimited JSON), a format I've been working with recently. There is a specification for it, though I'm not sure how useful it is. Does the following work?

import json

with open('the file.json', 'rb') as infile:
    data = infile.readlines()
    # strip the trailing newline from each line before parsing it
    data = [json.loads(item.replace('\n', '')) for item in data]

This should give you a list of dictionaries.
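For instance, with the two-line toy data from the question (a minimal sketch, assuming the file is named toy.json), you index into the list first and then into each dictionary:

import json

with open('toy.json') as infile:
    data = [json.loads(line) for line in infile]

# data is a list of dicts, so index the list first...
print(data[0]['record'])  # -> value0

# ...or pull one key out of every record at once
records = [item['record'] for item in data]
print(records)  # -> ['value0', 'value1']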


13 Comments

  • when I tried it out just now I got this error: print(data['record']) TypeError: list indices must be integers, not str. How can I verify that this works?
  • Because this parses the file and gives you a list of dictionaries, not a dictionary.
  • but I want to interact with it like I can with JSON; in normal JSON I can call things like data['record'], you know what I mean?
  • damn, I'm sorry, it was exactly the data[0]['record'] issue. Thank you for your great help! :)
  • @s.matthew.english it's still a list, so items() is out. records = [item['record'] for item in data] should do it? I guess the point of the format is that every line is valid JSON, but the file as a whole is not. I find this a bit uncomfortable too, but you do just have a list of dictionaries, so if you know how to iterate through lists and grab things by key, it's not that bad.

Each line looks like a valid JSON document.

That's "JSON Lines" format (http://jsonlines.org/)

Try to process each line independently (json.loads(line)) or use a specialized library (https://jsonlines.readthedocs.io/en/latest/).

import json

def process(oneline):
    # do what you want with each line
    print(oneline['record'])

with open('toy_two.json', 'rb') as inpt:
    for line in inpt:
        process(json.loads(line))
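If you'd rather not call json.loads yourself, here is a minimal sketch using the jsonlines library linked above (an assumption: it's installed with pip install jsonlines, and the file is the toy_two.json from the question):

import jsonlines

# jsonlines parses one JSON document per line, so each obj
# arrives as an already-decoded dict
with jsonlines.open('toy_two.json') as reader:
    for obj in reader:
        print(obj['record'])

This keeps the one-record-at-a-time memory profile of the loop above, since the reader streams the file line by line.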

5 Comments

  • I'd like to process the file as a whole, as the real one has millions of records
  • So? You can just iterate over each line of the input file as you do in your code, and apply json.loads(line) inside the for loop.
  • sounds expensive, I want to do it cheap and fast
  • If you store all parsed lines in a global list, then yes, this is going to be expensive in RAM. If you process each line independently, you only use a bit of memory for the current line. That's "flow-based programming".
  • ok cool, it was just the data[0]['record'] issue. Anyway, thank you for these great insights!
