
I'm working on Python 2.6.6 and I'm struggling with one issue.

I have a large JSON file with the following structure:

{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}
{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}

And I need to merge the ids so that each one contains all of its GROUPS, like so:

{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]},{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]},{"n":"GROUP_D","v":["true"]},{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}

I tried using the 'json' library, but I couldn't append the values correctly. I also tried converting it all to a dictionary and appending the values (GROUPS) to the key (id) as lists, but I got stuck on writing it all out to the output file in the format I need.

I can do it with bash, but it takes too long to parse all the information and rearrange it into the needed format.

Any help is appreciated!

Thanks.

Comments:
  • Did you manage to get it working as a dictionary, but you just couldn't save it in the right format? Commented Oct 3, 2013 at 0:26
  • I don't think that's valid JSON, at least according to jsonlint.com Commented Oct 3, 2013 at 0:28
  • That's not a JSON structure; it's six separate JSON structures. Are they actually inside an array that you didn't show us? Or do you have a list of them or something? Commented Oct 3, 2013 at 0:32
  • Meanwhile, forget trying to "append to JSON structures". Import the JSON to native Python objects, process them as Python objects, then write them back out to JSON at the end. Commented Oct 3, 2013 at 0:33

1 Answer


First, let's get the JSON stuff out of the way.

Your file is not a JSON structure, it's a bunch of separate JSON objects. From your sample, it looks like it's one object per line. So, let's read it into a list:

import json

with open('spam.json') as f:
    things = [json.loads(line) for line in f]

Then, once we've processed it, we'll write the result back out:

with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')

Now, you don't have a JSON structure that you want to append things to; you have a list of dicts, and you want to create a new list of dicts, merging together the ones with the same key.

Here's one way to do it:

new_things = {}
for thing in things:
    thing_id = thing['id']
    try:
        old_thing = new_things[thing_id]
    except KeyError:
        # first time we've seen this id: keep the whole record
        new_things[thing_id] = thing
    else:
        # seen it before: merge this record's ua entries into the existing one
        old_thing['ua'].extend(thing['ua'])
new_things = new_things.values()

There are a few different ways you could simplify this; I just wrote it this way because it uses no tricks that should be beyond a novice. For example, you could do it by sorting and grouping:

import itertools
import operator

def merge(things):
    return {'id': things[0]['id'],
            'ua': list(itertools.chain.from_iterable(t['ua'] for t in things))}

sorted_things = sorted(things, key=operator.itemgetter('id'))
grouped_things = itertools.groupby(sorted_things, key=operator.itemgetter('id'))
new_things = [merge(list(group)) for key, group in grouped_things]

I didn't realize from your original question that you had tens of millions of rows. All of the above steps require loading the entire original data set into memory, processing with some temporary storage, then writing it back out. But if your data set is too large, you need to find a way to process one row at a time, and keep as little in memory simultaneously as possible.


First, to process one row at a time, you just need to change the initial list comprehension to a generator expression, and move the rest of the code inside the with statement, like this:

with open('spam.json') as f:
    things = (json.loads(line) for line in f)
    for thing in things:
        # blah blah

… at which point it might be just as easy to rewrite it like this:

with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # blah blah

Next, sorting obviously builds the whole sorted list in memory, so that's not acceptable here. But if you don't sort and group, the entire new_things result has to be held in memory at once (because the very last input row could need to be merged into the very first output row).


Your sample data seems to already have the rows sorted by id. If you can count on that in real life—or just count on the rows always being grouped by id—just skip the sorting step, which isn't doing anything but wasting time and memory, and use a grouping solution.
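
For example, here's a minimal sketch of that grouping solution (assuming the rows really are already grouped by id, and reusing the spam.json/eggs.json names from above); it only ever holds one merged group in memory at a time:

import itertools
import json

with open('spam.json') as f_in:
    with open('eggs.json', 'w') as f_out:  # nested with blocks for Python 2.6
        things = (json.loads(line) for line in f_in)
        # consecutive rows with the same id form one group
        for thing_id, group in itertools.groupby(things, key=lambda t: t['id']):
            merged = {'id': thing_id, 'ua': []}
            for thing in group:
                merged['ua'].extend(thing['ua'])
            f_out.write(json.dumps(merged) + '\n')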


On the other hand, if you can't count on the rows being grouped by id, there are only really two ways to reduce memory further: compress the data in some way, or back the storage to disk.


For the first, Foo Bar User's solution built a simpler and smaller data structure (a dict mapping each id to its list of uas, instead of a list of dicts, each with an id and a ua), which should take less memory, and which we could convert to the final format one row at a time. Like this:

import json
from collections import defaultdict

with open('spam.json') as f:
    new_dict = defaultdict(list)
    for row in f:
        thing = json.loads(row)
        new_dict[thing["id"]].extend(thing["ua"])
with open('eggs.json', 'w') as f:
    for id, ua in new_dict.items(): # use iteritems in Python 2.x
        thing = {'id': id, 'ua': ua}
        f.write(json.dumps(thing) + '\n')

For the second, Python comes with a nice way to use a dbm database as if it were a dictionary. If your values are just strings, you can use the anydbm/dbm module (or one of the specific implementations). Since your values are lists, you'll need to use shelve instead.
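
Here's a rough sketch of the shelve version (the filename groups.db is just a placeholder, and note that a shelf won't see in-place mutation of a stored list, so you have to assign the value back or open the shelf with writeback=True):

import json
import shelve

db = shelve.open('groups.db')   # an on-disk, dict-like store; filename is a placeholder
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        key = str(thing['id'])  # shelve keys must be plain strings
        ua = db.get(key, [])
        ua.extend(thing['ua'])
        db[key] = ua            # assign back so the change actually hits the disk
with open('eggs.json', 'w') as f:
    for key in db:
        f.write(json.dumps({'id': key, 'ua': db[key]}) + '\n')
db.close()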

Anyway, while this will reduce your memory usage, it could slow things down. On a machine with 4GB of RAM, the savings in pagefile swapping will probably blow away the extra cost of going through the database… but on a machine with 16GB of RAM, you may just be adding overhead for very little gain. You may want to experiment with smaller files first, to see how much slower your code is with shelve vs. dict when memory isn't an issue.
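
If you do want to measure that, something like this untested sketch (sample.json and bench.db are placeholder names) would let you time the same merge loop against both backends:

import json
import shelve
import time

def merge_into(store, path):
    # the same merge loop as above, run against any dict-like store
    with open(path) as f:
        for line in f:
            thing = json.loads(line)
            key = str(thing['id'])
            ua = store.get(key, [])
            ua.extend(thing['ua'])
            store[key] = ua

for name, store in [('dict', {}), ('shelve', shelve.open('bench.db'))]:
    start = time.time()
    merge_into(store, 'sample.json')  # sample.json: a smaller slice of the real data
    print name, time.time() - start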


Alternatively, if things get way beyond the limits of your memory, you can always use a more powerful database that actually can sort things on disk. For example (untested):

import sqlite3

db = sqlite3.connect('temp.sqlite')
c = db.cursor()
c.execute('CREATE TABLE Things (tid, ua)')
for thing in things:
    for ua in thing['ua']:
        # sqlite can't store a dict directly, so store each ua entry as JSON text
        c.execute('INSERT INTO Things (tid, ua) VALUES (?, ?)',
                  (thing['id'], json.dumps(ua)))
db.commit()
c.execute('SELECT tid, ua FROM Things ORDER BY tid')
rows = iter(c.fetchone, None)
grouped_things = itertools.groupby(rows, key=operator.itemgetter(0))
# rows are (tid, ua_json) tuples, so build the merged dicts directly
new_things = ({'id': key, 'ua': [json.loads(row[1]) for row in group]}
              for key, group in grouped_things)
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')

Comments

Good job. I thought defaultdict would be better, but this seems to be the best approach.
@FooBarUser: The problem with defaultdict is that you'd need to write a pretty complicated function to create the structure you want and/or a complicated expression to update it. But as I said in a comment to your answer, if you just build the structure defaultdict can build easily, I think you can transform it into the right structure with a one-liner afterwards.
@FooBarUser: I've edited your answer to show what I mean. I think it's a better answer than my first version, and just as novice-friendly. (I still like my groupby solution, but I don't want to try explaining it…)
@abarnert - Thanks a lot! To your question in the comments: it's not an array, it's just how the file is structured; it contains ~10M records like this. I was able to run your code with one small adjustment: in the final with open('eggs.json', 'w') as f: for thing in new_things: f.write(json.dumps(thing) + '\n') block, I changed new_things to things. I tried to run it over all the 10M records, but it fails with some encoding issues which I'm now trying to isolate.
I was able to run it on the 10M-record file; it took 73m17.105s (real) and I saw that Python used about 15 GB of memory while it ran. I wonder whether something can be done to improve the resource consumption and the running time? Thanks!
