How to quickly create a two-dimensional NumPy array from JSON elements?

Question

I have nested JSON data that resembles the following:

[{'game':'001', 'animals': [{'name':'Dog', 'colour':'Red'}, {'name':'Horse', 'age':'6'},{'name':'Ostrich', 'location':'Africa'}]},{'game':'002', 'animals': [{'name':'Cat', 'colour':'Green'}, {'name':'Bison', 'location':'North America'},{'name':'Parrot', 'location':'Southeast Asia'}]}]

My objective is to create an indicator (0/1) row for each game, with one column per item in the variable "animal_list", set to 1 when that animal appears as a 'name' in the game:

animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

So the desired structure would resemble the following (expressed as CSV for illustration only; what I'm actually seeking is a positional NumPy array):

Game, Bison, Cat, Dog, Elephants, Horse, Ostrich, Parrot
"001",0,0,1,0,1,1,0
"002",1,1,0,0,0,0,1

I have traditionally built this with a double loop: an outer loop over the 'game' items and an inner loop that scans the 'name' items. The problem is that I have a long JSON list and it takes hours to run.

Thanks for your help!

If you show us your traditional method it would be easier to suggest improvements. It also makes testing easier.
– hpaulj, Commented Dec 27, 2020 at 18:03
JSON is a string; loads makes a dictionary. There are only two ways to access dictionary elements: by key indexing or via items lists. numpy does not have any magic to do either of these faster.
– hpaulj, Commented Dec 27, 2020 at 18:09
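
As a small illustration of the comment above, here is a sketch of what json.loads returns for data shaped like the question's. The string raw is a hypothetical, shortened stand-in for the real input (note that valid JSON needs double quotes, unlike the Python-style literal shown in the question). For this shape the top level comes back as a list of dictionaries, which are then accessed by key:

```python
import json

# Hypothetical JSON text with the same shape as the question's data.
raw = ('[{"game": "001", "animals": ['
       '{"name": "Dog", "colour": "Red"}, {"name": "Horse", "age": "6"}]}]')

# json.loads turns the text into plain Python objects (here: a list of dicts).
records = json.loads(raw)

# Elements are reached by ordinary list indexing and dictionary keys.
first_game = records[0]['game']
first_names = [animal['name'] for animal in records[0]['animals']]
```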

Lior Cohen · Accepted Answer · 2020-12-27 16:03:10Z

1

Below is the pandas version of the table.

You can always get the underlying ndarray with df.values (or with df.to_numpy() in pandas 0.24 and later).

import numpy as np
import pandas as pd

data = [{'game': '001', 'animals': [{'name':'Dog', 'colour':'Red'}, {'name':'Horse', 'age':'6'},{'name':'Ostrich', 'location':'Africa'}]},
        {'game': '002', 'animals': [{'name':'Cat', 'colour':'Green'}, {'name':'Bison', 'location':'North America'},{'name':'Parrot', 'location':'Southeast Asia'}]}]
animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

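# One row label per game, in the order the games appear in the data.
games = [d['game'] for d in data]

# Start from an all-zero table: one row per game, one column per animal.
df = pd.DataFrame(np.zeros((len(games), len(animal_list))),
                  index=games, columns=animal_list)

# Flag every animal named in a game with a single label-based assignment.
for g, d in zip(games, data):
    names = [a['name'] for a in d['animals']]
    df.loc[g, names] = 1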

print(df)


     Bison  Cat  Dog  Elephants  Horse  Ostrich  Parrot
001    0.0  0.0  1.0        0.0    1.0      1.0     0.0
002    1.0  1.0  0.0        0.0    0.0      0.0     1.0
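
If an integer NumPy array is wanted rather than a float DataFrame, the conversion might look like the sketch below. The small df here is a hypothetical three-column excerpt of the table above, rebuilt only so that the snippet runs on its own:

```python
import pandas as pd

# Hypothetical excerpt of the table above (first three animal columns only).
df = pd.DataFrame([[0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]],
                  index=['001', '002'],
                  columns=['Bison', 'Cat', 'Dog'])

# to_numpy drops the row and column labels and can cast in the same step.
arr = df.to_numpy(dtype=int)

# The labels stay available separately if positions must be mapped back.
row_labels = list(df.index)
col_labels = list(df.columns)
```

The array is purely positional, so row_labels and col_labels are what tie a position back to a game or an animal.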

edited Dec 27, 2020 at 16:03

answered Dec 27, 2020 at 15:27

Lior Cohen


2 Comments

C. Cooney Over a year ago

Thanks very much for this. Much more elegant than the loop I am using. A question - I have been under the impression that dataframe lookups are slow. Might it be faster to use a numpy array and append?

Lior Cohen Over a year ago

Basically you are right. But in your case, because the original data is labeled by strings and not by positions, you will have to manually convert the labels to indexes. This will probably cost you more and will be less elegant.
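
For reference, the manual label-to-index conversion discussed in these comments could be sketched as follows. This is not the answerer's method, only one possible NumPy-only variant; the names col_of, rows, cols and indicator are made up for the illustration, and skipping animals that are missing from animal_list is an assumption about the desired behavior. Whether it is actually faster than the DataFrame version would need to be measured on the real data.

```python
import numpy as np

data = [{'game': '001', 'animals': [{'name': 'Dog', 'colour': 'Red'}, {'name': 'Horse', 'age': '6'}, {'name': 'Ostrich', 'location': 'Africa'}]},
        {'game': '002', 'animals': [{'name': 'Cat', 'colour': 'Green'}, {'name': 'Bison', 'location': 'North America'}, {'name': 'Parrot', 'location': 'Southeast Asia'}]}]
animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

# Build the label -> column position lookup once (dictionary lookups are cheap).
col_of = {name: j for j, name in enumerate(animal_list)}

# Collect the (row, column) position of every animal that has a column.
rows, cols = [], []
for i, game in enumerate(data):
    for animal in game['animals']:
        j = col_of.get(animal['name'])
        if j is not None:  # assumption: names outside animal_list are ignored
            rows.append(i)
            cols.append(j)

# Fill the indicator array with one vectorized assignment.
indicator = np.zeros((len(data), len(animal_list)), dtype=np.int8)
indicator[rows, cols] = 1

# Row i of the array corresponds to games[i].
games = [d['game'] for d in data]
```

The Python-level loop over the records remains, as the earlier comment on dictionary access points out; the sketch only avoids per-row label lookups in a DataFrame.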
