I have nested JSON data that resembles the following:

[{'game':'001', 'animals': [{'name':'Dog', 'colour':'Red'}, {'name':'Horse', 'age':'6'},{'name':'Ostrich', 'location':'Africa'}]},{'game':'002', 'animals': [{'name':'Cat', 'colour':'Green'}, {'name':'Bison', 'location':'North America'},{'name':'Parrot', 'location':'Southeast Asia'}]}]

My objective is to create an indicator array entry for each animal (contained in 'name') corresponding to items in the variable "animal_list":

animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

So the desired structure would resemble the following (expressed as CSV for illustration only; what I am actually after is a positional NumPy array):

Game, Bison, Cat, Dog, Elephants, Horse, Ostrich, Parrot
"001",0,0,1,0,1,1,0
"002",1,1,0,0,0,0,1

I have traditionally built this with a "double loop": an outer loop over the 'game' items and an inner loop that scans through the 'name' items. The problem is that my JSON list is long and this takes hours to run.

Thanks for your help!

  • Please provide the code to your current approach. Commented Dec 27, 2020 at 14:48
  • It still seems invalid. Commented Dec 27, 2020 at 14:51
  • Why not use pandas? Commented Dec 27, 2020 at 14:57
  • If you show us your traditional method it would be easier to suggest improvements. It also makes testing easier. Commented Dec 27, 2020 at 18:03
  • JSON is a string; loads makes a dictionary. There are only two ways to access dictionary elements: by key indexing or via the items list. NumPy does not have any magic to do either of these faster. Commented Dec 27, 2020 at 18:09

1 Answer

Below is the pandas version of the table.

You can always get the underlying ndarray with df.values (or, equivalently, df.to_numpy()).

import numpy as np
import pandas as pd

data = [{'game': '001', 'animals': [{'name':'Dog', 'colour':'Red'}, {'name':'Horse', 'age':'6'},{'name':'Ostrich', 'location':'Africa'}]},
        {'game': '002', 'animals': [{'name':'Cat', 'colour':'Green'}, {'name':'Bison', 'location':'North America'},{'name':'Parrot', 'location':'Southeast Asia'}]}]
animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

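# Row labels: one per game, in input order.
games = [d['game'] for d in data]

# Start from an all-zero table; np.zeros yields floats, hence the 0.0/1.0 output.
df = pd.DataFrame(np.zeros((len(games), len(animal_list))),
                  index=games, columns=animal_list)

# Flag every animal that appears in each game.
for g, d in zip(games, data):
    names = [a['name'] for a in d['animals']]
    df.loc[g, names] = 1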

print(df)


     Bison  Cat  Dog  Elephants  Horse  Ostrich  Parrot
001    0.0  0.0  1.0        0.0    1.0      1.0     0.0
002    1.0  1.0  0.0        0.0    0.0      0.0     1.0
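
The question asks for 0/1 integers, while the table above holds floats because np.zeros defaults to float64. A minimal sketch, assuming integer output is wanted: pass dtype=int when creating the zeros, then pull out the plain ndarray. The two df.loc lines are hard-coded stand-ins for the loop, only to keep the example short.

```python
import numpy as np
import pandas as pd

games = ['001', '002']
animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

# dtype=int makes the table hold integers instead of the default float64.
df = pd.DataFrame(np.zeros((len(games), len(animal_list)), dtype=int),
                  index=games, columns=animal_list)

# Hard-coded stand-ins for the per-game loop (illustration only).
df.loc['001', ['Dog', 'Horse', 'Ostrich']] = 1
df.loc['002', ['Cat', 'Bison', 'Parrot']] = 1

# Positional ndarray: rows follow `games`, columns follow `animal_list`.
arr = df.to_numpy()
print(arr)
```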

2 Comments

Thanks very much for this. Much more elegant than the loop I am using. A question - I have been under the impression that dataframe lookups are slow. Might it be faster to use a numpy array and append?
Basically you are right. But in your case, because the original data is labeled by strings and not by positions, you will have to convert the labels to indexes manually. This will probably cost you more and will be less elegant.
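
For reference, the NumPy-only route discussed in these comments could look roughly like the sketch below. It is not the answerer's code. It assumes the label-to-index conversion is done once with a dictionary (the names col, rows, cols and indicator are made up for illustration), and that animals missing from animal_list should simply be skipped. Whether it actually beats the pandas version on a long list would need to be timed.

```python
import numpy as np

data = [{'game': '001', 'animals': [{'name': 'Dog', 'colour': 'Red'}, {'name': 'Horse', 'age': '6'}, {'name': 'Ostrich', 'location': 'Africa'}]},
        {'game': '002', 'animals': [{'name': 'Cat', 'colour': 'Green'}, {'name': 'Bison', 'location': 'North America'}, {'name': 'Parrot', 'location': 'Southeast Asia'}]}]
animal_list = ['Bison', 'Cat', 'Dog', 'Elephants', 'Horse', 'Ostrich', 'Parrot']

# Build the name -> column position map once, so each lookup is a dict hit.
col = {name: j for j, name in enumerate(animal_list)}

games = [d['game'] for d in data]
indicator = np.zeros((len(data), len(animal_list)), dtype=np.int8)

# Collect (row, column) pairs; names not in animal_list are skipped (assumption).
rows, cols = [], []
for i, d in enumerate(data):
    for animal in d['animals']:
        j = col.get(animal['name'])
        if j is not None:
            rows.append(i)
            cols.append(j)

# Set all flagged cells in a single fancy-indexing assignment.
indicator[rows, cols] = 1
print(indicator)
# [[0 0 1 0 1 1 0]
#  [1 1 0 0 0 0 1]]
```

Row i of indicator corresponds to games[i], and column j to animal_list[j].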
