
I have had to create multiple JSON files while processing a corpus (using GNRD http://gnrd.globalnames.org/ for scientific name extraction). I now want to use these JSON files to annotate the corpus as a whole.

I am trying to merge the multiple JSON files in Python. Each file contains an array of objects with a single key (scientificName) mapped to the name found. Below is an example of one of the shorter files:

{  
  "file":"biodiversity_trophic_9.txt",
  "names":[  
    {  
      "scientificName":"Bufo"
    },
    {  
      "scientificName":"Eleutherodactylus talamancae"
    },
    {  
      "scientificName":"E. punctariolus"
    },
    {  
      "scientificName":"Norops lionotus"
    },
    {  
      "scientificName":"Centrolenella prosoblepon"
    },
    {  
      "scientificName":"Sibon annulatus"
    },
    {  
      "scientificName":"Colostethus flotator"
    },
    {  
      "scientificName":"C. inguinalis"
    },
    {  
      "scientificName":"Eleutherodactylus"
    },
    {  
      "scientificName":"Hyla columba"
    },
    {  
      "scientificName":"Bufo haematiticus"
    },
    {  
      "scientificName":"S. annulatus"
    },
    {  
      "scientificName":"Leptodeira septentrionalis"
    },
    {  
      "scientificName":"Imantodes cenchoa"
    },
    {  
      "scientificName":"Oxybelis brevirostris"
    },
    {  
      "scientificName":"Cressa"
    },
    {  
      "scientificName":"Coloma"
    },
    {  
      "scientificName":"Perlidae"
    },
    {  
      "scientificName":"Hydropsychidae"
    },
    {  
      "scientificName":"Hyla"
    },
    {  
      "scientificName":"Norops"
    },
    {  
      "scientificName":"Hyla colymbiphyllum"
    },
    {  
      "scientificName":"Colostethus inguinalis"
    },
    {  
      "scientificName":"Oxybelis"
    },
    {  
      "scientificName":"Rana warszewitschii"
    },
    {  
      "scientificName":"R. warszewitschii"
    },
    {  
      "scientificName":"Rhyacophilidae"
    },
    {  
      "scientificName":"Daphnia magna"
    },
    {  
      "scientificName":"Hyla colymba"
    },
    {  
      "scientificName":"Centrolenella"
    },
    {  
      "scientificName":"Orconectes nais"
    },
    {  
      "scientificName":"Orconectes neglectus"
    },
    {  
      "scientificName":"Campostoma anomalum"
    },
    {  
      "scientificName":"Caridina"
    },
    {  
      "scientificName":"Decapoda"
    },
    {  
      "scientificName":"Atyidae"
    },
    {  
      "scientificName":"Cerastoderma edule"
    },
    {  
      "scientificName":"Rana aurora"
    },
    {  
      "scientificName":"Riffle"
    },
    {  
      "scientificName":"Calopterygidae"
    },
    {  
      "scientificName":"Elmidae"
    },
    {  
      "scientificName":"Gyrinidae"
    },
    {  
      "scientificName":"Gerridae"
    },
    {  
      "scientificName":"Naucoridae"
    },
    {  
      "scientificName":"Oligochaeta"
    },
    {  
      "scientificName":"Veliidae"
    },
    {  
      "scientificName":"Libellulidae"
    },
    {  
      "scientificName":"Philopotamidae"
    },
    {  
      "scientificName":"Ephemeroptera"
    },
    {  
      "scientificName":"Psephenidae"
    },
    {  
      "scientificName":"Baetidae"
    },
    {  
      "scientificName":"Corduliidae"
    },
    {  
      "scientificName":"Zygoptera"
    },
    {  
      "scientificName":"B. buto"
    },
    {  
      "scientificName":"C. euknemos"
    },
    {  
      "scientificName":"C. ilex"
    },
    {  
      "scientificName":"E. padi noblei"
    },
    {  
      "scientificName":"E. padi"
    },
    {  
      "scientificName":"E. bufo"
    },
    {  
      "scientificName":"E. butoni"
    },
    {  
      "scientificName":"E. crassi"
    },
    {  
      "scientificName":"E. cruentus"
    },
    {  
      "scientificName":"H. colymbiphyllum"
    },
    {  
      "scientificName":"N. aterina"
    },
    {  
      "scientificName":"S. ilex"
    },
    {  
      "scientificName":"Anisoptera"
    },
    {  
      "scientificName":"Riffle delta"
    }
  ],
  "total":67,
  "status":200,
  "unique":true,
  "engines":[  
    "TaxonFinder",
    "NetiNeti"
  ],
  "verbatim":false,
  "input_url":null,
  "token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st",
  "parameters":{  
    "engine":0,
    "return_content":false,
    "best_match_only":false,
    "data_source_ids":[  

    ],
    "detect_language":true,
    "all_data_sources":false,
    "preferred_data_sources":[  

    ]
  },
  "execution_time":{  
    "total_duration":3.1727607250213623,
    "find_names_duration":1.9656541347503662,
    "text_preparation_duration":1.000107765197754
  },
  "english_detected":true
}

The issue is that there may be duplicates across the files, which I want to remove (otherwise I could just concatenate the files). The similar questions I have found elsewhere are about merging extra keys and values to extend the arrays themselves, which is not what I need.

Can anyone give me guidance on how to overcome this issue?

  • Load the JSON files into Python, as whatever object types represent them. Then merge those objects using whatever logic you need (there is no generic "please merge these" rule; you need to decide how the merge makes sense and what the resulting object should look like). Then serialise the merged object back into JSON. Commented Oct 5, 2017 at 10:58
  • Can you give an example of the expected result? Commented Oct 5, 2017 at 11:02
  • Thank you for your comments. The expected result would be all the files merged into one, preferably with any duplicates deleted along the way, as I think duplicates might cause problems in the subsequent annotation I want to perform on the corpus. All the files are like the one described above; there are 15 of them, each with the same introduction and ending with the number of entries, the time taken for the search, etc. Would it be best to delete this metadata manually from each file first? Commented Oct 5, 2017 at 12:07
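The load → merge → serialise pipeline described in the first comment can be sketched end to end like this (the sample data and the output name merged.json are hypothetical, standing in for the 15 GNRD result files):

```python
import json

# write two tiny sample files shaped like the GNRD output shown above
sample_a = {"names": [{"scientificName": "Bufo"}, {"scientificName": "Hyla columba"}]}
sample_b = {"names": [{"scientificName": "Bufo"}, {"scientificName": "Daphnia magna"}]}
for path, payload in [("sample_a.json", sample_a), ("sample_b.json", sample_b)]:
    with open(path, "w") as f:
        json.dump(payload, f)

# load each file and merge the names, keeping each name only once
seen = set()
merged_names = []
for path in ["sample_a.json", "sample_b.json"]:
    with open(path) as f:
        data = json.load(f)
    for entry in data["names"]:
        name = entry["scientificName"]
        if name not in seen:
            seen.add(name)
            merged_names.append({"scientificName": name})

# serialise the merged result back into JSON
with open("merged.json", "w") as f:
    json.dump({"names": merged_names, "total": len(merged_names)}, f, indent=2)

print(len(merged_names))  # 3: the duplicate "Bufo" is kept only once
```

Note that json.load happily parses the whole file, metadata and all, so there is no need to strip the surrounding keys (total, engines, execution_time, …) from each file by hand; the merge simply ignores them.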

1 Answer


If I understand correctly, you want to collect all the "scientificName" values from the "names" arrays of a batch of files. If I'm wrong, you should give an expected output to make things easier to understand.

I'd do something like this:

import json

all_names = set()  # use a set to avoid duplicates

# put all your files in here
for filename in ('file1.json', 'file2.json'):
    try:
        with open(filename, 'rt') as finput:
            data = json.load(finput)
        # collect every scientificName from the "names" array
        for name in data.get('names', []):
            all_names.add(name.get('scientificName'))
    except Exception as exc:
        print("Skipped file {} because of exception: {}".format(filename, exc))

print(all_names)

And in case you want output in a format similar to that of the initial files, add:

from pprint import pprint
pprint({"names": [{"scientificName": name} for name in sorted(all_names)],
        "total": len(all_names)})
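And to persist that structure rather than just print it, json.dump will serialise it back to disk (the output name merged_names.json is hypothetical, and the small all_names set here stands in for the one built above):

```python
import json

# hypothetical merged result; in practice this is the all_names set built above
all_names = {"Bufo", "Hyla columba", "Daphnia magna"}

merged = {
    "names": [{"scientificName": name} for name in sorted(all_names)],
    "total": len(all_names),
}

# write the combined, de-duplicated list to a single JSON file
with open("merged_names.json", "w") as foutput:
    json.dump(merged, foutput, indent=2)
```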

