
I have had to create multiple JSON files while processing a corpus (using GNRD http://gnrd.globalnames.org/ for scientific name extraction). I now want to use these JSON files to annotate the corpus as a whole.

I am trying to merge the multiple JSON files in Python. Each file contains an array of objects with a single key (scientificName) mapped to the name found. Below is an example of one of the shorter files:

{  
  "file":"biodiversity_trophic_9.txt",
  "names":[  
    {  
      "scientificName":"Bufo"
    },
    {  
      "scientificName":"Eleutherodactylus talamancae"
    },
    {  
      "scientificName":"E. punctariolus"
    },
    {  
      "scientificName":"Norops lionotus"
    },
    {  
      "scientificName":"Centrolenella prosoblepon"
    },
    {  
      "scientificName":"Sibon annulatus"
    },
    {  
      "scientificName":"Colostethus flotator"
    },
    {  
      "scientificName":"C. inguinalis"
    },
    {  
      "scientificName":"Eleutherodactylus"
    },
    {  
      "scientificName":"Hyla columba"
    },
    {  
      "scientificName":"Bufo haematiticus"
    },
    {  
      "scientificName":"S. annulatus"
    },
    {  
      "scientificName":"Leptodeira septentrionalis"
    },
    {  
      "scientificName":"Imantodes cenchoa"
    },
    {  
      "scientificName":"Oxybelis brevirostris"
    },
    {  
      "scientificName":"Cressa"
    },
    {  
      "scientificName":"Coloma"
    },
    {  
      "scientificName":"Perlidae"
    },
    {  
      "scientificName":"Hydropsychidae"
    },
    {  
      "scientificName":"Hyla"
    },
    {  
      "scientificName":"Norops"
    },
    {  
      "scientificName":"Hyla colymbiphyllum"
    },
    {  
      "scientificName":"Colostethus inguinalis"
    },
    {  
      "scientificName":"Oxybelis"
    },
    {  
      "scientificName":"Rana warszewitschii"
    },
    {  
      "scientificName":"R. warszewitschii"
    },
    {  
      "scientificName":"Rhyacophilidae"
    },
    {  
      "scientificName":"Daphnia magna"
    },
    {  
      "scientificName":"Hyla colymba"
    },
    {  
      "scientificName":"Centrolenella"
    },
    {  
      "scientificName":"Orconectes nais"
    },
    {  
      "scientificName":"Orconectes neglectus"
    },
    {  
      "scientificName":"Campostoma anomalum"
    },
    {  
      "scientificName":"Caridina"
    },
    {  
      "scientificName":"Decapoda"
    },
    {  
      "scientificName":"Atyidae"
    },
    {  
      "scientificName":"Cerastoderma edule"
    },
    {  
      "scientificName":"Rana aurora"
    },
    {  
      "scientificName":"Riffle"
    },
    {  
      "scientificName":"Calopterygidae"
    },
    {  
      "scientificName":"Elmidae"
    },
    {  
      "scientificName":"Gyrinidae"
    },
    {  
      "scientificName":"Gerridae"
    },
    {  
      "scientificName":"Naucoridae"
    },
    {  
      "scientificName":"Oligochaeta"
    },
    {  
      "scientificName":"Veliidae"
    },
    {  
      "scientificName":"Libellulidae"
    },
    {  
      "scientificName":"Philopotamidae"
    },
    {  
      "scientificName":"Ephemeroptera"
    },
    {  
      "scientificName":"Psephenidae"
    },
    {  
      "scientificName":"Baetidae"
    },
    {  
      "scientificName":"Corduliidae"
    },
    {  
      "scientificName":"Zygoptera"
    },
    {  
      "scientificName":"B. buto"
    },
    {  
      "scientificName":"C. euknemos"
    },
    {  
      "scientificName":"C. ilex"
    },
    {  
      "scientificName":"E. padi noblei"
    },
    {  
      "scientificName":"E. padi"
    },
    {  
      "scientificName":"E. bufo"
    },
    {  
      "scientificName":"E. butoni"
    },
    {  
      "scientificName":"E. crassi"
    },
    {  
      "scientificName":"E. cruentus"
    },
    {  
      "scientificName":"H. colymbiphyllum"
    },
    {  
      "scientificName":"N. aterina"
    },
    {  
      "scientificName":"S. ilex"
    },
    {  
      "scientificName":"Anisoptera"
    },
    {  
      "scientificName":"Riffle delta"
    }
  ],
  "total":67,
  "status":200,
  "unique":true,
  "engines":[  
    "TaxonFinder",
    "NetiNeti"
  ],
  "verbatim":false,
  "input_url":null,
  "token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st",
  "parameters":{  
    "engine":0,
    "return_content":false,
    "best_match_only":false,
    "data_source_ids":[  

    ],
    "detect_language":true,
    "all_data_sources":false,
    "preferred_data_sources":[  

    ]
  },
  "execution_time":{  
    "total_duration":3.1727607250213623,
    "find_names_duration":1.9656541347503662,
    "text_preparation_duration":1.000107765197754
  },
  "english_detected":true
}

The issue is that there may be duplicates across the files, which I want to remove (otherwise I could just concatenate the files). The similar questions I have found elsewhere are about merging extra keys and values to extend the arrays themselves, which is not what I need.

Can anyone give me guidance on how to overcome this issue?

  • Load the JSON files into Python, as whatever object types represent them. Then merge those objects using whatever logic you need (there is no generic "please merge these" rule; you need to decide how the merge makes sense and what the resulting object should look like). Then serialise the merged object back into JSON. Commented Oct 5, 2017 at 10:58
  • Can you give an example of the expected result? Commented Oct 5, 2017 at 11:02
  • Thank you for your comments. The expected result would be all the files merged into one, preferably with any duplicates deleted along the way, as I think duplicates might cause problems in the subsequent annotation I want to perform on the corpus. All the files are like the one described above; there are 15 of them, each with the same introduction and ending with the number of entries, the time taken for the search, etc. Would it be best to delete this metadata manually from each file first? Commented Oct 5, 2017 at 12:07
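The load → merge → serialise pipeline described in the first comment can be sketched end to end like this (the sample data and the output name merged.json are hypothetical, standing in for the 15 GNRD result files):

```python
import json

# write two tiny sample files shaped like the GNRD output shown above
sample_a = {"names": [{"scientificName": "Bufo"}, {"scientificName": "Hyla columba"}]}
sample_b = {"names": [{"scientificName": "Bufo"}, {"scientificName": "Daphnia magna"}]}
for path, payload in [("sample_a.json", sample_a), ("sample_b.json", sample_b)]:
    with open(path, "w") as f:
        json.dump(payload, f)

# load each file and merge the names, keeping each name only once
seen = set()
merged_names = []
for path in ["sample_a.json", "sample_b.json"]:
    with open(path) as f:
        data = json.load(f)
    for entry in data["names"]:
        name = entry["scientificName"]
        if name not in seen:
            seen.add(name)
            merged_names.append({"scientificName": name})

# serialise the merged result back into JSON
with open("merged.json", "w") as f:
    json.dump({"names": merged_names, "total": len(merged_names)}, f, indent=2)

print(len(merged_names))  # 3: the duplicate "Bufo" is kept only once
```

Note that json.load happily parses the whole file, metadata and all, so there is no need to strip the surrounding keys (total, engines, execution_time, …) from each file by hand; the merge simply ignores them.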

1 Answer


If I understand correctly, you want to collect all the "scientificName" values from the "names" arrays of a batch of files. If I'm wrong, you should give an expected output to make things easier to understand.

I'd do something like this:

import json

all_names = set()  # use a set to avoid duplicates

# put all your files in here
for filename in ('file1.json', 'file2.json'):
    try:
        with open(filename, 'rt') as finput:
            data = json.load(finput)
        # collect every scientificName from the "names" array
        for name in data.get('names', []):
            all_names.add(name.get('scientificName'))
    except Exception as exc:
        print("Skipped file {} because of exception: {}".format(filename, exc))

print(all_names)

And in case you want output in a format similar to that of the initial files, add:

from pprint import pprint
pprint({"names": [{"scientificName": name} for name in sorted(all_names)],
        "total": len(all_names)})
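And to persist that structure rather than just print it, json.dump will serialise it back to disk (the output name merged_names.json is hypothetical, and the small all_names set here stands in for the one built above):

```python
import json

# hypothetical merged result; in practice this is the all_names set built above
all_names = {"Bufo", "Hyla columba", "Daphnia magna"}

merged = {
    "names": [{"scientificName": name} for name in sorted(all_names)],
    "total": len(all_names),
}

# write the combined, de-duplicated list to a single JSON file
with open("merged_names.json", "w") as foutput:
    json.dump(merged, foutput, indent=2)
```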

