I have an XML file structured as shown below (simplified for the purposes of this question). For each record, I want to extract the article title and the DOI, i.e. the text of the "ArticleId" element whose "IdType" attribute is "doi" (this element is sometimes missing), and then store the article title in a dictionary with the DOI as the key.
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
      <Article PubModel="Print-Electronic">
        <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">28950791</ArticleId>
        <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
      <Article PubModel="Print-Electronic">
        <ArticleTitle>Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">25747267</ArticleId>
        <ArticleId IdType="pii">S1631-0691(15)00050-5</ArticleId>
        <ArticleId IdType="doi">10.1016/j.crvi.2015.02.001</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
      <Article PubModel="Print-Electronic">
        <ArticleTitle>[Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">27548984</ArticleId>
        <!-- in this record, DOI is missing -->
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
In a lame attempt to achieve that, I used xml.etree.ElementTree, as follows:
import xml.etree.ElementTree as ET
xmldoc = ET.parse('sample.xml')
root = xmldoc.getroot()
pubs = {}
for elem in xmldoc.iter(tag='ArticleTitle'):
    title = elem.text
    for subelem in xmldoc.iter(tag='ArticleId'):
        if subelem.get("IdType") == "doi":
            doi = subelem.text
            pubs[doi] = title
if len(pubs) == 0:
    print "No articles found"
else:
    for pub in pubs.keys():
        print pub + ' ' + pubs[pub]
But there is a problem with the loops traversing the document tree, because the above code results in:
10.1177/0748233717726877 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].
10.1016/j.crvi.2015.02.001 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].
That is, I get the correct DOIs, but each one is paired with the title of the last article, which has no DOI at all!
The correct output should be:
10.1177/0748233717726877 Malathion and dithane induce DNA damage in Vicia faba.
10.1016/j.crvi.2015.02.001 Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.
Could anyone provide me with some hint towards solving this annoying problem?
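One direction that might work is to loop over each "PubmedArticle" record and look up the title and the DOI inside that record only, so a title can never be paired with another record's ID. The following is only a minimal sketch under that assumption: the helper name titles_by_doi and the inline sample string are hypothetical (the sample is a trimmed copy of the structure above), and the element paths assume the layout shown in this question. It also assumes titles contain plain text only, since findtext returns just the text before any nested markup.

```python
import xml.etree.ElementTree as ET

def titles_by_doi(root):
    """Map DOI -> article title, skipping records that have no DOI."""
    pubs = {}
    # Visit one record at a time so a title is only paired with its own IDs.
    for article in root.iter('PubmedArticle'):
        title = article.findtext('MedlineCitation/Article/ArticleTitle')
        # The attribute predicate selects only the ArticleId whose IdType is "doi";
        # findtext returns None when no such element exists in this record.
        doi = article.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
        if doi is not None:
            pubs[doi] = title
    return pubs

# Hypothetical inline sample, trimmed from the structure in the question.
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><Article>
      <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">28950791</ArticleId>
      <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><Article>
      <ArticleTitle>[Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].</ArticleTitle>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">27548984</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(sample)
pubs = titles_by_doi(root)
if not pubs:
    print("No articles found")
else:
    for doi, title in pubs.items():
        print(doi + ' ' + title)
```

With a file on disk, the same helper could be called as titles_by_doi(ET.parse('sample.xml').getroot()). The explicit paths are used instead of a './/ArticleId' search so that an ArticleId nested elsewhere in a record is not picked up by accident.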