Problem traversing XML tree with Python xml.etree.ElementTree

Question

I have an XML file structured as shown below (simplified for the purposes of this question). For each record, I want to extract the article title and the DOI, i.e. the text of the "ArticleId" element whose "IdType" attribute is "doi" (some records have no such element), and then store the article title in a dictionary with the DOI as the key.

<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <Article PubModel="Print-Electronic">
            <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
        </Article>
    </MedlineCitation>  
    <PubmedData>
        <ArticleIdList>
            <ArticleId IdType="pubmed">28950791</ArticleId>
            <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>

<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <Article PubModel="Print-Electronic">
            <ArticleTitle>Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.</ArticleTitle>
        </Article>
    </MedlineCitation>  
    <PubmedData>
        <ArticleIdList>
            <ArticleId IdType="pubmed">25747267</ArticleId>
            <ArticleId IdType="pii">S1631-0691(15)00050-5</ArticleId>
            <ArticleId IdType="doi">10.1016/j.crvi.2015.02.001</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>

<PubmedArticle>
    <MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
        <Article PubModel="Print-Electronic">
            <ArticleTitle>[Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].</ArticleTitle>
        </Article>
    </MedlineCitation>
    <PubmedData>
    <ArticleIdList>
        <ArticleId IdType="pubmed">27548984</ArticleId>
        <!-- in this record, DOI is missing -->
    </ArticleIdList>
    </PubmedData>
</PubmedArticle>
</PubmedArticleSet>

In a lame attempt to achieve that, I used xml.etree.ElementTree, as follows:

import xml.etree.ElementTree as ET

xmldoc = ET.parse('sample.xml')
root = xmldoc.getroot()
pubs = {}
for elem in xmldoc.iter(tag='ArticleTitle'):
    title = elem.text
    for subelem in xmldoc.iter(tag='ArticleId'):
        if subelem.get("IdType") == "doi":
            doi = subelem.text 
            pubs[doi] = title

if len(pubs) == 0:
   print "No articles found"
else:   
   for pub in pubs.keys():
       print pub + ' ' + pubs[pub]

But there is a problem with the loops traversing the document tree, because the above code results in:

10.1177/0748233717726877 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].
10.1016/j.crvi.2015.02.001 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].

That is, I get the correct DOIs, but each one is paired with the title of the last article, which has no DOI!

The correct output should be:

10.1177/0748233717726877 Malathion and dithane induce DNA damage in Vicia faba.
10.1016/j.crvi.2015.02.001 Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.

Could anyone provide me with some hint towards solving this annoying problem?

Tomalak · Accepted Answer · 2019-01-31 18:20:27Z

This is fundamentally wrong:

for elem in xmldoc.iter(tag='ArticleTitle'):      # <-- *ALL* <ArticleTitle> elements
    ...
    for subelem in xmldoc.iter(tag='ArticleId'):  # <-- *ALL* <ArticleId> elements
        ...

ElementTree does no mind-reading: it cannot know that you only want the <ArticleId> elements associated with the <ArticleTitle> you last looked at. Both loops scan the whole document, so whatever that code pairs up is not actually related.
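To make the effect concrete, here is a minimal Python 3 sketch on a made-up two-record sample (the titles, DOI and PubMed ID are placeholders, not data from the question). It shows that a document-wide iter() returns every <ArticleId> regardless of record, and that the nested loops therefore leave each DOI paired with the last title.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample, reduced from the structure in the question.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><Article><ArticleTitle>First title</ArticleTitle></Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="doi">10.1000/first</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><Article><ArticleTitle>Second title</ArticleTitle></Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">12345</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)

# Document-wide search: every <ArticleId>, whichever record it belongs to.
all_ids = [e.text for e in root.iter('ArticleId')]

# Nested document-wide loops: every title is paired with every DOI in turn,
# so the title visited last overwrites all earlier ones.
pubs = {}
for title_elem in root.iter('ArticleTitle'):
    for id_elem in root.iter('ArticleId'):
        if id_elem.get('IdType') == 'doi':
            pubs[id_elem.text] = title_elem.text

print(all_ids)  # ['10.1000/first', '12345']
print(pubs)     # {'10.1000/first': 'Second title'}
```

The DOI belongs to the first record, yet it ends up with the second record's title, which is the same symptom as in the question.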

Structure your code around the actual XML document ("for each PubmedArticle...") and use relative searches:

pubs = []

# xmldoc is the tree parsed in the question: xmldoc = ET.parse('sample.xml')
for pubmedArticle in xmldoc.iter(tag='PubmedArticle'):
    # relative search within this <PubmedArticle>
    articleTitle = pubmedArticle.find('./MedlineCitation/Article/ArticleTitle')

    # always verify that a search actually returned something;
    # compare with "is None", since an Element without children is falsy
    if articleTitle is None:
        title = "No article title found"
    else:
        title = articleTitle.text

    # only the <ArticleId> elements that belong to this record
    for articleId in pubmedArticle.iterfind('./PubmedData//ArticleId'):
        if articleId.get("IdType") == "doi":
            pubs.append({"doi": articleId.text, "title": title})

I would also recommend building a list of dicts instead of a single dict. It will be easier to work with in your later code. The result looks like this:

[
    {'doi': '10.1177/0748233717726877', 'title': 'Malathion and dithane induce DNA damage in Vicia faba.'},
    {'doi': '10.1016/j.crvi.2015.02.001', 'title': 'Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.'}
]
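For reference, a self-contained Python 3 sketch of the same approach, run against a trimmed copy of the question's sample (only the elements the search needs are kept). The function name extract_dois and the by_doi variable are illustrative choices, not part of the original answer. The last step shows one way to get back to the DOI-keyed dictionary the question asked for.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question; the third record has no DOI.
SAMPLE = """<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation><Article>
        <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
        <ArticleId IdType="pubmed">28950791</ArticleId>
        <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
    </ArticleIdList></PubmedData>
</PubmedArticle>
<PubmedArticle>
    <MedlineCitation><Article>
        <ArticleTitle>[Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].</ArticleTitle>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
        <ArticleId IdType="pubmed">27548984</ArticleId>
    </ArticleIdList></PubmedData>
</PubmedArticle>
</PubmedArticleSet>"""


def extract_dois(root):
    """Return one {'doi': ..., 'title': ...} dict per DOI found under root."""
    pubs = []
    for article in root.iter('PubmedArticle'):
        # Relative searches: only look inside the current record.
        title_elem = article.find('./MedlineCitation/Article/ArticleTitle')
        title = title_elem.text if title_elem is not None else None
        for id_elem in article.iterfind('./PubmedData//ArticleId'):
            if id_elem.get('IdType') == 'doi':
                pubs.append({'doi': id_elem.text, 'title': title})
    return pubs


pubs = extract_dois(ET.fromstring(SAMPLE))

# If a DOI-keyed dict is still wanted, it can be derived from the list.
by_doi = {p['doi']: p['title'] for p in pubs}

for doi, title in by_doi.items():
    print(doi, title)
```

Records without a DOI simply contribute nothing to the list, so no special case is needed for them.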

3 Comments

maurobio Over a year ago

Tomalak: Thanks a lot for your helpful answer. It worked like a charm on Python 2.7, but then I stumbled upon another problem: both "iter" and "iterfind" are only available on Python 2.7+, and I need to run my script on Python 2.6.6 (it is on a remote server which only offers that version). Following the instructions from here (stackoverflow.com/questions/31682186/…) I found a simple substitute for "iter", but could not apply the solution with "findall": the loop "for articleId in xmldoc.findall('./PubmedData//ArticleId'):" just returned nothing.

Tomalak Over a year ago

You are searching xmldoc again. Don't do that. Use relative searches.

maurobio Over a year ago

Yep, I just got distracted by the example from the similar question. The code is now working just fine on Python 2.6 with the line "for articleId in pubmedArticle.findall('.//ArticleId'):" in place of iterfind. Thanks again for your much appreciated help.
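The difference discussed in these comments can be sketched as follows. This is a Python 3 illustration on a one-record sample, using the root element in place of the xmldoc tree object; it is not a test of Python 2.6 behaviour. A path starting with './PubmedData' only matches direct children of the element being searched, and <PubmedData> is a child of <PubmedArticle>, not of the document root.

```python
import xml.etree.ElementTree as ET

# One record from the question's sample, trimmed to the ID list.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)

# Searched from the root, './PubmedData' would have to be a direct child
# of <PubmedArticleSet>. It is not, so nothing matches.
from_root = root.findall('./PubmedData//ArticleId')

# Searched from each record, './/ArticleId' finds that record's IDs only.
from_records = []
for article in root.findall('PubmedArticle'):
    from_records.extend(e.text for e in article.findall('.//ArticleId'))

print(from_root)     # []
print(from_records)  # ['10.1177/0748233717726877']
```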
