Parse XML in Python without manually calling attribute, tags and child number

Question

I would like to create a Python script that walks through every child element, starting from the root of the XML tree, and collects tags, attributes and contained text in document order. Ideally, each tag name would be concatenated with its attribute keys and with the tag names of its child nodes, so that every value appears together with its full context.

So, for the following example from the ElementTree documentation,

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

the desired output would be

country.name Liechtenstein
country.rank 1
country.year 2008
country.gdppc 141100
country.neighbor.name Austria
country.neighbor.direction E
country.neighbor.name Switzerland
country.neighbor.direction W
country.name Singapore
country.rank 4
country.year 2011
country.gdppc 59900
country.neighbor.name Malaysia
country.neighbor.direction N
country.name Panama
country.rank 68
country.year 2011
country.gdppc 13600
country.neighbor.name Costa Rica
country.neighbor.direction W
country.neighbor.name Colombia
country.neighbor.direction E

The script I've been working on is far from automatic: it doesn't count the objects (tags, attributes, text) at each step. Child tags are the exception, and they only work as long as you can define their depth (here 2, for 2 loops). As you can see, the text is separated from the tags where it should not be, and None entries are included but need to be excluded.

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib.keys(), child.attrib.get('name'))
    for child1 in child:
        print(child1.tag, child1.attrib.items())

for i in range(0,3):
    for j in range(0,3):
        print(root[i][j].text)

output is...

country dict_keys(['name']) Liechtenstein
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Austria'), ('direction', 'E')])
neighbor dict_items([('name', 'Switzerland'), ('direction', 'W')])
country dict_keys(['name']) Singapore
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Malaysia'), ('direction', 'N')])
country dict_keys(['name']) Panama
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Costa Rica'), ('direction', 'W')])
neighbor dict_items([('name', 'Colombia'), ('direction', 'E')])
1
2008
141100
4
2011
59900
68
2011
13600
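
For reference, the hard-coded `range(0, 3)` loops above only reach the first three children of each country, which is why the depth and the child count have to be known in advance. As a minimal sketch (not part of the original script), `Element.iter()` visits every element at any depth, and text that is None or only whitespace can be skipped explicitly. The inline XML string and the helper name `iter_text` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample data, inlined so the sketch runs without a file.
XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>"""

def iter_text(root):
    """Yield (tag, text) for every element that has non-blank text."""
    for element in root.iter():  # walks the whole tree, at any depth
        text = element.text
        # skip elements without text and elements holding only indentation
        if text is not None and text.strip():
            yield element.tag, text.strip()

root = ET.fromstring(XML)
for tag, text in iter_text(root):
    print(tag, text)
```

This still prints the text apart from the attributes; it only removes the need to know how many children each level has.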

Magnetron · Accepted Answer · 2018-07-18 12:56:39Z

I feel like there should be a better library for working with XML files, but I haven't found one yet. Maybe there's room for improvement there. Anyway, this is the solution I came up with: a recursive function extracts as much detail as possible from each element and returns it to the level above, which prefixes every entry with its own tag.

import xml.etree.ElementTree as ET

xml = ET.parse('p.xml')

root = xml.getroot()

def getDataRecursive(element):
    data = list()

    # get attributes of element, necessary for all elements
    for key, value in element.attrib.items():
        data.append(element.tag + '.' + key + ' ' + value)

    # only leaf elements (those without children) have important text, at least in this example
    if len(element) == 0:
        if element.text is not None:
            data.append(element.tag + ' ' + element.text)

    # otherwise, go deeper and add to the current tag
    else:
        for el in element:
            within = getDataRecursive(el)

            for data_point in within:
                data.append(element.tag + '.' + data_point)

    return data

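# print results; start from the root's children so that the root tag
# ('data') is not prefixed to every line
for child in root:
    for x in getDataRecursive(child):
        print(x)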
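
A self-contained variant of the same recursive idea, offered as a sketch rather than as the code above: it passes the dotted path down as a parameter instead of prefixing the results on the way back up, and it starts from the root's children so that the root tag stays out of the path. The function name `flatten` and the shortened inline XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample data, inlined so the sketch runs without a file.
XML = """<data>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

def flatten(element, prefix=""):
    """Yield 'dotted.path value' lines for attributes and leaf text."""
    path = prefix + "." + element.tag if prefix else element.tag
    # attributes are reported for every element
    for key, value in element.attrib.items():
        yield f"{path}.{key} {value}"
    if len(element) == 0:
        # leaf element: report its text if it is not empty or blank
        if element.text is not None and element.text.strip():
            yield f"{path} {element.text.strip()}"
    else:
        # element with children: recurse, extending the path
        for child in element:
            yield from flatten(child, path)

root = ET.fromstring(XML)
# Start from the root's children so the root tag is not part of the path.
lines = [line for child in root for line in flatten(child)]
print("\n".join(lines))
```

For this sample it prints `country.name Singapore`, `country.rank 4`, `country.neighbor.name Malaysia` and `country.neighbor.direction N`, one per line.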


3 Comments

civy Over a year ago

Thank you for your useful answer. Do you know how to remove the annoying namespace URLs from the tags? Take, for example, the actor dataset from the ET documentation; the output of your code looks like this: imgur.com/K5zDKWA

civy Over a year ago

I've figured it out by adding re.sub(r'\s*\{.*?\}\s*', '', x) while going through the list. Thanks again!!

Magnetron Over a year ago

That should work fine, depending on the available tags. I would have gone with a simpler r'^\{.*\}', but again it depends.
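
The prefix discussed in these comments comes from ElementTree storing a namespaced tag as `{uri}local`. Here is a minimal sketch of stripping it from each tag before any concatenation, instead of from the finished line. The helper name `local_name` and the sample namespace URI are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Small namespaced sample, inlined so the sketch runs without a file.
XML = """<actors xmlns="http://people.example.com">
    <actor><name>John Cleese</name></actor>
</actors>"""

def local_name(tag):
    """Return the tag without a leading '{namespace-uri}' part."""
    return re.sub(r"^\{[^}]*\}", "", tag)

root = ET.fromstring(XML)
print(root.tag)              # qualified form, URI in braces
print(local_name(root.tag))  # local part only
names = [local_name(el.tag) for el in root.iter()]
print(names)
```

Applied to one tag at a time, an anchored pattern is enough. Applied to a whole concatenated line, a greedy `.*` between the braces would match from the first `{` to the last `}` and remove the tags in between, which is where the non-greedy version from the comment above behaves differently.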
