
I would like to write a Python script that walks every element of an XML tree, starting from the root, and collects tags, attributes and text in document order. Ideally, each tag name is concatenated with its attribute keys and with the tag names of its child nodes, so that every value appears next to the full path that leads to it.

So, for the following example from the ElementTree documentation

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

the desired output would be

country.name Liechtenstein
country.rank 1
country.year 2008
country.gdppc 141100
country.neighbor.name Austria
country.neighbor.direction E
country.neighbor.name Switzerland
country.neighbor.direction W
country.name Singapore
country.rank 4
country.year 2011
country.gdppc 59900
country.neighbor.name Malaysia
country.neighbor.direction N
country.name Panama
country.rank 68
country.year 2011
country.gdppc 13600
country.neighbor.name Costa Rica
country.neighbor.direction W
country.neighbor.name Colombia
country.neighbor.direction E

The script I have been working on is not general enough: it does not discover the objects (tags, attributes, text) at each level on its own. Child tags work fine, but only as long as the depth is known in advance (here 2, hence 2 nested loops). As you can see, the text is printed separately from the tags it belongs to, and None entries are included, which need to be excluded.

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib.keys(), child.attrib.get('name'))
    for child1 in child:
        print(child1.tag, child1.attrib.items())

for i in range(0,3):
    for j in range(0,3):
        print(root[i][j].text)

The output is:

country dict_keys(['name']) Liechtenstein
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Austria'), ('direction', 'E')])
neighbor dict_items([('name', 'Switzerland'), ('direction', 'W')])
country dict_keys(['name']) Singapore
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Malaysia'), ('direction', 'N')])
country dict_keys(['name']) Panama
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Costa Rica'), ('direction', 'W')])
neighbor dict_items([('name', 'Colombia'), ('direction', 'E')])
1
2008
141100
4
2011
59900
68
2011
13600
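
For reference, a minimal sketch of where the None entries and the stray text come from. It assumes a shortened copy of the sample file, parsed from a string (the `SAMPLE` name is only for illustration). A self-closing element such as `neighbor` has `text` set to None, and a container such as `country` holds only the indentation whitespace before its first child. `Element.iter()` visits the whole subtree at any depth, so the nesting depth does not have to be known.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the sample document.
SAMPLE = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>
"""

root = ET.fromstring(SAMPLE)
country = root[0]
rank, neighbor = country[0], country[1]

print(repr(rank.text))      # '1'
print(repr(neighbor.text))  # None: a self-closing element has no text
print(repr(country.text))   # whitespace only: the indentation before <rank>

# iter() walks every element in document order, whatever the depth.
# Skip elements whose text is None or only whitespace.
texts = [el.text.strip() for el in root.iter() if el.text and el.text.strip()]
print(texts)
```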

1 Answer

I feel like there should be a better library for working with XML files, but I haven't found one yet. Maybe there's room for improvement there. Anyway, this is the solution I came up with: a recursive function extracts as much detail as possible from each element and returns it to the caller one level up, which prefixes it with its own tag.

import xml.etree.ElementTree as ET

xml = ET.parse('p.xml')

root = xml.getroot()

def getDataRecursive(element):
    data = []

    # attributes can appear on any element, so collect them first
    for key, value in element.attrib.items():
        data.append(element.tag + '.' + key + ' ' + value)

    # only leaf elements (no children) have important text, at least in this example
    if len(element) == 0:
        if element.text is not None:
            data.append(element.tag + ' ' + element.text)

    # otherwise, recurse into the children and prefix their results with the current tag
    else:
        for el in element:
            for data_point in getDataRecursive(el):
                data.append(element.tag + '.' + data_point)

    return data

# print results
for x in getDataRecursive(root):
    print(x)
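
One thing to be aware of: calling the function on the root itself prefixes every line with the root tag (`data.`), which the desired output in the question does not have. Below is a rough sketch of a generator-based variant that starts from the root's children instead. The name `flatten` and the inline `SAMPLE` string are made up for this illustration. Unlike the function above, it also reports text on elements that have children, as long as that text is more than whitespace.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the sample document from the question.
SAMPLE = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
    </country>
</data>
"""


def flatten(element, prefix=""):
    """Yield 'dotted.path value' lines for element and its descendants."""
    path = prefix + element.tag
    # Attributes first, in the order they appear in the document.
    for key, value in element.attrib.items():
        yield f"{path}.{key} {value}"
    # Report text only when it is more than indentation whitespace.
    text = (element.text or "").strip()
    if text:
        yield f"{path} {text}"
    # Recurse into the children, extending the dotted path.
    for child in element:
        yield from flatten(child, path + ".")


root = ET.fromstring(SAMPLE)
# Start from the root's children so the 'data.' prefix is left out.
lines = [line for country in root for line in flatten(country)]
print("\n".join(lines))
```

For the shortened sample this prints `country.name Liechtenstein`, `country.rank 1`, `country.neighbor.name Austria`, `country.neighbor.direction E`, `country.name Singapore` and `country.rank 4`, one per line.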

3 Comments

Thank you for your useful answer. Do you know how to remove the annoying namespace URIs that end up in front of the tag names? Take for example the actors dataset from the ET documentation: the output with your code looks like this imgur.com/K5zDKWA
I've figured it out by adding re.sub(r'\s*\{.*?\}\s*', '', x) while going through the list. Thanks again!!
That should work fine, depending on the available tags. I would have gone with a simpler r'^\{.*\}', but again it depends.
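
To illustrate the "it depends" part, here is a small sketch comparing the two patterns from the comments. The document is a hypothetical one, loosely modelled on the actors example, and `local_name` is a made-up helper. ElementTree stores a namespaced tag as `{uri}localname`, so a flattened line contains one `{...}` group per path segment. On a single tag both patterns give the same result. On a whole flattened line the greedy, anchored pattern matches from the first `{` to the last `}` and so removes the intermediate tags as well.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespaced document, loosely modelled on the actors example.
SAMPLE = """<actors xmlns="http://people.example.com">
    <actor><name>John Cleese</name></actor>
</actors>
"""

root = ET.fromstring(SAMPLE)
name = root[0][0]
print(name.tag)  # {http://people.example.com}name


def local_name(tag):
    """Drop a leading '{uri}' from a tag; tags without one are unchanged."""
    return tag.rsplit("}", 1)[-1]


# A flattened line in the style produced by the recursive function.
line = ".".join(el.tag for el in (root, root[0], name)) + " " + name.text

# Pattern from the second comment: non-greedy, removes every {...} group.
non_greedy = re.sub(r"\s*\{.*?\}\s*", "", line)
# Pattern from the third comment: greedy, runs up to the last '}'.
greedy = re.sub(r"^\{.*\}", "", line)

print(non_greedy)  # actors.actor.name John Cleese
print(greedy)      # name John Cleese
```

Stripping the namespace from each tag while building the path (as `local_name` does) avoids running a regex over the element text at all, which may matter if the text itself can contain braces.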
