I would like to write a Python script that walks every child element, starting from the root of the XML tree, and collects tags, attributes and text in document order. Ideally, each node's tag name would be concatenated with its attribute keys and with the tag names of its child nodes, so that every value is shown next to its full path.
So for the following example from the ElementTree documentation
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
the desired output would be
country.name Liechtenstein
country.rank 1
country.year 2008
country.gdppc 141100
country.neighbor.name Austria
country.neighbor.direction E
country.neighbor.name Switzerland
country.neighbor.direction W
country.name Singapore
country.rank 4
country.year 2011
country.gdppc 59900
country.neighbor.name Malaysia
country.neighbor.direction N
country.name Panama
country.rank 68
country.year 2011
country.gdppc 13600
country.neighbor.name Costa Rica
country.neighbor.direction W
country.neighbor.name Colombia
country.neighbor.direction E
The script I've been working on is far from automatic: it doesn't discover the objects (tags, attributes, text) at each level by itself. Child tags work fine, but only as long as the depth is known in advance (here 2, hence 2 nested loops). As you can see, the text is printed separately from the tags when it should not be, and None entries are included but need to be excluded.
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib.keys(), child.attrib.get('name'))
    for child1 in child:
        print(child1.tag, child1.attrib.items())
for i in range(0,3):
    for j in range(0,3):
        print(root[i][j].text)
The output is:
country dict_keys(['name']) Liechtenstein
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Austria'), ('direction', 'E')])
neighbor dict_items([('name', 'Switzerland'), ('direction', 'W')])
country dict_keys(['name']) Singapore
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Malaysia'), ('direction', 'N')])
country dict_keys(['name']) Panama
rank dict_items([])
year dict_items([])
gdppc dict_items([])
neighbor dict_items([('name', 'Costa Rica'), ('direction', 'W')])
neighbor dict_items([('name', 'Colombia'), ('direction', 'E')])
1
2008
141100
4
2011
59900
68
2011
13600
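For reference, here is a rough sketch of the kind of depth-independent traversal I have in mind. It is not a finished solution. The helper name `flatten` and the shortened inline XML are made up for illustration, and it assumes that text following a closing tag (`.tail`) can be ignored. A recursive generator emits the attributes, then the text, then the children of each element, skipping text that is None or whitespace only:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document so the sketch runs without a file.
XML = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>"""


def flatten(element, prefix=""):
    """Yield (dotted_path, value) pairs in document order (hypothetical helper)."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    # Attributes first, in the order they appear on the element.
    for key, value in element.attrib.items():
        yield f"{path}.{key}", value
    # Text next; skip None (empty elements) and whitespace-only text.
    text = (element.text or "").strip()
    if text:
        yield path, text
    # Recurse into the children, whatever the depth.
    for child in element:
        yield from flatten(child, path)


root = ET.fromstring(XML)
# Start from the children of the root so that "data" is not part of the path.
lines = [f"{path} {value}" for child in root for path, value in flatten(child)]
print("\n".join(lines))
```

With the real file, `ET.parse('country_data.xml').getroot()` would replace `ET.fromstring(XML)`; the rest should stay the same.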