I have a large xml data. The format is given as following:

  <person>
        <name>Tom</name>
        <age>18</age>
        <address> London, xxx street, xxx building</address>
  </person>
  <person>
        <name>John</name>
        <age>22</age>
        <address> Canberra, xxx street, xxx building, xxx floor, xxx room,
                 xxx  bed</address>
  </person>

be careful about the address!

I want to get the following result:

  Tom^18^London, xxx street, xxx building
  John^22^Canberra, xxx street, xxx building, xxx floor, xxx room,xxx  bed

The data is fairly large, and I hope I can read it line by line so that there will not be a memory problem. Thanks.

  • I don't know why it looks this way; however, I get the data in this format. Commented Mar 13, 2012 at 15:04
  • That's not XML. You need to tell whoever is sending you this kind of data to fix his "xml" generation code so it actually produces XML. Commented Mar 13, 2012 at 15:04

4 Answers


Don't torture yourself by parsing strings if there is an XML library available -- lxml, for example. If you want to stick with Python's batteries included, then have a look at xml.dom.minidom.
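For example, a minimal xml.dom.minidom sketch -- the `<root>` wrapper is an assumption for illustration, since the posted fragment has no single root element and wouldn't parse as-is:

```python
import xml.dom.minidom as minidom

# Assumed <root> wrapper: the posted data has no single root element,
# so it must be wrapped before it is valid XML.
data = """<root>
  <person><name>Tom</name><age>18</age>
          <address>London, xxx street, xxx building</address></person>
  <person><name>John</name><age>22</age>
          <address>Canberra, xxx street, xxx building</address></person>
</root>"""

doc = minidom.parseString(data)
rows = []
for person in doc.getElementsByTagName('person'):
    # firstChild.data is the text node inside each tag
    fields = [person.getElementsByTagName(tag)[0].firstChild.data
              for tag in ('name', 'age', 'address')]
    rows.append('^'.join(fields))

print('\n'.join(rows))
```

Note that minidom loads the whole document into memory, so for truly large inputs a streaming parser (see the iterparse answer below) is the better fit.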


3 Comments

Thanks. I want to know whether I could use lxml, since the data is very large. Reading line by line with Python would make it possible.
@FrankWANG - For streaming XML, take a look at sax (wiki.python.org/moin/Sax)
@FrankWANG - you certainly can use lxml to process large data sets - I have just used lxml to process over a million XML documents located in around 480 files (a total of about 5GB of data), and the iterparse() function made it possible to keep memory usage very slim. It stayed at around 70MB, even with the rest of my framework loaded.

First of all, your XML is not valid:

  • \ (backslash) in XML tags
  • <person> tags are not closed
  • there is no single root element

After fixing the problems above, your task can be solved with XPath, which is provided by the lxml module.
BTW, there is a command line tool called xmlstarlet:

$ xmlstarlet sel -t -m '//person' -v 'normalize-space(concat(name, "^", age, "^", address))' -n input.xml
Tom^18^ London, xxx street, xxx building
John^22^ Canberra, xxx street, xxx building, xxx floor, xxx room, xxx bed
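Roughly the same selection can be sketched in Python with the stdlib ElementTree's limited XPath subset. `normalize-space()` isn't supported there, so whitespace is collapsed with `re.sub` instead; the `<root>` wrapper is again an assumption, since the posted fragment has no root element:

```python
import re
import xml.etree.ElementTree as etree

# Assumed <root> wrapper -- the posted fragment has no single root element.
data = """<root>
  <person><name>Tom</name><age>18</age>
          <address> London, xxx street,
                   xxx building</address></person>
</root>"""

root = etree.fromstring(data)
lines = []
for person in root.findall('.//person'):  # XPath-style selection
    fields = (person.findtext(tag) for tag in ('name', 'age', 'address'))
    # Poor man's normalize-space(): collapse whitespace runs, trim the ends.
    lines.append('^'.join(re.sub(r'\s+', ' ', f).strip() for f in fields))

print('\n'.join(lines))
```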



Make sure that the input is well-formed XML.

To process a large xml file with limited memory you could use ElementTree.iterparse():

#!/usr/bin/env python
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(source, tag):
    """Yield matching elements one by one without loading the whole tree."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory

for elem in getelements('big.xml', 'person'):
    print('^'.join(elem.find(tag).text for tag in 'name age address'.split()))

You could add a special processing for a multiline text inside a tag (as in your example for address) e.g., you could use re.sub(r'\s+', ' ', text) to normalize whitespace.
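For instance, applied to the multiline address from the question:

```python
import re

# Multiline address text as it appears inside the <address> tag.
text = """ Canberra, xxx street, xxx building, xxx floor, xxx room,
         xxx  bed"""

# Collapse newlines and runs of spaces into single spaces, then trim the ends.
normalized = re.sub(r'\s+', ' ', text).strip()
print(normalized)
```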

Output

Tom^18^London, xxx street, xxx building
John^22^Canberra, xxx street, xxx building, xxx floor, xxx room, xxx bed

If input xml might contain '^' and you'd like to escape it then you could use csv module to produce output.
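A sketch of that escaping with the csv module; the stray '^' in the address field is an invented example to show the quoting:

```python
import csv
import io

buf = io.StringIO()
# With delimiter='^', the csv writer quotes any field that itself contains '^'.
writer = csv.writer(buf, delimiter='^')
writer.writerow(['Tom', '18', 'London, xxx ^ street'])
print(buf.getvalue().strip())
```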



Check out Beautiful Soup; I always use it for XML and HTML (although it is a bit slow).

