I have a large xml data. The format is given as following:

  <person>
        <name>Tom</name>
        <age>18</age>
        <address> London, xxx street, xxx building</address>
  </person>
  <person>
        <name>John</name>
        <age>22</age>
        <address> Canberra, xxx street, xxx building, xxx floor, xxx room,
                 xxx  bed</address>
  </person>

be careful about the address!

I want to get the following result:

  Tom^18^London, xxx street, xxx building
  John^22^Canberra, xxx street, xxx building, xxx floor, xxx room,xxx  bed

The data is fairly large, and I hope I can read it line by line so that there will not be a memory problem. Thanks.

  • I don't know why it looks this way; however, I get the data in this format. Commented Mar 13, 2012 at 15:04
  • That's not XML. You need to tell whoever is sending you this kind of data to fix his "xml" generation code so it actually produces XML. Commented Mar 13, 2012 at 15:04

4 Answers


Don't torture yourself by parsing strings if there is an XML library available -- lxml, for example. If you want to stick with Python's batteries included, then have a look at xml.dom.minidom.
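For example, a minimal xml.dom.minidom sketch -- the `<root>` wrapper is an assumption for illustration, since the posted fragment has no single root element and wouldn't parse as-is:

```python
import xml.dom.minidom as minidom

# Assumed <root> wrapper: the posted data has no single root element,
# so it must be wrapped before it is valid XML.
data = """<root>
  <person><name>Tom</name><age>18</age>
          <address>London, xxx street, xxx building</address></person>
  <person><name>John</name><age>22</age>
          <address>Canberra, xxx street, xxx building</address></person>
</root>"""

doc = minidom.parseString(data)
rows = []
for person in doc.getElementsByTagName('person'):
    # firstChild.data is the text node inside each tag
    fields = [person.getElementsByTagName(tag)[0].firstChild.data
              for tag in ('name', 'age', 'address')]
    rows.append('^'.join(fields))

print('\n'.join(rows))
```

Note that minidom loads the whole document into memory, so for truly large inputs a streaming parser (see the iterparse answer below) is the better fit.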


3 Comments

Thanks. I want to know whether I could use lxml, since the data is very large. Reading line by line with Python would make it possible.
@FrankWANG - For streaming XML, take a look at sax (wiki.python.org/moin/Sax)
@FrankWANG - you certainly can use lxml to process large data sets - I have just used lxml to process over a million XML documents located in around 480 files (a total of about 5GB of data), and the iterparse() function made it possible to keep memory usage very slim. It stayed at around 70MB, even with the rest of my framework loaded.

First of all, your XML is not valid:

  • \ (backslash) in XML tags
  • <person> tags are not closed
  • there is no single root element

After fixing the problems above, your task can be solved with XPath, which is provided by the lxml module.
BTW, there is a command line tool called xmlstarlet:

$ xmlstarlet sel -t -m '//person' -v 'normalize-space(concat(name, "^", age, "^", address))' -n input.xml
Tom^18^ London, xxx street, xxx building
John^22^ Canberra, xxx street, xxx building, xxx floor, xxx room, xxx bed
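Roughly the same selection can be sketched in Python with the stdlib ElementTree's limited XPath subset. `normalize-space()` isn't supported there, so whitespace is collapsed with `re.sub` instead; the `<root>` wrapper is again an assumption, since the posted fragment has no root element:

```python
import re
import xml.etree.ElementTree as etree

# Assumed <root> wrapper -- the posted fragment has no single root element.
data = """<root>
  <person><name>Tom</name><age>18</age>
          <address> London, xxx street,
                   xxx building</address></person>
</root>"""

root = etree.fromstring(data)
lines = []
for person in root.findall('.//person'):  # XPath-style selection
    fields = (person.findtext(tag) for tag in ('name', 'age', 'address'))
    # Poor man's normalize-space(): collapse whitespace runs, trim the ends.
    lines.append('^'.join(re.sub(r'\s+', ' ', f).strip() for f in fields))

print('\n'.join(lines))
```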



Make sure that the input is well-formed XML.

To process a large xml file with limited memory you could use ElementTree.iterparse():

#!/usr/bin/env python
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(source, tag):
    """Yield matching elements one by one without loading the whole tree."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory

for elem in getelements('big.xml', 'person'):
    print('^'.join(elem.find(tag).text for tag in 'name age address'.split()))

You could add a special processing for a multiline text inside a tag (as in your example for address) e.g., you could use re.sub(r'\s+', ' ', text) to normalize whitespace.
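For instance, applied to the multiline address from the question:

```python
import re

# Multiline address text as it appears inside the <address> tag.
text = """ Canberra, xxx street, xxx building, xxx floor, xxx room,
         xxx  bed"""

# Collapse newlines and runs of spaces into single spaces, then trim the ends.
normalized = re.sub(r'\s+', ' ', text).strip()
print(normalized)
```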

Output

Tom^18^London, xxx street, xxx building
John^22^Canberra, xxx street, xxx building, xxx floor, xxx room, xxx bed

If input xml might contain '^' and you'd like to escape it then you could use csv module to produce output.
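A sketch of that escaping with the csv module; the stray '^' in the address field is an invented example to show the quoting:

```python
import csv
import io

buf = io.StringIO()
# With delimiter='^', the csv writer quotes any field that itself contains '^'.
writer = csv.writer(buf, delimiter='^')
writer.writerow(['Tom', '18', 'London, xxx ^ street'])
print(buf.getvalue().strip())
```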



Check out Beautiful Soup; I always use it for XML and HTML (although it is a bit slow).

