11

I want to parse a fairly large XML-like file that has no root element. The format of the file is:

<tag1>
<tag2>
</tag2>
</tag1>

<tag1>
<tag3/>
</tag1>

What I tried:

  1. I tried ElementTree, but it raised a "no root" error. (Is there another Python library that can parse this file?)
  2. I tried adding an extra tag to wrap the entire file and then parsing it with ElementTree. However, I would prefer a more efficient method that does not require altering the original XML file.
3 Comments
  • How large is the file? Commented May 27, 2014 at 14:36
  • It contains over 3 million useful terms (apart from the tags and other unnecessary data) Commented May 27, 2014 at 14:38
  • Approximate file size? Are you looking for time efficiency or memory efficiency? Can the whole file be read into memory? Commented May 27, 2014 at 15:02

3 Answers

10

ElementTree.fromstringlist accepts an iterable (that yields strings).

Using it with itertools.chain:

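import itertools
import xml.etree.ElementTree as ET
# Python 2 only: import xml.etree.cElementTree as ET (removed in Python 3.9)

with open('xml-like-file.xml') as f:
    # Chain the wrapper tags around the file's lines; the file on disk is never modified.
    # The tags sit in lists so that each one is fed to the parser as a single string.
    it = itertools.chain(['<root>'], f, ['</root>'])
    root = ET.fromstringlist(it)

# Do something with `root`
root.find('.//tag3')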
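
If memory is the concern, the same wrapping idea can be combined with a pull parser so that only one top-level element is held at a time. This is a minimal sketch, assuming Python 3.4+ (for `xml.etree.ElementTree.XMLPullParser`); `iter_top_level` and `chunk_size` are hypothetical names chosen for illustration, and the approach has not been benchmarked against the one above.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_top_level(stream, chunk_size=64 * 1024):
    """Yield each top-level element of a rootless XML text stream.

    `stream` is any text-mode file-like object (hypothetical helper).
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    # Synthetic wrapper tags are fed to the parser only; the input is untouched.
    chunks = itertools.chain(
        ["<root>"],
        iter(lambda: stream.read(chunk_size), ""),  # stops at EOF ("")
        ["</root>"],
    )
    depth = 0
    root = None
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem  # the synthetic <root>
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the synthetic root just closed.
                    yield elem
                    # Drop finished children so memory use stays bounded.
                    root.clear()


sample = io.StringIO("<tag1>\n<tag2>\n</tag2>\n</tag1>\n\n<tag1>\n<tag3/>\n</tag1>\n")
for elem in iter_top_level(sample):
    print(elem.tag, [child.tag for child in elem])
```

Each yielded element should be processed before advancing the generator, because the synthetic root is cleared once the consumer asks for the next element.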

4 Comments

I guess this would not be an efficient method in case of large files. Also as I said earlier, I would like to implement this using a different method rather than adding tags to the input.
@sgp, I should have written f instead of f.read(); updated. At least this does not read the whole content at once.
Don't you think that perhaps using a different library would be better? I mean, again, you are just adding extra tags to the XML, right? Could you explain why you think your method is efficient? Thanks :)
@sgp, Because this does not load the whole contents at once, as I said in the previous comment. I didn't benchmark the solutions, so I can't tell which one performs better. (BTW, try cElementTree instead of ElementTree.)
10

lxml.html can parse fragments:

from lxml import html
s = """<tag1>
 <tag2>
 </tag2>
</tag1>

<tag1>
 <tag3/>
</tag1>"""
doc = html.fromstring(s)
for thing in doc:
    print(thing)
    for other in thing:
        print(other)
"""
>>> 
<Element tag1 at 0x3411a80>
<Element tag2 at 0x3428990>
<Element tag1 at 0x3428930>
<Element tag3 at 0x3411a80>
>>>
"""

Courtesy this SO answer

And if there is more than one level of nesting:

def flatten(nested):
    """Recursively flatten nested elements.

    Yields individual elements.
    """
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other

doc = html.fromstring(s)
for thing in flatten(doc):
    print(thing)

Similarly, lxml.etree.HTML will parse this. It adds html and body tags:

from lxml import etree

d = etree.HTML(s)
for thing in d.iter():
    print(thing)

""" 
<Element html at 0x3233198>
<Element body at 0x322fcb0>
<Element tag1 at 0x3233260>
<Element tag2 at 0x32332b0>
<Element tag1 at 0x322fcb0>
<Element tag3 at 0x3233148>
"""


8

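Instead of editing the file, how about wrapping its contents in memory? Something like this:

import xml.etree.ElementTree as ET

# open() replaces the Python 2-only file() builtin
with open("xml-file.xml") as f:
    # f.read() loads the whole file into memory; the file on disk is unchanged
    xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])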
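
Since `fromstringlist` accepts any iterable of string fragments, the single `f.read()` could be swapped for a generator that reads fixed-size chunks. The sketch below is one way to do that; `wrapped_chunks`, `parse_rootless`, and the 1 MiB default chunk size are hypothetical choices, not part of the answer above. Note that the finished tree is still built fully in memory; only the raw text is read piecewise.

```python
import xml.etree.ElementTree as ET


def wrapped_chunks(path, chunk_size=1 << 20):
    """Yield '<root>', then the file in fixed-size chunks, then '</root>'."""
    yield "<root>"
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty string means end of file
                break
            yield chunk
    yield "</root>"


def parse_rootless(path):
    """Parse a rootless XML-like file under a synthetic <root> element."""
    return ET.fromstringlist(wrapped_chunks(path))


# Usage (the file name is the one from the answer above):
# root = parse_rootless("xml-file.xml")
```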

5 Comments

In my question, I have said that I already tried this and I would like a better method than that. As the file is large enough, the method you suggested is not an efficient one.
@sgp You said you edited the original file. That's not what this does. You wrote: "I would like to use some more efficient method, in which I would not need to alter the original xml file". The original file is unchanged.
In practice, both are the same methods aren't they? You basically added the extra tag to the string. I would rather like to have a library that doesn't give a "no root error". This method is not a good one because as the file size is large, the process of adding a tag to the string would take some time, thereby causing the inefficiency.
@sgp See my updated answer; I no longer edit the string. This is more efficient than editing the file: I only write to memory, not to disk.
@sgp - adding the two tags in this and falsetru's solution is trivial. It adds them on-the-fly - there is no extra iteration over the file contents, it is not constructing a complete new string with the added tags.
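
The on-the-fly behaviour described in the last comment can be checked with a small experiment. This is only an illustrative sketch: `tracked_lines` is a hypothetical helper that records which lines have actually been pulled from the source, and the list of strings stands in for an open file.

```python
import itertools


def tracked_lines(lines, log):
    """Pass lines through unchanged, recording each one as it is consumed."""
    for line in lines:
        log.append(line)
        yield line


log = []
body = tracked_lines(["<tag1>\n", "<tag3/>\n", "</tag1>\n"], log)
wrapped = itertools.chain(["<root>"], body, ["</root>"])

first = next(wrapped)       # the synthetic opening tag
consumed_after_first = list(log)

second = next(wrapped)      # only now is the first real line pulled
consumed_after_second = list(log)

print(first, consumed_after_first)
print(repr(second), consumed_after_second)
```

The wrapper tags are handed out as separate items, and the underlying lines are pulled one at a time as the parser asks for them, so no combined string is ever built.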
