Python XML Parsing without root

Question

I wanted to parse a fairly huge xml-like file which doesn't have any root element. The format of the file is:

<tag1>
<tag2>
</tag2>
</tag1>

<tag1>
<tag3/>
</tag1>

What I tried:

tried using ElementTree but it returned a "no root" error. (Is there any other python library which can be used for parsing this file?)
tried adding an extra tag to wrap the entire file and then parse it using Element-Tree. However, I would like to use some more efficient method, in which I would not need to alter the original xml file.

It contains over 3 million useful terms (apart from the tags and other unnecessary data) — sgp
– sgp, Commented May 27, 2014 at 14:38
Approximate file size? Are you looking for time efficiency or memory efficiency? Can the whole file be read into memory? — wwii
– wwii, Commented May 27, 2014 at 15:02

falsetru · Accepted Answer · 2014-05-27 14:51:26Z

10

ElementTree.fromstringlist accepts an iterable (that yields strings).

Using it with itertools.chain:

import itertools
import xml.etree.ElementTree as ET
# import xml.etree.cElementTree as ET

with open('xml-like-file.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = ET.fromstringlist(it)

# Do something with `root`
root.find('.//tag3')

edited May 27, 2014 at 14:51

answered May 27, 2014 at 14:16

falsetru

371k69 gold badges769 silver badges659 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

sgp Over a year ago

I guess this would not be an efficient method in case of large files. Also as I said earlier, I would like to implement this using a different method rather than adding tags to the input.

falsetru Over a year ago

@sgp, I should write f instead of f.read(). updated.; At least this does not read the whole content at once.

sgp Over a year ago

Don't you think that perhaps using a different library would be better? I mean again, you are just adding extra tags to the xml, right? Could you explain me why do you think your method is efficient? Thanks :)

falsetru Over a year ago

@sgp, Because this does not load the whole contents at once as I said in the previous comment. I didn't benchmark the solutions; I can't tell that which one perform better. (BTW, try cElementTree instead of ElementTree)

Community · Accepted Answer · 2017-05-23 12:08:44Z

lxml.html can parse fragments:

from lxml import html
s = """<tag1>
 <tag2>
 </tag2>
</tag1>

<tag1>
 <tag3/>
</tag1>"""
doc = html.fromstring(s)
for thing in doc:
    print thing
    for other in thing:
        print other
"""
>>> 
<Element tag1 at 0x3411a80>
<Element tag2 at 0x3428990>
<Element tag1 at 0x3428930>
<Element tag3 at 0x3411a80>
>>>
"""

Courtesy this SO answer

And if there is more than one level of nesting:

def flatten(nested):
    """recusively flatten nested elements

    yields individual elements
    """
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other
doc = html.fromstring(s)
for thing in flatten(doc):
    print thing

Similarly, lxml.etree.HTML will parse this. It adds html and body tags:

d = etree.HTML(s)
for thing in d.iter():
    print thing

""" 
<Element html at 0x3233198>
<Element body at 0x322fcb0>
<Element tag1 at 0x3233260>
<Element tag2 at 0x32332b0>
<Element tag1 at 0x322fcb0>
<Element tag3 at 0x3233148>
"""

nettux · Accepted Answer · 2014-05-27 14:23:56Z

8

How about instead of editing the file do something like this

import xml.etree.ElementTree as ET

with file("xml-file.xml") as f:
    xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])

edited May 27, 2014 at 14:23

answered May 27, 2014 at 14:10

nettux

5,4444 gold badges26 silver badges34 bronze badges

5 Comments

sgp Over a year ago

In my question, I have said that I already tried this and I would like a better method than that. As the file is large enough, the method you suggested is not an efficient one.

nettux Over a year ago

@sgp You said you edited the original file. That's not what this does. I would like to use some more efficient method, in which I would not need to alter the original xml file ... The original file is unchanged.

sgp Over a year ago

In practice, both are the same methods aren't they? You basically added the extra tag to the string. I would rather like to have a library that doesn't give a "no root error". This method is not a good one because as the file size is large, the process of adding a tag to the string would take some time, thereby causing the inefficiency.

nettux Over a year ago

@sgp See my updated answer I know longer edit the string. This is more efficient than editing the feel. I only write to memory and not to disk.

wwii Over a year ago

@sgp - adding the two tags in this and falsetru's solution is trivial. It adds them on-the-fly - there is no extra iteration over the file contents, it is not constructing a complete new string with the added tags.

Collectives™ on Stack Overflow

Python XML Parsing without root

3 Answers 3

4 Comments

Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related