How to remove elements from XML using Python

Question

I'm stuck on an XML task in Python. It seems simple, but I have spent a long time on it without success, so I'm asking for advice on how to solve it in a couple of lines.

Thanks for any help with traversing the tree; I always end up with too many or too few elements. Elements can be nested to any depth, and the sample below is only an illustration. I will accept any solution and am not picky about DOM, minidom, SAX, or anything else.

I have an XML file similar to this one:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="en">Content EN</elm>
            <elm lang="cs">žluťoučký koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="en">Content EN</elm>
        <elm lang="cs">Content CS</elm>

        <elm lang="en">
            <elm>Content EN</elm>
            <elm>Content EN</elm>
        </elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

What I need: parse the XML and write a new file. The new file should contain all elements for a given language plus all elements without a lang attribute.

For the "cs" language, the output file should contain this:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="cs">žluťoučký koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="cs">Content CS</elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

If the lang attribute can be omitted from the new file, even better, but that is not essential.

UPDATE1: Added unicode characters and namespace attribute.

UPDATE2: Using Python 2.5, standard libraries preferred.

"For "en" language the output file should contain this:" I assume you meant to say that the given output is for the "cs" language?
– LarsH, Commented Aug 31, 2010 at 18:21
@LarsH: I updated the question to add some Unicode characters. You're right, it should say: for the "cs" language. I will change it.
– dwich, Commented Aug 31, 2010 at 22:27

unutbu (edited by Gal Bracha) · Accepted Answer · 2016-04-16 12:13:51Z

15

Using lxml:

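import lxml.etree as le

# Open in binary mode so lxml can detect the encoding itself.
with open('doc.xml', 'rb') as f:
    doc = le.parse(f)

# Visit every element that carries a lang attribute.
for elem in doc.xpath('//*[attribute::lang]'):
    if elem.attrib['lang'] == 'en':
        # Keep the element but drop the attribute.
        elem.attrib.pop('lang')
    else:
        # Remove the element together with its whole subtree.
        elem.getparent().remove(elem)
print(le.tostring(doc))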

yields

<root>
    <elm>Common content</elm>

    <elm>
        <elm>Content EN</elm>
        </elm>

    <elm>Common content</elm>

    <elm>Content EN</elm>
    <elm>
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>

    </root>

edited Apr 16, 2016 at 12:13

Gal Bracha


answered Aug 29, 2010 at 2:18

unutbu


2 Comments

dwich Over a year ago

Thanks a lot. Can't install lxml on my WinXP, problem with compiler. Will give it a try later.

dwich Over a year ago

Works! Thanks! You saved my night :) I thank both of you, both solutions are good.

Alex Martelli · Answer · 2010-08-29 02:10:07Z

6

I'm not sure how best to remove the lang attribute, but here's some code that makes the other changes (Python 2.7; for 2.5 or 2.6, use getiterator instead of iter). It assumes that when you remove an element you also want to remove everything it contains.

This code just prints the result to standard output (you could redirect it as you wish, of course, or directly write it to some new file, and so on):

import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element:
            la = subelement.get('lang')
            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print

With la.xml being your example, this writes

<root>
    <elm>Common content</elm>

    <elm>
        <elm lang="en">Content EN</elm>
        </elm>

    <elm>Common content</elm>

    <elm lang="en">Content EN</elm>
    <elm lang="en">
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>

    </root>
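One possible way to also drop the lang attribute using only the standard library is sketched below. It assumes Python 3 and xml.etree.ElementTree (on Python 2.5 the same idea would use getiterator() instead of iter()). The helper name filter_lang and the inline sample are illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET

def filter_lang(root, lang):
    """Remove subtrees whose lang differs from `lang`; strip lang elsewhere.

    Hypothetical helper: mutates `root` in place and returns it.
    """
    for parent in root.iter():
        # Iterate over a copy so removals do not skip siblings.
        for child in list(parent):
            value = child.get('lang')
            if value is None:
                continue                      # language-neutral element
            if value != lang:
                parent.remove(child)          # drops the whole subtree
            else:
                del child.attrib['lang']      # keep element, drop attribute
    return root

sample = (
    '<root><elm>'
    '<elm>Common</elm>'
    '<elm lang="en">EN</elm>'
    '<elm lang="cs">CS</elm>'
    '<elm lang="en"><elm>EN nested</elm></elm>'
    '</elm></root>'
)
root = filter_lang(ET.fromstring(sample), 'cs')
result = ET.tostring(root, encoding='unicode')
print(result)
```

The sketch does not handle a lang attribute on the root element itself, since the root has no parent to be removed from.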

answered Aug 29, 2010 at 2:10

Alex Martelli


4 Comments

dwich Over a year ago

Thank you Alex, works great. Except two things - namespace and unicode. If there's an xmlns attribute, for example <elm xmlns="http://example.org/ns">, the new node itself gets an xmlns:ns0="http://example.org/ns" attribute and all the child nodes get an <ns0: prefix. These prefixes are not present in the source file. Also cannot force write() method to write unicode characters in their original form. I'll update the example file.

Alex Martelli Over a year ago

@dwich, for the writing you can just add to the write call an encoding parameter of your choice. Aesthetics such as the namespace issue (which I believe don't change the semantics of the XML) are much ticklier to deal with, alas (just like, e.g., you may have noticed, the indentation in the output is different, because whitespace in elements being removed also goes away).

dwich Over a year ago

The Unicode issue was my mistake: I started playing with codecs and, even though I used encoding='utf-8', it didn't work because I was opening the file incorrectly. Thank you for your answer. I will pick @unutbu's solution, as his code doesn't have the namespace problem. Both answers are correct. Thank you, guys!

Alex Martelli Over a year ago

@dwich, I agree with you - @unutbu's answer is better (if you can use third party packages like lxml), among other things because it does remove the attribute, as you ideally desired, while mine, as I mentioned, didn't.
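The encoding parameter mentioned in the comments above can be illustrated with a small sketch. It assumes Python 3 and xml.etree.ElementTree, and it writes to in-memory buffers instead of a real file so the effect is easy to inspect; the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><elm lang="cs">žluťoučký koníček</elm></root>')
tree = ET.ElementTree(root)

# Default encoding is US-ASCII: non-ASCII characters are written as
# numeric character references such as &#382;.
ascii_buf = io.BytesIO()
tree.write(ascii_buf)

# With an explicit encoding the characters are written as-is.
utf8_buf = io.BytesIO()
tree.write(utf8_buf, encoding='utf-8', xml_declaration=True)

ascii_bytes = ascii_buf.getvalue()
utf8_bytes = utf8_buf.getvalue()
print(ascii_bytes)
print(utf8_bytes.decode('utf-8'))
```

Both outputs represent the same document; only the way the non-ASCII characters are serialized differs.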

bhuvi · Answer · 2015-11-17 20:28:08Z

This updates @Alex Martelli's code to fix a bug where the element's child list is modified in place during iteration. The solution above gives a wrong answer if the input is a little more complex.

import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element[:]:
            la = subelement.get('lang')

            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print

On line 7, for subelement in element: is changed to for subelement in element[:]: because it is incorrect to modify a list in place while iterating over it.

The code now iterates over a copy of the element's child list and removes, from the original element, every child whose lang attribute is set and differs from "en".
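A minimal sketch of why the copy matters, assuming Python 3 and xml.etree.ElementTree. The three-element sample and the function name langs_left are made up for illustration: two unwanted siblings sit next to each other, which is the case where in-place removal skips one.

```python
import xml.etree.ElementTree as ET

XML = '<root><elm lang="en"/><elm lang="de"/><elm lang="cs"/></root>'

def langs_left(copy_children):
    """Keep only lang="cs" children; return the lang values that remain."""
    root = ET.fromstring(XML)
    # Either iterate over a snapshot of the children or over the live element.
    children = list(root) if copy_children else root
    for child in children:
        if child.get('lang') != 'cs':
            root.remove(child)
    return [e.get('lang') for e in root]

buggy = langs_left(copy_children=False)  # removes while iterating
fixed = langs_left(copy_children=True)   # iterates over a copy
print(buggy)
print(fixed)
```

Without the copy, removing the "en" element shifts "de" into the position just visited, so the loop never examines it and it survives.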
