How to output CDATA using ElementTree

Question

I've discovered that cElementTree is about 30 times faster than xml.dom.minidom and I'm rewriting my XML encoding/decoding code. However, I need to output XML that contains CDATA sections and there doesn't seem to be a way to do that with ElementTree.

Can it be done?

> I need to output XML that contains CDATA sections
Why? It seems a strange requirement.
– bortzmeyer, Commented Oct 15, 2008 at 12:14
It's a requirement I have - chunks of CDATA are sometimes much more human-readable. — grifaton
– grifaton, Commented Sep 6, 2010 at 22:39
@bortzmeyer It's useful for adding HTML to KML (Google Maps XML files). — logic-unit
– logic-unit, Commented Jun 23, 2016 at 11:59

elifiner · Accepted Answer · 2008-10-06 16:41:48Z

30

After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found there was special handling of XML comments and processing instructions. What they do is create a factory function for the special element type that uses a special (non-string) tag value to differentiate it from regular elements.

def Comment(text=None):
    element = Element(Comment)
    element.text = text
    return element

Then in the _write function of ElementTree that actually outputs the XML, there's a special case handling for comments:

if tag is Comment:
    file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
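The sentinel-tag idea can be checked against the standard library directly. This is a minimal sketch, assuming Python 3's xml.etree.ElementTree; the names root and note are only for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
note = ET.Comment("just a note")
root.append(note)

# The comment element's tag is the factory function itself, not a string.
tag_is_factory = note.tag is ET.Comment

# The serializer recognizes that sentinel and emits comment markup
# instead of an ordinary start/end tag pair.
serialized = ET.tostring(root, encoding="unicode")

print(tag_is_factory)
print(serialized)
```

The CDATA factory below follows the same pattern, with the serializer taught to recognize one more sentinel.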

In order to support CDATA sections, I created a factory function called CDATA, extended the ElementTree class and changed the _write function to handle the CDATA elements.

This still doesn't help if you want to parse an XML document with CDATA sections and then output it again with the CDATA sections, but it at least allows you to create XML with CDATA sections programmatically, which is what I needed to do.

The implementation seems to work with both ElementTree and cElementTree.

import elementtree.ElementTree as etree
#~ import cElementTree as etree

def CDATA(text=None):
    element = etree.Element(CDATA)
    element.text = text
    return element

class ElementTreeCDATA(etree.ElementTree):
    def _write(self, file, node, encoding, namespaces):
        if node.tag is CDATA:
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            etree.ElementTree._write(self, file, node, encoding, namespaces)

if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = ElementTreeCDATA(e)
    et.write(sys.stdout, "utf-8")

answered Oct 6, 2008 at 16:41

elifiner


4 Comments

Treviño Over a year ago

This does not seem possible anymore since the write method is not there, and the _serialize* functions are static

elwc Over a year ago

What should I do since I can't use _write? So that means I can't use xml.elementtree? This is terrible.

jsbueno Over a year ago

This recipe won't work for Python 2.7 or 3.2 (and 3.3) - check @amaury's answer below. Basically, the new ElementTree does not have a "_write" method that can be overridden anymore.

coderek Over a year ago

There is a CDATA element for etree you can use directly. lxml.de/api/lxml.etree.CDATA-class.html

unutbu · Accepted Answer · 2014-12-01 12:30:26Z

21

lxml supports CDATA and has an API similar to ElementTree's.

edited Dec 1, 2014 at 12:30

unutbu


answered Oct 14, 2008 at 17:43

iny


2 Comments

Adam Bethke Over a year ago

This is huge from the "don't roll your own XML parser" perspective.

Peter Moore Over a year ago

@iny I think your lxml link is broken.

Amaury · Accepted Answer · 2012-01-18 18:03:56Z

13

Here is a variant of gooli's solution that works for Python 3.2:

import xml.etree.ElementTree as etree

def CDATA(text=None):
    element = etree.Element('![CDATA[')
    element.text = text
    return element

etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml


if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = etree.ElementTree(e)
    et.write(sys.stdout.buffer.raw, "utf-8")

answered Jan 18, 2012 at 18:03

Amaury


3 Comments

jsbueno Over a year ago

This should work for Python 2.7 as well, where the original recipe does not. I just came up with another approach that is more complicated than this.

Patrick Over a year ago

This needs updating to add the encoding kwarg to the _serialize_xml def

Kevin Over a year ago

For Python 2.7, add an encoding argument to the serialize signature: change def _serialize_xml(write, elem, qnames, namespaces): to def _serialize_xml(write, elem, encoding, qnames, namespaces): and change write, elem, qnames, namespaces) to write, elem, encoding, qnames, namespaces). Also change et.write(sys.stdout.buffer.raw, "utf-8") to et.write(sys.stdout, "utf-8").

Kamil · Accepted Answer · 2019-10-08 08:59:46Z

Solution:

import xml.etree.ElementTree as ElementTree

def CDATA(text=None):
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element

ElementTree._original_serialize_xml = ElementTree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            # _escape_cdata lives in the ElementTree module, so qualify it
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)

ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml

if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = ElementTree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    # write the tree to stdout as text to see the CDATA section
    ElementTree.ElementTree(e).write(sys.stdout, encoding="unicode")

Background:

I don't know whether the previously proposed code ever worked well, or whether the ElementTree module has been updated since, but I ran into problems using this trick:

etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml

The problem with this approach is that, after handling the CDATA special case, the serializer goes on to treat the element as a normal tag again. I was getting something like:

<textContent>
<![CDATA[this was the code I wanted to put inside of CDATA]]>
<![CDATA[>this was the code I wanted to put inside of CDATA</![CDATA[>
</textContent>

That, of course, causes plenty of errors. So why was it happening?

The answer is in this little guy:

return etree._original_serialize_xml(write, elem, qnames, namespaces)

We don't want to run the element through the original serialize function again once we have caught our CDATA and written it out. The original function should be called only when the element is not CDATA, so that call belongs in an "else" branch.

Moreover, in my version of the ElementTree module, the serialize function insisted on a "short_empty_elements" argument. So the most recent version I would recommend looks like this (also handling "tail"):

from xml.etree import ElementTree

# To test this, create a testing.xml file in the same folder as the script
xmlParsedWithET = ElementTree.parse("testing.xml")
root = xmlParsedWithET.getroot()

def CDATA(text=None):
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element

ElementTree._original_serialize_xml = ElementTree._serialize_xml

def _serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs):

    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs)

ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml


text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = ElementTree.Element("data")
cdata = CDATA(text)
root.append(cdata)

# tests (Element.getchildren() was removed in Python 3.9, so index the element directly)
print(root)
print(root[0])
print(root[0].text + "\n\nyay!")

The output I got was:

<Element 'Database' at 0x10062e228>
<Element '![CDATA[' at 0x1021cc9a8>

<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>


yay!

I wish you the same result!

Thank you! Your solution works great for me in Python 3.4.3, and it's really interesting that you only posted it yesterday, and I need it today. Haven't tested in 3.5, but I guess it will break sooner or later still, probably in the next version. Sigh.
You are welcome. Please keep in mind that whenever you use ElementTree.parse, you get only the CDATA content (without the CDATA markers). In my code that is 'xmlParsedWithET = ElementTree.parse("testing.xml")'. I figured out that, by modifying the code just a little and using lxml, you can preserve our precious CDATA markers. Let me know if you are interested in that, or whether only standard libraries are OK for you.
I was writing a generator for my blog and had to assemble an Atom 1.0 feed. This is kind of a one-off task (if it breaks in the future, I can always use a 3.4 virtualenv), so a hack on the standard library is acceptable to me.

Dan Lenski · Accepted Answer · 2020-05-07 19:09:29Z

7

It's not possible AFAIK... which is a pity. Basically, ElementTree modules assume that the reader is 100% XML compliant, so it shouldn't matter if they output a section as CDATA or some other format that generates the equivalent text.

See this thread on the Python mailing list for more info. Basically, they recommend some kind of DOM-based XML library instead.
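The equivalence described here is easy to observe with a small round trip. This is a sketch assuming Python 3's xml.etree.ElementTree; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

source = "<data><![CDATA[1 < 2 & 3 > 2]]></data>"
element = ET.fromstring(source)

# The parser hands back plain text; the CDATA markers are not recorded.
parsed_text = element.text

# On output, the same characters are written with entity escapes instead.
reserialized = ET.tostring(element, encoding="unicode")

print(parsed_text)
print(reserialized)
```

Both forms carry the same text for a conforming parser, which is why ElementTree does not keep track of which one the input used.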

edited May 7, 2020 at 19:09

answered Oct 6, 2008 at 16:21

Dan Lenski


4 Comments

bortzmeyer Over a year ago

I would not call it "a pity". For the XML infoset (the content), there is no difference between "<![CDATA[ & ]]>" and "&amp;"... Most XML parsers won't even let you know what was in the original document.

Dan Lenski Over a year ago

That's true, but some data can be dumped and parsed much more efficiently in CDATA format. So it's a pain to not be able to tell an XML library to handle it in this way.

Rahul K P Over a year ago

The link seems like not available now.

Dan Lenski Over a year ago

Thanks. Replaced with Wayback Machine link.

Community · Accepted Answer · 2017-05-23 11:47:22Z

6

Actually, this code has a bug: it doesn't catch ]]> appearing in the data you are inserting as CDATA.

As per "Is there a way to escape a CDATA end token in xml?", you should break the data into two CDATA sections in that case, splitting the ]]> between the two.

Basically: data = data.replace("]]>", "]]]]><![CDATA[>")
(not necessarily correct, please verify)
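The replacement can be verified by wrapping a string that contains ]]> and parsing the result back. This is a minimal sketch; wrap_cdata and the sample string are hypothetical names chosen for illustration, and the parser used for the check is the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(data):
    """Wrap data in CDATA markup, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>"


tricky = "if (a[b[0]]>1) { x = '<tag>'; }"
wrapped = wrap_cdata(tricky)

# Parse the result back to check that the original text survives intact.
recovered = ET.fromstring("<data>" + wrapped + "</data>").text

print(wrapped)
print(recovered == tricky)
```

The first section ends with ]] and the second begins with >, so the terminator never appears whole inside either section.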

edited May 23, 2017 at 11:47

CommunityBot


answered Nov 26, 2008 at 14:28

Andraz Tori


Stas Chabarov · Accepted Answer · 2019-10-15 18:26:05Z

6

You can override ElementTree _escape_cdata function:

import xml.etree.ElementTree as ET

def _escape_cdata(text, encoding=None):
    try:
        if "&" in text:
            text = text.replace("&", "&amp;")
        # if "<" in text:
            # text = text.replace("<", "&lt;")
        # if ">" in text:
            # text = text.replace(">", "&gt;")
        return text
    except TypeError:
        raise TypeError(
            "cannot serialize %r (type %s)" % (text, type(text).__name__)
        )

ET._escape_cdata = _escape_cdata

Note that you may not need to pass the extra encoding parameter, depending on your library/Python version.

Now you can write CDATA into obj.text like:

root = ET.Element('root')
body = ET.SubElement(root, 'body')
body.text = '<![CDATA[perform extra angle brackets escape for this text]]>'
print(ET.tostring(root))

and get a clean CDATA node:

<root>
    <body>
        <![CDATA[perform extra angle brackets escape for this text]]>
    </body>
</root>

edited Oct 15, 2019 at 18:26

answered Oct 15, 2019 at 10:38

Stas Chabarov


3 Comments

mzjn Over a year ago

How exactly do I use this to output CDATA sections? What is "contrib version"?

Stas Chabarov Over a year ago

@mzjn thanks, edited. You can use it as you usually would when inserting text into an ET object, i.e. obj.text='<![CDATA[text]]>'. "contrib version" is a library version or the library of a specific Python version (I'm not sure exactly where the difference in the number of args comes from)

QuinnF Over a year ago

Instead of commenting out those lines, I would just add if text.startswith("<![CDATA[") and text.endswith("]]>"): return text as the first line. That way you don't mess up non-cdata entries
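QuinnF's suggestion could be sketched as a wrapper around the existing function instead of a replacement for it. This assumes Python 3, where the pure-Python serializer looks up the private ET._escape_cdata at call time; _escape_cdata_passthrough is a made-up name, and the extra arguments are forwarded so an older two-argument signature would also be accepted.

```python
import xml.etree.ElementTree as ET

# Keep a reference to the stock implementation so normal text is still escaped.
_original_escape_cdata = ET._escape_cdata


def _escape_cdata_passthrough(text, *args, **kwargs):
    # Leave text alone only when it is entirely wrapped in CDATA markup.
    if isinstance(text, str) and text.startswith("<![CDATA[") and text.endswith("]]>"):
        return text
    return _original_escape_cdata(text, *args, **kwargs)


ET._escape_cdata = _escape_cdata_passthrough

root = ET.Element("root")
ET.SubElement(root, "body").text = "<![CDATA[<b>bold</b> & more]]>"
ET.SubElement(root, "plain").text = "a < b"
result = ET.tostring(root, encoding="unicode")
print(result)
```

The check is deliberately naive: it does not look for a stray ]]> in the middle of the text, so the caller is still responsible for splitting that sequence.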

zlalanne · Accepted Answer · 2016-04-06 16:12:06Z

4

This ended up working for me in Python 2.7. Similar to Amaury's answer.

import xml.etree.ElementTree as ET

ET._original_serialize_xml = ET._serialize_xml


def _serialize_xml(write, elem, encoding, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("<%s%s]]>%s" % (elem.tag, elem.text, elem.tail or ""))
        return
    return ET._original_serialize_xml(
         write, elem, encoding, qnames, namespaces)
ET._serialize_xml = ET._serialize['xml'] = _serialize_xml

edited Apr 6, 2016 at 16:12

user2201041

answered Jun 5, 2013 at 15:37

zlalanne


Ryabchenko Alexander · Accepted Answer · 2018-09-13 15:15:02Z

For Python 3 and ElementTree you can use the following recipe:

import xml.etree.ElementTree as ET

ET._original_serialize_xml = ET._serialize_xml


def serialize_xml_with_CDATA(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == 'CDATA':
        write("<![CDATA[{}]]>".format(elem.text))
        return
    return ET._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)


ET._serialize_xml = ET._serialize['xml'] = serialize_xml_with_CDATA


def CDATA(text):
    element = ET.Element("CDATA")
    element.text = text
    return element


my_xml = ET.Element("my_name")
my_xml.append(CDATA("<p>some text</p>"))

tree = ET.ElementTree(my_xml)

If you need the XML as a str, you can use

ET.tostring(my_xml, encoding="unicode")

or the following hack (which is almost the same as the code inside tostring())

from io import BytesIO

fake_file = BytesIO()
tree.write(fake_file, encoding="utf-8", xml_declaration=True)
result_xml_text = str(fake_file.getvalue(), encoding="utf-8")

and get this result (indented here for readability; ElementTree does not add the whitespace itself)

<?xml version='1.0' encoding='utf-8'?>
<my_name>
  <![CDATA[<p>some text</p>]]>
</my_name>

user3155571 · Accepted Answer · 2014-01-03 01:04:33Z

2

I've discovered a hack to get CDATA to work using comments:

node.append(etree.Comment(' --><![CDATA[' + data.replace(']]>', ']]]]><![CDATA[>') + ']]><!-- '))

answered Jan 3, 2014 at 1:04

user3155571


johnpaultthomas · Accepted Answer · 2009-02-04 06:53:24Z

1

The DOM has (at least in Level 2) an interface CDATASection and an operation Document::createCDATASection. They are extension interfaces, supported only if an implementation supports the "xml" feature.

from xml.dom import minidom

my_xmldoc = minidom.parse(xmlfile)

cdata_node = my_xmldoc.createCDATASection(data)

Now you have a CDATA node; add it wherever you want.
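For completeness, here is a small self-contained sketch of the createCDATASection route that builds a document from scratch instead of parsing a file. The element name "data" and the sample text are assumptions for illustration.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("data")
doc.appendChild(root)

# createCDATASection returns a node that is attached like any other child.
cdata_node = doc.createCDATASection("<p>some text</p> & more")
root.appendChild(cdata_node)

xml_text = root.toxml()
print(xml_text)
```

Note that minidom raises ValueError when serializing a CDATA section whose data contains ]]>, so such data has to be split across two sections first.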

answered Feb 4, 2009 at 6:53

community wiki

johnpaultthomas


elwc · Accepted Answer · 2013-01-02 06:56:08Z

1

The accepted solution does not work with Python 2.7. However, there is another package called lxml which (though slightly slower) shares a largely identical syntax with xml.etree.ElementTree. lxml is able to both write and parse CDATA. Documentation here

answered Jan 2, 2013 at 6:56

elwc


Michael · Accepted Answer · 2012-05-03 22:42:32Z

Here's my version which is based on both gooli's and amaury's answers above. It works for both ElementTree 1.2.6 and 1.3.0, which use very different methods of doing this.

Note that gooli's does not work with 1.3.0, which seems to be the current standard in Python 2.7.x.

Also note that this version does not use the CDATA() method gooli used either.

import xml.etree.cElementTree as ET

class ElementTreeCDATA(ET.ElementTree):
    """Subclass of ElementTree which handles CDATA blocks reasonably"""

    def _write(self, file, node, encoding, namespaces):
        """This method is for ElementTree <= 1.2.6"""

        if node.tag == '![CDATA[':
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            ET.ElementTree._write(self, file, node, encoding, namespaces)

    def _serialize_xml(write, elem, qnames, namespaces):
        """This method is for ElementTree >= 1.3.0"""

        if elem.tag == '![CDATA[':
            write("\n<![CDATA[%s]]>\n" % elem.text)
        else:
            ET._serialize_xml(write, elem, qnames, namespaces)
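One way to avoid tracking each signature change might be to patch the module-level serializer and forward whatever extra arguments it receives. This is a sketch, assuming Python 3's xml.etree.ElementTree, where _serialize_xml, _serialize and _escape_cdata are private names that may change between releases; CDATA_TAG and _serialize_xml_with_cdata are illustrative names, not part of the library.

```python
import xml.etree.ElementTree as ET

CDATA_TAG = "![CDATA["  # sentinel tag, as in the answers above


def CDATA(text=""):
    element = ET.Element(CDATA_TAG)
    element.text = text
    return element


_original_serialize_xml = ET._serialize_xml


def _serialize_xml_with_cdata(write, elem, *args, **kwargs):
    # Extra positional and keyword arguments are forwarded untouched,
    # so the wrapper does not depend on the exact private signature.
    if elem.tag == CDATA_TAG:
        write("<![CDATA[%s]]>" % (elem.text or ""))
        if elem.tail:
            # One-argument call: assumes the Python 3 form of _escape_cdata.
            write(ET._escape_cdata(elem.tail))
        return
    return _original_serialize_xml(write, elem, *args, **kwargs)


ET._serialize_xml = ET._serialize["xml"] = _serialize_xml_with_cdata

root = ET.Element("data")
section = CDATA("<p>hi</p> & more")
section.tail = " tail & end"
root.append(section)
ET.SubElement(root, "plain").text = "a < b"
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the patch touches private functions, it is still a hack and could break in a future release.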

tom stratton · Accepted Answer · 2012-12-17 17:39:44Z

0

I got here looking for a way to "parse an XML with CDATA sections and then output it again with the CDATA sections".

I was able to do this (maybe lxml has been updated since this post?) with the following. It is a little rough, sorry ;-). Someone else may have a better way to find the CDATA sections programmatically, but I was too lazy.

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')  # my original xml was utf-8 and that was a lot of the problem
tree = etree.parse(ppath, parser)

for cdat in tree.findall('./ProjectXMPMetadata'):  # the tag where my CDATA lives
    cdat.text = etree.CDATA(cdat.text)

# other stuff here

tree.write(opath, encoding="UTF-8")

answered Dec 17, 2012 at 17:39

tom stratton


Benjamin Smus · Accepted Answer · 2020-06-30 18:40:31Z

Simple way of making .xml file with CDATA sections

The main idea is that we convert the element tree to a string and call unescape on it. Once we have the string, we use standard Python to write it to a file.

Based on: How to write unescaped string to a XML element with ElementTree?

Code that generates the XML file

import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# defining the tree structure
element1 = ET.Element('test1')
element1.text = '<![CDATA[Wired & Forbidden]]>'

# & and <> are in a weird format
string1 = ET.tostring(element1).decode()
print(string1)

# now they are not weird anymore
# more formally, we unescape '&amp;', '&lt;', and '&gt;' in a string of data
# from https://docs.python.org/3.8/library/xml.sax.utils.html#xml.sax.saxutils.unescape
string1 = unescape(string1)
print(string1)

element2 = ET.Element('test2')
element2.text = '<![CDATA[Wired & Forbidden]]>'
string2 = unescape(ET.tostring(element2).decode())
print(string2)

# make the xml file and open in append mode
with open('foo.xml', 'a') as f:
    f.write(string1 + '\n')
    f.write(string2)

Output foo.xml

<test1><![CDATA[Wired & Forbidden]]></test1>
<test2><![CDATA[Wired & Forbidden]]></test2>
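Calling unescape on the whole document also unescapes ordinary text, which can leave the file malformed if any element contains a literal < or &. A narrower variant, sketched here as an assumption and not as part of the original answer, restores only the spans that were written as CDATA; restore_cdata and ESCAPED_CDATA are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Matches the escaped form that ET.tostring produces for "<![CDATA[...]]>".
ESCAPED_CDATA = re.compile(r"&lt;!\[CDATA\[(.*?)\]\]&gt;", re.DOTALL)


def restore_cdata(serialized):
    """Unescape only the spans that were entered as CDATA markup."""
    def _restore(match):
        return "<![CDATA[" + unescape(match.group(1)) + "]]>"
    return ESCAPED_CDATA.sub(_restore, serialized)


root = ET.Element("root")
ET.SubElement(root, "plain").text = "a < b"
ET.SubElement(root, "raw").text = "<![CDATA[Wired & Forbidden]]>"

result = restore_cdata(ET.tostring(root, encoding="unicode"))
print(result)
```

Ordinary text keeps its escapes, so the result can still be parsed. Text that merely happens to contain the CDATA markers would be converted too, so this only suits documents you generate yourself.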

Diego Salguero · Accepted Answer · 2023-05-05 18:46:19Z

Combining this with unescape (https://wiki.python.org/moin/EscapingXml) is a very easy solution.

import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9
import xml.dom.minidom
from xml.sax.saxutils import unescape

m_encoding = 'UTF-8'

class Xml():

    def generate(self, xmlstring):
        root = ET.Element('info')
        ET.SubElement(root, "foo").text = "<![CDATA[{}]]>".format(xmlstring)

        dom = xml.dom.minidom.parseString(ET.tostring(root))
        xml_string = dom.toprettyxml()
        part1, part2 = xml_string.split('?>')

        # the with block closes the file, so no explicit close() is needed
        with open("xmls/file.xml", 'w', encoding="UTF-8") as xfile:
            file_parts = part1 + 'encoding="{}" standalone="yes"?>'.format(m_encoding) + part2
            xfile.write(unescape(file_parts, {"&apos;": "'", "&quot;": '"'}))
