I am attempting to find and replace values in XML, similar to a related question (Find and Replace XML Attributes by Indexing - Python), but for content contained within a CDATA string. Specifically, I would like to know whether the values inside the CDATA (the text of the 'td' subelements) can be found and replaced with new values via indexing. I want to replace the first and second values within the first set of 'td' subelements, and the second and third values within the second set. Below is the XML, the script I am using, and the desired output XML containing the new values:

The XML ("foo_bar_CDATA.xml"):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Overlay>
    <description>
    <![CDATA[
    <html>
    <head>
        <body>
            <div id="view">
                <div class="item">
                    <tr id="source">
                        <td class="raster">Source</td>
                        <td class="number">1800</td>
                        <td class="number">2100</td>
                    </tr>
                    <tr id="preview">
                        <td class="raster">Preview</td>
                        <td class="number">1100</td>
                        <td class="number">1500</td>
                    </tr>
                </div>
            </div>
        </body>
    </head>
    </html>
    ]]>
    </description>   
</Overlay></kml>

The script:

import lxml.etree as ET
xml = ET.parse("C:\\Users\\mdl518\\Desktop\\bar_foo_CDATA.xml")
tree=xml.getroot().getchildren()[0][1]

val_1 = 1900
val_2 = 2000
val_3 = 3000
val_4 = 4000

# Find and replace the "td" subelement attribute values with the new values (val_"x") 
for elem in tree.getiterator():
    if elem.text:
        elem.text=elem.text.replace('Source',val_1)
    if elem.text:
        elem.text=elem.text.replace('1800',val_2)
    if elem.text:
        elem.text=elem.text.replace('1100',val_3)
    if elem.text:
        elem.text=elem.text.replace('1500',val_4)
    print(elem.text)

    output = ET.tostring(tree, 
                 encoding="UTF-8",
                 method="xml", 
                 xml_declaration=True, 
                 pretty_print=True)

    print(output.decode("utf-8"))

The Desired Output XML:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Overlay>
    <description>
    <![CDATA[
    <html>
    <head>
        <body>
            <div id="view">
                <div class="item">
                    <tr id="source">
                        <td class="raster">1900</td>
                        <td class="number">2000</td>
                        <td class="number">2100</td>
                    </tr>
                    <tr id="preview">
                        <td class="raster">Preview</td>
                        <td class="number">3000</td>
                        <td class="number">4000</td>
                    </tr>
                </div>
            </div>
        </body>
    </head>
    </html>
    ]]>
    </description>   
</Overlay></kml>

My main issue is indexing the values correctly rather than hard-coding them; finding and replacing them by index would be ideal. The above approach appears viable for XML without CDATA strings, but I cannot determine how to correctly parse the CDATA content or properly write the XML to a file. Additionally, the opening and closing angle brackets (<, >) are being incorrectly written as &lt; and &gt; within the XML. Any assistance is most appreciated!

1 Answer

Since the CDATA is an HTML string, I would extract it from the XML, make changes to it, and then reinsert it into the XML:

from lxml import etree

doc = etree.parse("foo_bar_CDATA.xml")
desc = doc.xpath('//*[local-name()="description"]')[0]

# Pull the HTML string out of the XML and parse it on its own
cd = etree.fromstring(desc.text)

vals = ["1900", "2000", "3000", "4000"]
rems = ["Source", "1800", "1100", "1500"]
targets = cd.xpath('//tr//td')
for target in targets:
    if target.text in rems:
        target.text = vals[rems.index(target.text)]

# ... and back into the XML as CDATA
desc.text = etree.CDATA(etree.tostring(cd, encoding="unicode"))

print(etree.tostring(doc).decode())

The output should be your expected output.
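
On the indexing part of the question, here is a minimal sketch of one possible approach, not a tested solution for the full file. It assumes lxml, uses a trimmed-down stand-in for the HTML in the CDATA section, and introduces a hypothetical `updates` mapping from 0-based (row, cell) positions to new text:

```python
from lxml import etree

# Trimmed-down stand-in for the HTML held in the CDATA section
html = """<html><body>
<tr id="source"><td class="raster">Source</td><td class="number">1800</td><td class="number">2100</td></tr>
<tr id="preview"><td class="raster">Preview</td><td class="number">1100</td><td class="number">1500</td></tr>
</body></html>"""

cd = etree.fromstring(html)

# Hypothetical mapping: (row index, cell index) -> new text, both 0-based
updates = {(0, 0): "1900", (0, 1): "2000", (1, 1): "3000", (1, 2): "4000"}

rows = cd.xpath("//tr")
for (row_idx, cell_idx), new_text in updates.items():
    cells = rows[row_idx].xpath("./td")
    cells[cell_idx].text = new_text

# Wrapping in CDATA keeps "<" and ">" literal instead of "&lt;" / "&gt;"
desc = etree.Element("description")
desc.text = etree.CDATA(etree.tostring(cd, encoding="unicode"))
result = etree.tostring(desc, encoding="unicode")
print(result)
```

Addressing cells by position means the old values ("Source", "1800", ...) never have to be spelled out. The trade-off is that the code depends on the row and cell order staying the same.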

4 Comments

Thanks, @Jack, your update works beautifully! I did, however, add one item to my original post, pertaining to referencing namespaces... I regretfully neglected to include this, unaware of its significance. I am now attempting to tweak the 'cd' line, defining a namespace map as ns = {'kml': 'opengis.net/kml/2.2'} and using cd = ET.fromstring(tree.xpath('//kml:description')[0].text, namespaces=ns), but now get an "IndexError: list index out of range". I think this just needs a minor tweak and the full solution will be working - Thanks again!!
@mdl518 Ah, the dreaded namespaces... There are a couple of ways of handling them. I used one of them (local-name()) in the edits.
Thanks, @Jack, you are the MAN!! The updated solution is perfect, it even handles the dreaded namespaces no problem! Is there otherwise a way to reference the attribute text values (i.e. the "rems") via indexing as opposed to hard coding them into a list? I will otherwise confirm your updates as the correct solution, many thanks!
@mdl518 Yes, there is, but you should probably post it as a separate question, per SO policy.
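
For reference, the two namespace options mentioned in the comments can be sketched as follows, assuming lxml and a cut-down copy of the KML from the question. The prefix map goes to `xpath()` itself, and its URI has to match the `xmlns` declaration in the file exactly. An empty match list is one way to end up with an `IndexError` on `[0]`. The prefix name `kml` is arbitrary.

```python
from lxml import etree

# Cut-down version of the KML from the question
kml = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Overlay><description><![CDATA[<html><body/></html>]]></description></Overlay>
</kml>"""

root = etree.fromstring(kml)

# Option 1: ignore the namespace and match on the local name only
by_local_name = root.xpath('//*[local-name()="description"]')

# Option 2: bind a prefix to the full namespace URI and pass the map to xpath()
ns = {"kml": "http://www.opengis.net/kml/2.2"}
by_prefix = root.xpath("//kml:description", namespaces=ns)

# An unprefixed name only matches elements in no namespace, so this list is empty
unqualified = root.xpath("//description")

print(by_local_name[0].text)
print(len(by_prefix), len(unqualified))
```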
