Parsing an XML CDATA section and converting it to CSV using ElementTree in Python

Question

I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. I want to access only the text content between the TEXT tags. My problem is that I don't know how to access the CDATA content. In some DOCs, TEXT has an IMAGE child, and when I run my code it only parses the IMAGE tag; reading the resulting CSV file with pandas then shows NaN. I searched for CDATA but couldn't find any way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that deleted all of the TEXT content, including the CDATA section.

My XML pattern is as follows:

<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>

Here is my parsing code:

import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):

    # Parse XML file
    tree = ET.parse(os.path.join(folderpath, xmlfilename))
    root = tree.getroot()

    for elem in root.findall("DOC"):
        rows = []

        sentence = elem.find("TEXT")
        if sentence is not None:
            # .text only covers the text before the first child element
            sentence = re.sub('\n', '', sentence.text or '')
        rows.append(sentence)

        csvwriter.writerow(rows)
    csv_file.close()

I appreciate any help.

This XML structure does not particularly look suitable for CSV. There is no way for CSV files to contain NaN, so you are probably opening it in something braindead like Excel. Can you edit your question to show the expected output format? (And, looking at your code, probably consider whether that particular format makes sense.)
– tripleee, Commented Aug 19, 2021 at 11:44
You have another bug: you are appending to rows and writing out the progressively longer list of rows in each iteration through the loop. Probably just write the sentence as a new row?
– tripleee, Commented Aug 19, 2021 at 11:45
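
The "write the sentence as a new row" suggestion could look roughly like the sketch below. It is only an illustration under assumptions: the function name docs_to_csv, the SAMPLE document (the question's XML plus a second DOC) and the "text" header are made up for the example, and an in-memory buffer stands in for a real file opened with open(path, "w", newline="").

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's XML plus a second DOC without IMAGE.
SAMPLE = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT><![CDATA[more text]]></TEXT>
</DOC>
</root>"""


def docs_to_csv(root, out_file):
    """Write one CSV row per DOC that has a TEXT; return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["text"])  # hypothetical column name
    count = 0
    for doc in root.findall("DOC"):
        text_elem = doc.find("TEXT")
        if text_elem is None:
            continue  # skip DOCs without a TEXT element
        # Text directly inside TEXT: its own .text plus each child's .tail.
        parts = [text_elem.text or ""]
        parts.extend(child.tail or "" for child in text_elem)
        # Collapse newlines and repeated whitespace into single spaces.
        writer.writerow([" ".join("".join(parts).split())])
        count += 1
    return count


buffer = io.StringIO(newline="")
n_rows = docs_to_csv(ET.fromstring(SAMPLE), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each DOC produces exactly one row, and the caller owns the file handle, so the function does not need to close anything.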

balderman · Accepted Answer · 2021-08-19 11:55:28Z


My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child

The code below seems to work. It handles both cases: a TEXT with an IMAGE child and a TEXT with no IMAGE under it.

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <DOC>
      <TEXT>
         <IMAGE>/1379/791012/p18-1.jpg</IMAGE>
         <![CDATA[The section I want to access to]]>
      </TEXT>
      <TEXT>
         <![CDATA[more text]]>
      </TEXT>
   </DOC></root>'''

root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')

output

1) The section I want to access to
2) more text
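
Why this works: ElementTree does not expose CDATA as a separate node. The parser hands it over as ordinary character data, so it ends up in the .text of the enclosing element when it comes before the first child, or in the .tail of the preceding child when it follows one. The sketch below inspects both attributes and wraps the idea in a helper; the name direct_text is hypothetical and not part of ElementTree. Unlike the one-liner above, it assumes nothing about how many children TEXT has, and it tolerates a child whose .tail is None.

```python
import xml.etree.ElementTree as ET

text_elem = ET.fromstring(
    "<TEXT><IMAGE>/1379/791012/p18-1.jpg</IMAGE>"
    "<![CDATA[The section I want to access to]]></TEXT>"
)
image = text_elem[0]

print(repr(text_elem.text))  # None: nothing precedes the IMAGE child
print(repr(image.text))      # '/1379/791012/p18-1.jpg'
print(repr(image.tail))      # 'The section I want to access to'


def direct_text(elem):
    """Join the text sitting directly inside elem, skipping the children's own text."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


result = direct_text(text_elem)
print(result)
```

The same helper returns 'more text' for a TEXT element that contains only a CDATA section, because in that case the content lands in .text.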
