I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. I want to extract only the text content inside the TEXT tags. My problem is that I don't know how to access CDATA content. In some DOCs, TEXT has an IMAGE child; when I run my code, it only picks up the IMAGE tag, and I get NaN when I read the resulting CSV file with pandas. I searched for information on CDATA, but I couldn't find any way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that removed all of the TEXT content, including the CDATA section.

My XML pattern is as follows:

<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>

Here is my parsing code:

import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):

  # Parse the XML file
  tree = ET.parse(os.path.join(folderpath, xmlfilename))
  root = tree.getroot()

  for elem in root.findall("DOC"):
    rows = []

    sentence = elem.find("TEXT")
    if sentence is not None:
      # .text only holds the text that comes before the first child element
      sentence = re.sub('\n', '', sentence.text or '')
    rows.append(sentence)

    # One row per DOC
    csvwriter.writerow(rows)
  csv_file.close()

I appreciate any help.

  • This XML structure does not particularly look suitable for CSV. There is no way for CSV files to contain NaN so probably you are opening it in something braindead like Excel. Can you edit your question to show the expected output format? (And, looking at your code, probably consider whether that particular format makes sense.) Commented Aug 19, 2021 at 11:44
  • You have another bug: you are appending to rows and writing out the progressively longer list of rows in each iteration through the loop. Probably just write the sentence as a new row? Commented Aug 19, 2021 at 11:45

1 Answer

My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child

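The code below seems to work. It handles both cases: a TEXT with an IMAGE child and a TEXT without one. ElementTree stores text that follows a child element in that child's tail attribute, so the CDATA content ends up in the tail of IMAGE rather than in the text of TEXT.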

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <DOC>
      <TEXT>
         <IMAGE>/1379/791012/p18-1.jpg</IMAGE>
         <![CDATA[The section I want to access to]]>
      </TEXT>
      <TEXT>
         <![CDATA[more text]]>
      </TEXT>
   </DOC></root>'''

root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')

output

1) The section I want to access to
2) more text
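
The same idea can be carried through to the CSV step from the question. The following is only a sketch under some assumptions: the helper names direct_text and write_text_rows are made up for illustration, the XML is passed in as a string rather than read from a folder, and runs of whitespace inside the text are collapsed to single spaces. It gathers the element's own text plus the tail of every child, so it does not depend on IMAGE being the first or only child.

```python
import csv
import io
import xml.etree.ElementTree as ET


def direct_text(elem):
    """Return the text directly inside elem, ignoring the text of child elements.

    Hypothetical helper: ElementTree keeps text before the first child in
    elem.text and text after each child's closing tag in that child's .tail.
    """
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Collapse newlines and repeated spaces into single spaces
    return " ".join(" ".join(parts).split())


def write_text_rows(xml_source, csv_file):
    """Write one CSV row per TEXT element found under each DOC (hypothetical helper)."""
    writer = csv.writer(csv_file)
    root = ET.fromstring(xml_source)
    for text_elem in root.iterfind("./DOC/TEXT"):
        writer.writerow([direct_text(text_elem)])


sample = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT><![CDATA[more text]]></TEXT>
</DOC>
</root>"""

# An in-memory buffer stands in for the real CSV file here
buffer = io.StringIO()
write_text_rows(sample, buffer)
csv_output = buffer.getvalue()
print(csv_output)
```

With a real file, opening it with open(path, "w", newline="") and passing that object in place of the buffer should behave the same way. Rows for TEXT elements with no direct text come out as empty strings, which pandas would still read back as NaN.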