I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. From each DOC I want only the text content between the TEXT tags. My problem is that I don't know how to access the CDATA content. In some DOCs, TEXT has an IMAGE child; when I run my code on those, it seems to stop at the IMAGE tag, and I get NaN when I read the resulting CSV file with pandas. I searched for information about CDATA, but I could not find any way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that deleted all of the TEXT content, including the CDATA section.
My XML pattern is as follows:
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>
And here is my parsing code:
import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):
    rows = []
    # Parse XML file
    tree = ET.parse(os.path.join(folderpath, xmlfilename))
    root = tree.getroot()
    for elem in root.findall("DOC"):
        rows = []
        sentence = elem.find("TEXT")
        if sentence is not None:
            # Strip newlines from the text of the TEXT element
            sentence = re.sub('\n', '', sentence.text)
            rows.append(sentence)
        csvwriter.writerow(rows)
    csv_file.close()
I appreciate any help.
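A note on what seems to be happening, as far as I understand ElementTree: CDATA is not exposed as a separate node. The parser merges it into ordinary character data, and any text that comes after a child element is stored on that child's `tail`, not on the parent's `text`. Below is a minimal sketch based on the XML pattern above; `SAMPLE` and the helper name `text_without_image` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the XML pattern in the question.
SAMPLE = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>"""


def text_without_image(text_elem):
    """Collect the character data of TEXT, skipping the IMAGE content.

    Hypothetical helper, not part of ElementTree.
    """
    parts = [text_elem.text or ""]
    for child in text_elem:
        if child.tag != "IMAGE":
            # Keep the text of any non-IMAGE child.
            parts.append("".join(child.itertext()))
        # Text that follows a child (here, the CDATA) lives on its tail.
        parts.append(child.tail or "")
    # Collapse newlines and repeated whitespace into single spaces.
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
text_elem = root.find("DOC/TEXT")
print(repr(text_elem.text))                # whitespace before <IMAGE> only
print(repr(text_elem.find("IMAGE").tail))  # contains the CDATA text
print(text_without_image(text_elem))
```

This would also explain why deleting IMAGE removed the CDATA text: the tail belongs to the removed element, so it goes away together with it.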
rows and writing out the progressively longer list of rows in each iteration through the loop. Probably just write the sentence as a new row?
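The suggestion in this comment could look roughly like the sketch below: each sentence is written as its own one-column row as soon as it is extracted, with no list kept between iterations. The function name `write_sentences`, the in-memory buffers, and the sample string are assumptions for illustration, not the original code.

```python
import csv
import io
import xml.etree.ElementTree as ET


def write_sentences(xml_source, csvwriter):
    """Write one CSV row per DOC that has a TEXT element; return the row count."""
    root = ET.parse(xml_source).getroot()
    count = 0
    for doc in root.findall("DOC"):
        text_elem = doc.find("TEXT")
        if text_elem is None:
            continue  # skip DOCs without TEXT instead of writing an empty row
        # Text after a child such as IMAGE is stored on that child's tail.
        parts = [text_elem.text or ""]
        parts.extend(child.tail or "" for child in text_elem)
        sentence = " ".join("".join(parts).split())
        csvwriter.writerow([sentence])
        count += 1
    return count


# Usage with in-memory buffers standing in for real files.
xml_data = io.StringIO(
    "<root><DOC><TEXT><IMAGE>a.jpg</IMAGE>"
    "<![CDATA[Hello world]]></TEXT></DOC><DOC/></root>"
)
out = io.StringIO()
written = write_sentences(xml_data, csv.writer(out))
print(written, repr(out.getvalue()))
```

Closing the CSV file is left to the caller here (for example with a `with open(...)` block), so the function can be called once per XML file on the same writer.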