I have a txt file containing more than 100 thousand lines, and for each line I want to create an XML tree. However, all lines should share the same root.

Here is the txt file:

LIBRARY:
1,1,1,1,the
1,2,1,1,world
2,1,1,2,we
2,5,2,1,have
7,3,1,1,food

The desired output:

<LIBRARY>
    <BOOK ID ="1">
        <CHAPTER ID ="1">
            <SENT ID ="1">
                <WORD ID ="1">the</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
    <BOOK ID ="1">
        <CHAPTER ID ="2">
            <SENT ID ="1">
                <WORD ID ="1">world</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
    <BOOK ID ="2">
        <CHAPTER ID ="1">
            <SENT ID ="1">
                <WORD ID ="2">we</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
    <BOOK ID ="2">
        <CHAPTER ID ="5">
            <SENT ID ="2">
                <WORD ID ="1">have</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
    <BOOK ID ="7">
        <CHAPTER ID ="3">
            <SENT ID ="1">
                <WORD ID ="1">food</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>

I use ElementTree to convert the txt file to an XML file. This is the code I run:

def expantree():
  lines = txtfile.readlines()
  for line in lines:
    split_line = line.split(',')
    BOOK.set( 'ID ', split_line[0])
    CHAPTER.set( 'ID ', split_line[1])
    SENTENCE.set( 'ID ', split_line[2])
    WORD.set( 'ID ', split_line[3])
    WORD.text = split_line[4]
    tree = ET.ElementTree(Root)
    tree.write(xmlfile)

The code runs, but I didn't get the desired output. I got the following:

<LIBRARY>
    <BOOK ID ="1">
        <CHAPTER ID ="1">
            <SENT ID ="1">
                <WORD ID ="1">the</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>
<LIBRARY>
    <BOOK ID ="1">
        <CHAPTER ID ="2">
            <SENT ID ="1">
                <WORD ID ="1">world</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>
<LIBRARY>
    <BOOK ID ="2">
        <CHAPTER ID ="1">
            <SENT ID ="1">
                <WORD ID ="2">we</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>
<LIBRARY>
    <BOOK ID ="2">
        <CHAPTER ID ="5">
            <SENT ID ="2">
                <WORD ID ="1">have</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>
<LIBRARY>
    <BOOK ID ="7">
        <CHAPTER ID ="3">
            <SENT ID ="1">
                <WORD ID ="1">food</WORD>
            </SENT>
        </CHAPTER>
    </BOOK>
</LIBRARY>

How can I unify the tree root, so that instead of many root tags I get a single root tag?

3 Answers

Another option, perhaps more succinct, is as follows:

from xml.etree import ElementTree as ET
import io

# Set up the test input
inbuf = io.StringIO(''.join(['LIBRARY:\n', '1,1,1,1,the\n', '1,2,1,1,world\n',
                             '2,1,1,2,we\n', '2,5,2,1,have\n', '7,3,1,1,food\n']))

tags = ['BOOK', 'CHAPTER', 'SENT', 'WORD']
with inbuf as infile, io.StringIO() as xmlfile:
    # The first line names the root element, e.g. "LIBRARY:"
    root_name = infile.readline().rstrip(':\n')
    tree = ET.ElementTree(ET.Element(root_name))
    root = tree.getroot()
    for line in infile:
        values = line.rstrip('\n').split(',')
        if len(values) != 5:  # skip blank or malformed lines
            continue
        # Nest BOOK > CHAPTER > SENT > WORD under the shared root
        parent = root
        for tag, value in zip(tags, values):
            parent = ET.SubElement(parent, tag, {'ID': value})
        # After the loop, parent is the WORD element
        parent.text = values[4]
    tree.write(xmlfile, encoding='unicode', xml_declaration=True)
    xmlfile.seek(0)
    for line in xmlfile:
        print(line, end='')

This code constructs an ElementTree from the input data and writes it to a file-like object as XML. Because every line's elements are attached under the same root element, the output contains a single root tag. The code will work either with the standard Python xml.etree package or with lxml. It was tested using Python 3.3.

Here is a suggestion that uses lxml (tested with Python 2.7). The code can easily be adapted to work with ElementTree too, but it's harder to get nice pretty-printed output (see https://stackoverflow.com/a/16377996/407651 for some more on this).

The input file is library.txt and the output file is library.xml.

from lxml import etree

lines = open("library.txt").readlines()
library = etree.Element('LIBRARY')   # The root element 

# For each line with data in the input file, create a BOOK/CHAPTER/SENT/WORD structure
for line in lines:
    values = line.split(',')
    if len(values) == 5:
        book = etree.SubElement(library, "BOOK")
        book.set("ID", values[0])
        chapter = etree.SubElement(book, "CHAPTER")
        chapter.set("ID", values[1])
        sent = etree.SubElement(chapter, "SENT")
        sent.set("ID", values[2])
        word = etree.SubElement(sent, "WORD")
        word.set("ID", values[3])
        word.text = values[4].strip()

etree.ElementTree(library).write("library.xml", pretty_print=True)
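For readers who prefer to stay with the standard library, the following is a minimal sketch of the same idea using xml.etree.ElementTree. It assumes Python 3.9 or later, where ET.indent is available for pretty-printing. The function name build_library and the TAGS tuple are illustrative names, not part of the answer above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical helper names; the tag order follows the question's columns.
TAGS = ("BOOK", "CHAPTER", "SENT", "WORD")

def build_library(lines):
    """Build one LIBRARY root holding a BOOK/CHAPTER/SENT/WORD chain per data line."""
    library = ET.Element("LIBRARY")
    for line in lines:
        values = line.strip().split(",")
        if len(values) != 5:      # skips the "LIBRARY:" header and blank lines
            continue
        parent = library
        for tag, value in zip(TAGS, values):
            parent = ET.SubElement(parent, tag, {"ID": value})
        parent.text = values[4]   # parent is now the WORD element
    return library

sample = io.StringIO("LIBRARY:\n1,1,1,1,the\n1,2,1,1,world\n")
library = build_library(sample)
ET.indent(library, space="    ")  # in-place pretty-printing, Python 3.9+
pretty = ET.tostring(library, encoding="unicode")
print(pretty)
```

On older Python versions ET.indent does not exist, so the lxml approach above or the linked answer remains the simpler route to indented output.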

1 Comment

I upvoted, but since SubElement allows attributes to be set as in book = etree.SubElement(library, 'BOOK', ID=values[0]), the set() operations can be eliminated.
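As a small illustration of the point in this comment, the sketch below compares three ways of attaching the ID attribute. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so the snippet runs without extra packages); the SubElement signature shown here is the same in both.

```python
import xml.etree.ElementTree as ET

values = ["1", "2", "1", "1", "world"]

# Variant A: create the element, then set the attribute separately.
root_a = ET.Element("LIBRARY")
book_a = ET.SubElement(root_a, "BOOK")
book_a.set("ID", values[0])

# Variant B: pass the attribute as a keyword argument.
root_b = ET.Element("LIBRARY")
book_b = ET.SubElement(root_b, "BOOK", ID=values[0])

# Variant C: pass a dict, useful when the attribute name
# is not a valid Python identifier.
root_c = ET.Element("LIBRARY")
book_c = ET.SubElement(root_c, "BOOK", {"ID": values[0]})

xml_a = ET.tostring(root_a, encoding="unicode")
xml_b = ET.tostring(root_b, encoding="unicode")
xml_c = ET.tostring(root_c, encoding="unicode")
print(xml_a)
```

All three variants serialize to the same XML, so the choice is mostly a matter of readability.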

One method would be to create the full tree and print it. I used the following code:

from lxml import etree as ET

def create_library(lines):
    library = ET.Element('LIBRARY')
    for line in lines:
        split_line = line.strip().split(',')
        if len(split_line) != 5:  # skip the "LIBRARY:" header and blank lines
            continue
        library.append(create_book(split_line))
    return library

def create_book(split_line):
    book = ET.Element('BOOK',ID=split_line[0])
    book.append(create_chapter(split_line))
    return book

def create_chapter(split_line):
    chapter = ET.Element('CHAPTER',ID=split_line[1])
    chapter.append(create_sentence(split_line))
    return chapter

def create_sentence(split_line):
    sentence = ET.Element('SENT',ID=split_line[2])
    sentence.append(create_word(split_line))
    return sentence

def create_word(split_line):
    word = ET.Element('WORD',ID=split_line[3])
    word.text = split_line[4].strip()  # drop the trailing newline
    return word

Then your code to create the file would look like:

def expantree():
    lines = txtfile.readlines()
    library = create_library(lines)
    ET.ElementTree(library).write(xmlfile)

If you don't want to load the entire tree in memory (you mentioned there are more than 100 thousand lines), you can write the opening <LIBRARY> tag manually, write each book one at a time, then write the closing </LIBRARY> tag. In this case your code would look like:

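def expantree():
    # xmlfile is the output path; binary mode because ET.tostring returns bytes
    with open(xmlfile, 'wb') as f:
        f.write(b'<LIBRARY>')
        for line in txtfile:  # iterate lazily instead of calling readlines()
            split_line = line.strip().split(',')
            if len(split_line) != 5:  # skip the "LIBRARY:" header and blank lines
                continue
            f.write(ET.tostring(create_book(split_line)))
        f.write(b'</LIBRARY>')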

I don't have that much experience with lxml, so there may be more elegant solutions, but both of these work.
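The low-memory idea can also be sketched with only the standard library. This is a minimal sketch under stated assumptions, not a tested replacement for the code above: stream_library and TAGS are hypothetical names, and both arguments are assumed to be open text-mode file objects.

```python
import io
import xml.etree.ElementTree as ET

TAGS = ("BOOK", "CHAPTER", "SENT", "WORD")

def stream_library(src, dst):
    """Read comma-separated lines from src and write one LIBRARY document to dst.

    src and dst are open text-mode file objects; only one BOOK subtree
    is held in memory at a time. Returns the number of books written.
    """
    dst.write("<LIBRARY>\n")
    count = 0
    for line in src:               # iterates lazily, no readlines()
        values = line.strip().split(",")
        if len(values) != 5:       # skips the "LIBRARY:" header and blank lines
            continue
        book = ET.Element(TAGS[0], ID=values[0])
        parent = book
        for tag, value in zip(TAGS[1:], values[1:4]):
            parent = ET.SubElement(parent, tag, ID=value)
        parent.text = values[4]    # parent is now the WORD element
        dst.write(ET.tostring(book, encoding="unicode") + "\n")
        count += 1
    dst.write("</LIBRARY>\n")
    return count

# Small in-memory demonstration; real use would pass opened files instead.
src = io.StringIO("LIBRARY:\n1,1,1,1,the\n2,5,2,1,have\n")
dst = io.StringIO()
written = stream_library(src, dst)
result = dst.getvalue()
print(result)
```

Because the root tags are written by hand, each BOOK element can be discarded as soon as it is serialized, which keeps memory use roughly constant regardless of the number of input lines.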
