Replacing and adding text in text files in Python

Question

I need to convert .ass subtitles file into .xml file. So far I did it by hand, but I have to do more and more of it.

That's how process looks like:

Input .ass file:

Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor says something
Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says something
etc.

Output .xml file:

<p begin="00:00:08.03" end="00:00:10.57">Actor says something</p>
<p begin="00:00:11.28" end="00:00:21.05">Actor says something</p>
etc.

I dodn't know how to solve this task.

I think you can use the csv module to parse the input (provided you remove the "Dialogue" thing at the beginning), and then use an xml emitter to get your output (xml.dom.minidom should do the trick). — Thomas Orozco
– Thomas Orozco, Commented Aug 17, 2012 at 11:32
@ThomasOrozco: You don't even need to remove the "Dialogue". — Tim Pietzcker
– Tim Pietzcker, Commented Aug 17, 2012 at 11:44

sloth · Accepted Answer · 2012-08-17 11:57:59Z

First, you should extract the relevant information from your source file. As the data is ,-seperated, you can use the python csv module or do a simple split(',').

This is an example method of how it could look like:

def extract(source):
    for line in iter(source):
        _, start, end, _, _, _, _, _, _, text = line.strip().split(',', 9)
        yield start, end, text

The next step if to convert your extracted data to the desired xml format. A function that plays well with the data from the first method could look like this (using simple string formatting):

xml = '<p begin="{start}" end="{end}">{text}</p>'
def to_xml(start, end, text):
    return xml.format(start=start, end=end, text=text)

Finally, opening the files and use the methods to write your output:

with open('input.ass') as infile, open('output.xml', 'w') as outfile:
    for start, end, text in extract(infile):
        outfile.write(to_xml(start, end, text) + '\n')

While you could of course make this smaller (less LOC), it is a readable approach IMHO.

dugres · Accepted Answer · 2012-08-17 11:51:05Z

0

src = [
'Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor says something',
'Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says something',
]
tpl = '<p begin="0%s" end="0%s">%s</p>'
for i in src:
    fields = i.split(',')
    start, end, txt = fields[1], fields[2], fields[-1]
    print tpl % (start, end, txt)

answered Aug 17, 2012 at 11:51

dugres

13.2k8 gold badges48 silver badges52 bronze badges

1 Comment

Tim Pietzcker Over a year ago

What if the subtitle itself contains a comma?

Tim Pietzcker · Accepted Answer · 2012-08-17 11:53:52Z

0

Quick and dirty:

>>> subs = """Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor s
ays something, then some more
... Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says someth
ing"""
>>> for line in subs.split("\n"):
...     print('<p begin="{0[1]}" end="{0[2]}">{0[9]}</p>'.format(
...            line.split(",", 9))) # Split no more than 9 times
...
<p begin="0:00:08.03" end="0:00:10.57">Actor says something, then some more</p>
<p begin="0:00:11.28" end="0:00:21.05">Actor says something</p>

edited Aug 17, 2012 at 11:53

answered Aug 17, 2012 at 11:48

Tim Pietzcker

337k59 gold badges520 silver badges572 bronze badges

3 Comments

Dmitry Zagorulkin Over a year ago

this is not will be parsed because you did't add root element.

Tim Pietzcker Over a year ago

@ZagorulkinDmitry: Of course, but his sample output didn't either, so I assumed he would wrap it in a root element himself. After all, that's trivial.

Dmitry Zagorulkin Over a year ago

thank you. i thought that topic starter did't know about xml validation.

Collectives™ on Stack Overflow

Replacing and adding text in text files in Python

3 Answers 3

Comments

1 Comment

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related