1

I need to convert .ass subtitle files into .xml files. So far I have done it by hand, but I have to do more and more of them.

This is what the process looks like:

Input .ass file:

Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor says something
Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says something
etc.

Output .xml file:

<p begin="00:00:08.03" end="00:00:10.57">Actor says something</p>
<p begin="00:00:11.28" end="00:00:21.05">Actor says something</p>
etc.

I don't know how to solve this task.

2
  • I think you can use the csv module to parse the input (provided you remove the "Dialogue" thing at the beginning), and then use an xml emitter to get your output (xml.dom.minidom should do the trick). Commented Aug 17, 2012 at 11:32
  • @ThomasOrozco: You don't even need to remove the "Dialogue". Commented Aug 17, 2012 at 11:44

3 Answers

1

First, you should extract the relevant information from your source file. As the data is comma-separated, you can use the Python csv module or do a simple split(',').

Here is an example of what that could look like:

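def extract(source):
    for line in source:
        # Skip headers and other lines that are not Dialogue events
        if not line.startswith('Dialogue:'):
            continue
        # Split at most 9 times so commas in the subtitle text are preserved
        _, start, end, _, _, _, _, _, _, text = line.strip().split(',', 9)
        yield start, end, text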
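
For comparison, here is a rough sketch of the csv-module route mentioned above. The function name extract_csv is made up for this illustration, and it assumes the subtitle text is always the tenth field of a Dialogue line, as in the sample input. Because csv.reader splits on every comma, the trailing fields are joined back together.

```python
import csv

def extract_csv(source):
    """Yield (start, end, text) for each Dialogue line, using csv.reader."""
    # QUOTE_NONE keeps quotation marks in the subtitle text as literal characters.
    reader = csv.reader(source, quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip headers, blank lines and anything that is not a Dialogue event.
        if len(fields) < 10 or not fields[0].startswith('Dialogue:'):
            continue
        # Commas inside the subtitle text produce extra fields; join them back.
        yield fields[1], fields[2], ','.join(fields[9:])
```

It can be dropped in wherever extract() is used, since it yields the same kind of tuples.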

The next step is to convert your extracted data to the desired XML format. A function that plays well with the data from the first function could look like this (using simple string formatting):

xml = '<p begin="{start}" end="{end}">{text}</p>'
def to_xml(start, end, text):
    return xml.format(start=start, end=end, text=text)
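
One caveat with plain string formatting is that subtitle text containing &, < or > would produce invalid XML. A possible variant, assuming the standard library's xml.sax.saxutils helpers are acceptable here (to_xml_escaped is a hypothetical name, not part of the code above):

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml_escaped(start, end, text):
    """Build a <p> element, escaping characters that are special in XML."""
    # quoteattr() escapes the value and adds the surrounding quotes itself.
    return '<p begin={0} end={1}>{2}</p>'.format(
        quoteattr(start), quoteattr(end), escape(text))
```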

Finally, open the files and use these functions to write your output:

with open('input.ass') as infile, open('output.xml', 'w') as outfile:
    for start, end, text in extract(infile):
        outfile.write(to_xml(start, end, text) + '\n')

While you could of course make this shorter (fewer LOC), it is a readable approach IMHO.
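
Note that the sample output in the question shows two-digit hours (00:00:08.03), while the .ass timestamps have a single hour digit and the code above copies them unchanged. If the padded form is required, a small helper along these lines might do (pad_hours is a hypothetical name, and it assumes timestamps always look like H:MM:SS.cc):

```python
def pad_hours(timestamp):
    """Turn '0:00:08.03' into '00:00:08.03' by zero-padding the hours field."""
    # Split only at the first colon so minutes and seconds stay untouched.
    hours, rest = timestamp.split(':', 1)
    return hours.zfill(2) + ':' + rest
```

It would be applied to start and end before they are passed to to_xml().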

0
src = [
'Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor says something',
'Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says something',
]
tpl = '<p begin="0%s" end="0%s">%s</p>'
for i in src:
    fields = i.split(',')
    start, end, txt = fields[1], fields[2], fields[-1]
    print(tpl % (start, end, txt))

1 Comment

What if the subtitle itself contains a comma?
0

Quick and dirty:

>>> subs = """Dialogue: 0,0:00:08.03,0:00:10.57,Default,,0000,0000,0000,,Actor says something, then some more
... Dialogue: 0,0:00:11.28,0:00:21.05,Default,,0000,0000,0000,,Actor says something"""
>>> for line in subs.split("\n"):
...     print('<p begin="{0[1]}" end="{0[2]}">{0[9]}</p>'.format(
...            line.split(",", 9))) # Split no more than 9 times
...
<p begin="0:00:08.03" end="0:00:10.57">Actor says something, then some more</p>
<p begin="0:00:11.28" end="0:00:21.05">Actor says something</p>

3 Comments

This will not be parsed because you didn't add a root element.
@ZagorulkinDmitry: Of course, but his sample output didn't either, so I assumed he would wrap it in a root element himself. After all, that's trivial.
Thank you. I thought that the topic starter didn't know about XML validation.
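
For anyone who does want a well-formed document, the root-element point from these comments could be sketched with xml.etree.ElementTree as follows. This assumes Python 3; build_document and the 'body' root tag are placeholders chosen for the illustration, since the question does not say what the root element should be.

```python
import xml.etree.ElementTree as ET

def build_document(cues, root_tag='body'):
    """Return an XML string with one <p> per (start, end, text) cue."""
    # root_tag is an assumed name; use whatever the target format requires.
    root = ET.Element(root_tag)
    for start, end, text in cues:
        p = ET.SubElement(root, 'p', begin=start, end=end)
        p.text = text  # ElementTree escapes &, < and > on serialization
    return ET.tostring(root, encoding='unicode')
```

The cues argument can be any iterable of (start, end, text) tuples, such as the result of splitting each line at most 9 times as shown above.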
