
I am doing some research where I have more than 25,000 reports in one large text file. Each report is delimited by "TEXTSTART[UNIQUE-ID]" and "TEXTEND".

So far I have succeeded in reading a single report (that is, the text between the identifiers) from the text file with this code:

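# Read the whole file into memory; "with" closes the file automatically.
with open("samples_combined_incomplete.txt", "r") as f:
    report = f.read()

rstart = "TEXTSTART"
rend = "TEXTEND"

# Take the text between the first TEXTSTART and the next TEXTEND.
# Note that the result still begins with "[UNIQUE-ID]".
a = report.split(rstart)[1].split(rend)[0]
print(a)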

My question is this: how can I divide the text document into uniquely identifiable substrings, based on TEXTSTART[UNIQUE-ID]? And how should the ID be returned?

I am just starting, so any advice on documentation, useful functions, etc. would be much appreciated.


Thank you, works like a charm! The IDs are a combination of numbers and characters FYI.

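import re

with open("samples_combined_incomplete.txt", "r") as f:
    report = f.read()

# Each match is an (id, text) tuple. The result is not named "dict",
# because that would shadow the built-in type.
matches = re.findall(r'TEXTSTART\[(.*?)\](.*?)TEXTEND', report, re.DOTALL)

# Print the first ten reports (fewer if the file holds fewer).
for match in matches[:10]:
    print(match)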

If I want to search within the containers for a specific keyword and have the keys returned, how could I do that?
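
One possible approach, sketched under a few assumptions: the `(id, text)` pairs from `re.findall` are turned into a dictionary, a plain case-insensitive substring match is good enough, and the sample data and the helper name `find_ids_with_keyword` are made up for illustration.

```python
import re

# Hypothetical sample data standing in for the real file contents.
report = (
    "TEXTSTART[A1]\nThe pump failed on Monday.\nTEXTEND\n"
    "TEXTSTART[B2]\nRoutine inspection, no issues.\nTEXTEND\n"
    "TEXTSTART[C3]\nPump replaced after failure.\nTEXTEND\n"
)

# Map each unique ID to the text of its report.
reports = dict(re.findall(r'TEXTSTART\[(.*?)\](.*?)TEXTEND', report, re.DOTALL))


def find_ids_with_keyword(reports, keyword):
    """Return the IDs of all reports whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [rid for rid, text in reports.items() if needle in text.lower()]


print(find_ids_with_keyword(reports, "pump"))
```

Because this is a substring match, "pump" would also hit words such as "pumped"; matching whole words only would probably need something like `re.search` with `\b` boundaries instead.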

    Have you considered regular expressions? (docs.python.org/2/library/re.html) Also, is each of these substrings on a new line? Commented Dec 9, 2012 at 15:53

1 Answer

import re
# Build an {id: text} dictionary from every TEXTSTART[id]...TEXTEND block.
print(dict(re.findall(r'TEXTSTART\[([^\]]+)\](.*?)TEXTEND', report, re.DOTALL)))

1 Comment

If the text spans multiple lines, I think this will need re.DOTALL to be specified as an option.
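
A small check of that point, using a made-up two-line report (the ID `X9` and the text are purely illustrative): without `re.DOTALL` the `.` does not match newlines, so a multi-line report is not found at all.

```python
import re

# Hypothetical report whose body spans several lines.
sample = "TEXTSTART[X9]\nline one\nline two\nTEXTEND"
pattern = r'TEXTSTART\[([^\]]+)\](.*?)TEXTEND'

# Without DOTALL, "." stops at the newline right after "]", so nothing matches.
without_flag = re.findall(pattern, sample)

# With DOTALL, "." also matches newlines and the whole body is captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [('X9', '\nline one\nline two\n')]
```

The captured text keeps its leading and trailing newlines; calling `.strip()` on it is one way to drop them if they are not wanted.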
