
Most of what I do involves writing simple parsing scripts that read search terms from one file and search another file line by line. Once a search term is found, the matching line (and sometimes the following line) is written to an output file. The code I use is rudimentary and likely crude.

#!/usr/bin/env python

data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")

output = open("output.txt", "w")

for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)

There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.

Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.

Thanks.


3 Answers


Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
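For reference, iterating over the file object itself already gives you lazy, line-at-a-time reading. A minimal sketch, where process is just a hypothetical stand-in for whatever you do with each line:

with open('db.txt', 'r') as f_in:
    for line in f_in:
        # The file object yields one line at a time, so memory use
        # stays flat no matter how large db.txt is.
        process(line)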

Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().

with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()

Now search_terms is a list with each line from data.txt as a string (but with the newline characters removed). And data.txt is closed, thanks to with.

In fact, I would do the same with the db.txt file.

with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()

Context managers are cool.

As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.

I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outermost loop usually only gets iterated once, so you might as well put the biggest thing there.

results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random.
            # Guard against a match on the last line, where there is
            # no following line to grab.
            following = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, following))

if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass

The nice thing about this approach is that each line in db.txt is checked once against each term in search_terms. However, the downside is that a line will be recorded once for each search term it contains, i.e., if it contains three search terms, that line will appear in your output.txt three times.
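If that duplication is unwanted, a minimal variant (assuming you only want each matching line recorded once) is to break out of the inner loop on the first match:

results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            following = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, following))
            break  # record this line once, even if other terms also match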

And all the files are magically closed.

Context managers are cool.

Good luck!


search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.

Searching line by line is not the most efficient approach, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you should sort the data.txt file and put it into some tree-like structure; it depends on the data inside.

You have to use seek to move the pointer back after using next.
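A minimal illustration of that seek point, assuming search_terms has already been read in as in the question:

db = open('db.txt', 'r')
for term in search_terms:
    db.seek(0)  # rewind to the start so each term scans the whole file
    for line in db:
        if term in line:
            print('Found {}'.format(term))
db.close()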

Probably the easiest way here is to generate two lists of lines and search using in, like:

db = open('db.txt').read().splitlines()
db_words = [x.split() for x in db]
data = open('data.txt').read().splitlines()
print('Lines in db {}'.format(len(db)))
for item in data:  # check each term from data.txt against the db words
    for words in db_words:
        if item in words:
            print("Found {}".format(item))


Your key issue is that you may be looping in the wrong order: in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop the db is at its end, there are no more lines to read, and no other term will ever be found.

Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).

So, for example, something like:

with open("data.txt", "r") as data:
    search_terms = data.read().splitlines()

missing_terms = set(search_terms)

with open("db.txt", "r") as db, open("output.txt", "w") as output:
    for line in db:
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                next_line = db.next()
                output.write(">" + head + "\n" + next_line)
                print("Found {}".format(term))
                break

if missing_terms:
    diagnose_not_found(missing_terms)

where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
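For instance, a minimal sketch of such a function (its body is an assumption; since the question asked for a "not found" file, this writes one):

def diagnose_not_found(missing_terms):
    # Write each term that was never found to its own file;
    # sorting just makes the output predictable.
    with open("not_found.txt", "w") as f_out:
        for term in sorted(missing_terms):
            f_out.write(term + "\n")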

There are assumptions embedded here, such as that you don't care whether some other search term appears in a line where you've already found one, or in the very next line. Removing those assumptions might take substantial work if they don't apply, and would require you to edit your question with a complete and unambiguous list of specifications.

If your db is actually small enough to fit comfortably in memory, slurping it all in as a list of lines once and for all would make more demanding specs easier to accommodate (with a list you can easily go back and forth, while iterating over a file only lets you go forward one line at a time). So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether the script needs to handle potentially humongous db files (say gigabyte-plus sizes that won't "comfortably fit in memory", depending on your platform, of course).

3 Comments

So I should for loop through the db first and then loop within the list? Ya, I've used "with" for automatic closure, just haven't fully incorporated it yet. (I tend to do a lot of copying and pasting from scripts I wrote long ago). Which set operators would help me with that? Thank you for your assistance.
@Nic, see my edit, hope it clears things up for you. The only set method really needed is discard (because I've chosen to keep a set of terms that weren't seen yet rather than a set of those that were seen -- in the latter case I'd be using add!-).
Thanks for all of your help. I've tested it out, and I'm pretty sure I've switched around the loops before. The problem that I always run into is that it finds the first term, writes it to the output file, but then the script finishes without finding anything else. EDIT: I think the "break" needed one more indent. It seems to be working now. Thank you.
