0

I have a text file called dna.txt which contains:

>A
ACG
>B
CCG
>C
CCG
>D
TCA

I want to write a Python program that compares every sequence after the first one to the first sequence (ACG), printing "conserved" if the sequences match and "not conserved" if they don't. I did it in an extremely inefficient way that only goes up to 30 sequences in the file, and I was wondering how a loop could be used to simplify this block of code. This is just a short sample of the inefficient method I used:

import linecache

sequence_1 = linecache.getline('dna.txt', 2)
sequence_2 = linecache.getline('dna.txt', 4)
sequence_3 = linecache.getline('dna.txt', 6)
# ...and so on: sequence_x = linecache.getline('dna.txt', 2 * x)
if sequence_2 == sequence_1:
    print("Conserved")
else:
    print("Not Conserved")
if sequence_3 == sequence_1:
    print("Conserved")
else:
    print("Not Conserved")
# ...and the same if/else repeated for every sequence_x

As you can obviously tell, this is probably the worst way of trying to accomplish what I'm trying to do. Help would be much appreciated, thanks!

1
  • Just to clarify: you want to match the first three-letter sequence against all the other sequences? What's the end result you need? The locations of the matches? Commented Aug 7, 2014 at 21:23

2 Answers

3

A loop would definitely make this more efficient. Here's a possibility:

f = open("dna.txt", "r")
f.readline()                        # Skip the first header line (">A").
sequence_1 = f.readline().strip()   # Get the actual sequence, without its trailing newline.
sequence_line = False      # This will switch back and forth to skip every other line.
for line in f:             # Iterate over all remaining lines.
    if sequence_line:      # Only test this every other line.
        # strip() so a last line with no trailing newline still compares equal.
        if line.strip() == sequence_1:
            print("Conserved")
        else:
            print("Not Conserved")
    sequence_line = not sequence_line   # Switch the boolean every iteration.
f.close()

The sequence_line boolean indicates whether we are looking at a sequence line or not. The line sequence_line = not sequence_line will flip it back and forth for every loop iteration, so it's True every other time. That's how we can skip every other line and only compare the ones we care about.
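To see the toggle in isolation, here's a small sketch that runs the same flag logic on an in-memory list of lines instead of a file. The function name compare_to_first and the sample list are just mine for illustration, and it assumes Python 3 and a strict header/sequence alternation. It strips each line so a final line without a newline still compares equal.

```python
def compare_to_first(lines):
    """Return a verdict for each sequence after the first.

    Assumes the lines strictly alternate: header, sequence, header, sequence, ...
    """
    it = iter(lines)
    next(it)                      # Skip the first header (">A").
    first = next(it).strip()      # The reference sequence.
    results = []
    sequence_line = False         # The next line is a header, so start at False.
    for line in it:
        if sequence_line:         # True only on every other line.
            if line.strip() == first:
                results.append("Conserved")
            else:
                results.append("Not Conserved")
        sequence_line = not sequence_line   # Flip the flag every iteration.
    return results


# The sample data from the question: none of B, C, D matches ACG.
sample = [">A\n", "ACG\n", ">B\n", "CCG\n", ">C\n", "CCG\n", ">D\n", "TCA\n"]
print(compare_to_first(sample))
```

With the question's data every comparison comes out "Not Conserved", since none of the later sequences is ACG.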

This method may not be as fast as a list comprehension, but it avoids holding the entire file in memory, which matters if the file is prohibitively large. If the file fits in memory, Emanuele Paolini's solution is probably going to be quite fast.
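If memory is the concern, a generator gives you the same one-line-at-a-time behaviour without the toggle. This is only a sketch, assuming Python 3, that every header line starts with '>', and that each sequence sits on a single line. The names iter_sequences and report are made up for the example, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io


def iter_sequences(handle):
    """Yield sequence lines one at a time, skipping '>' headers and blank lines."""
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line


def report(handle):
    """Lazily compare every sequence after the first to the first one."""
    sequences = iter_sequences(handle)
    first = next(sequences, None)
    if first is None:
        return                    # No sequences at all: nothing to compare.
    for seq in sequences:
        yield "Conserved" if seq == first else "Not Conserved"


# Stand-in for the file; with a real file you would write:
#   with open("dna.txt") as f:
#       for verdict in report(f):
#           print(verdict)
fake_file = io.StringIO(">A\nACG\n>B\nCCG\n>C\nACG\n>D\nTCA")
for verdict in report(fake_file):
    print(verdict)
```

Because it filters on the '>' prefix instead of counting lines, this version also tolerates a stray blank line, but it would still need changes for sequences that wrap over several lines.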


3 Comments

I'm not getting your use of sequence_line here. What's it used for?
btw I find it easier to use with open(<filename>) as f: instead of forcing myself to open() and close().
@KronoS I never remember to do that because I'm stuck with Python 2.4.3 at work.
1
with open("dna.txt") as f:
    # Keep only the sequence lines; header lines start with '>'.
    lines = [line.strip() for line in f if not line.startswith('>')]
for line in lines[1:]:
    if line == lines[0]:
        print("Conserved")
    else:
        print("Not Conserved")
