Python comparing strings within a conditional

Question

I have a text file called dna.txt which contains:

>A
ACG
>B
CCG
>C
CCG
>D
TCA

I want to write a Python program that compares every sequence after the first one (ACG) to that first sequence, printing "conserved" when they match and "not conserved" when they don't. My current approach is extremely inefficient and only handles up to 30 sequences in the file, so I was wondering how a loop could simplify this block of code. Here is a short sample of the inefficient method I used:

import linecache

f = open("dna.txt")
sequence_1 = linecache.getline('dna.txt', 2)
sequence_2 = linecache.getline('dna.txt', 4)
sequence_3 = linecache.getline('dna.txt', 6)
# ... and so on: sequence_x = linecache.getline('dna.txt', 2 * x)
f.close()
if sequence_2 == sequence_1:
    print("Conserved")
else:
    print("Not Conserved")
if sequence_3 == sequence_1:
    print("Conserved")
else:
    print("Not Conserved")
# ... repeated for every sequence_x

As you can obviously tell, this is probably the worst way of trying to accomplish what I'm trying to do. Help would be much appreciated, thanks!

Just to clarify: you want to match the first three-letter sequence against all the other sequences? What's the end result you need? The locations of the matches?
– James Mertz, commented Aug 7, 2014 at 21:23

TheSoundDefense · Accepted Answer · 2014-08-07 21:29:07Z

3

A loop would definitely make this more efficient. Here's a possibility:

f = open("dna.txt","r")
f.readline()               # Skip the first header line (">A").
sequence_1 = f.readline()  # Get the actual sequence.
sequence_line = False      # This will switch back and forth to skip every other line.
for line in f:             # Iterate over all remaining lines.
  if sequence_line:        # Only test this every other line.
    if line == sequence_1:
      print("Conserved")
    else:
      print("Not Conserved")
  sequence_line = not sequence_line   # Switch the boolean every iteration.
f.close()

The sequence_line boolean indicates whether we are looking at a sequence line or not. The line sequence_line = not sequence_line will flip it back and forth for every loop iteration, so it's True every other time. That's how we can skip every other line and only compare the ones we care about.

This method may not be as fast as a list comprehension, but it avoids loading the entire file into memory, which matters if the file is prohibitively large. If the file fits in memory, Emanuele Paolini's solution is probably going to be quite fast.
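As a rough sketch of the same streaming idea, the comparison can also key off the '>' prefix instead of a toggle, and strip trailing whitespace so that a last line without a line break still compares equal. The function name compare_to_first and the sample list are made up for illustration. The function accepts any iterable of lines, including an open file, so only one line is held at a time.

```python
def compare_to_first(lines):
    """Yield "Conserved" or "Not Conserved" for each sequence after the first.

    `lines` is any iterable of strings, for example an open file object.
    """
    first = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # Skip blank lines and '>' header lines.
        if first is None:
            first = line  # The first sequence becomes the reference.
        elif line == first:
            yield "Conserved"
        else:
            yield "Not Conserved"


# Hypothetical stand-in for the contents of dna.txt from the question.
sample = [">A\n", "ACG\n", ">B\n", "CCG\n", ">C\n", "CCG\n", ">D\n", "TCA"]
for result in compare_to_first(sample):
    print(result)
```

With a real file this would be called as compare_to_first(f) on the open file object. Unlike the toggle, it does not assume that headers and sequences strictly alternate.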

edited Aug 7, 2014 at 21:29

answered Aug 7, 2014 at 21:23

TheSoundDefense


3 Comments

James Mertz Over a year ago

I'm not getting your use of sequence_line here. What's it used for?

James Mertz Over a year ago

btw I find it easier to use with open(<filename>) as f: instead of forcing myself to open() and close().

TheSoundDefense Over a year ago

@KronoS I never remember to do that because I'm stuck with Python 2.4.3 at work.
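For reference, the with form James Mertz mentions might look roughly like the sketch below. It keeps the answer's toggle but wraps the file in a with block, which closes the file automatically. The with statement only exists from Python 2.5 onward (via a __future__ import there, by default from 2.6), so it is not an option on 2.4.3. The function name report, the strip() calls, and the temporary demo file are assumptions added for illustration, not part of the original answer.

```python
import os
import tempfile


def report(path):
    """Compare every sequence after the first to the first one."""
    results = []
    with open(path) as f:                  # Closed automatically when the block exits.
        f.readline()                       # Skip the first header line.
        sequence_1 = f.readline().strip()  # The reference sequence.
        sequence_line = False              # Same toggle as in the answer.
        for line in f:
            if sequence_line:
                if line.strip() == sequence_1:
                    results.append("Conserved")
                else:
                    results.append("Not Conserved")
            sequence_line = not sequence_line
    return results


# Demo on a temporary file standing in for dna.txt (contents are made up).
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write(">A\nACG\n>B\nCCG\n>C\nACG\n>D\nTCA")
tmp.close()
results = report(tmp.name)
os.remove(tmp.name)
print(results)
```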

Emanuele Paolini · Accepted Answer · 2014-08-07 21:27:34Z

1

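f = open("dna.txt")
# Keep only the sequence lines; header lines start with '>'.
lines = [line for line in f.readlines() if not line.startswith('>')]
f.close()
# Compare every later sequence against the first one.
for line in lines[1:]:
  if line == lines[0]:
    print("Conserved")
  else:
    print("Not Conserved")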

answered Aug 7, 2014 at 21:27

Emanuele Paolini
