
I am looking to convert a file to binary for a project, preferably using Python as I am most comfortable with it, though if walked-through, I could probably use another language.

Basically, I need this for a project I am working on where we want to store data using a DNA strand and thus need to store files in binary ('A's and 'T's = 0, 'G's and 'C's = 1)

Any idea how I could proceed? I did find that one could encode the file in base64 and then decode it, but that seems a bit inefficient, and the code that I have doesn't seem to work...

import base64
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
    print(encoded)

Also, I already have a program to do that simply with text. Any tips on how to improve it would also be appreciated!

import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16)).replace('b','') 
#removing the b that appears in the end for some reason
g = b.replace('1','G').replace('0','A')
print(g)

For example, for the text-to-DNA program: I input 'test' and expect the DNA sequence derived from its binary representation, which is 01110100011001010111001101110100. (I also print every intermediate conversion in the example so that it is easier to follow.)

>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G
  • If you're going from four characters to two, aren't you inevitably losing information? How can you get it back again? Commented Oct 26, 2015 at 16:26
  • Do you mean because we are using A and T for 0 and G and C for 1? Commented Oct 26, 2015 at 17:07
  • Well since the information at the beginning is in binary I don't see how that would make us lose information (I'm maybe not explaining it well...) Commented Oct 26, 2015 at 17:08
  • I'd say definitely not. Could you give a minimal reproducible example, including sample inputs and expected and actual outputs? Commented Oct 26, 2015 at 17:10
  • Edit the question, you donut! Commented Oct 26, 2015 at 17:17

2 Answers

2

So, thanks to @jonrshape and Sergey Vturin, I was finally able to achieve what I wanted! My program asks for a file, turns it into binary, and then gives me its equivalent in "DNA code" using pairs of bits (00 = A, 01 = T, 10 = G, 11 = C).

import binascii
from tkinter import filedialog

file_path = filedialog.askopenfilename()

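x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
        # decode() gives plain hex text; str() followed by replace("b","") would also delete the hex digit 'b'
        x += binascii.hexlify(chunk).decode('ascii')
# strip the '0b' prefix and pad to 4 bits per hex digit so leading zero bits are kept
b = bin(int(x, 16))[2:].zfill(len(x) * 4) if x else ""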
g = [b[i:i+2] for i in range(0, len(b), 2)]
dna = ""
for i in g:
    if i == "00":
        dna += "A"
    elif i == "01":
        dna += "T"
    elif i == "10":
        dna += "G"
    elif i == "11":
        dna += "C"
print(x) #hexdump
print(b) #converted to binary
print(dna) #converted to "DNA"
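
As a possible simplification, here is a minimal sketch (not the code I originally ran) that skips the hex and big-integer steps and maps each byte straight to four bases, and also decodes the result back. The helper names `bytes_to_dna` and `dna_to_bytes` are made up for illustration; the mapping is the same 00 = A, 01 = T, 10 = G, 11 = C as above.

```python
# Hypothetical helpers; bit pairs 00, 01, 10, 11 map to A, T, G, C.
BASES = "ATGC"  # the index of each letter is the value of its bit pair


def bytes_to_dna(data):
    """Encode bytes as DNA letters, four bases per byte, high bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(dna):
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"test")
print(encoded)                           # TCTATGTTTCACTCTA
print(dna_to_bytes(encoded) == b"test")  # True
```

For a file, the bytes would come from `open(file_path, 'rb').read()` instead of the literal `b"test"`. Because every byte always produces exactly four letters, leading zero bits cannot get lost.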

0

Of course it is inefficient!
base64 is designed to store binary data as text, so the block is larger after conversion (every 3 input bytes become 4 output characters).
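
To make the size increase concrete, here is a quick sketch using the standard `base64` module (the sample data is arbitrary and only chosen for illustration):

```python
import base64

data = bytes(range(256)) * 3        # 768 arbitrary bytes
encoded = base64.b64encode(data)    # b64encode needs bytes, not str or a list of lines

print(len(data), len(encoded))      # 768 1024
print(base64.b64decode(encoded) == data)
```

Note that `b64encode` takes a bytes object, so a file has to be opened in `'rb'` mode and read with `f.read()` before encoding.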

By the way, what kind of efficiency do you want? Compactness?

If so, your second sample is much closer to what you want.

Also, in your task you lose information! Are you aware of this?

Here is a sample showing how to store and restore.

It stores the data in an easy-to-understand hex-in-text format, just for the sake of a demo. If you want compactness, you can easily modify the code to store the data in a binary file; if you want a 00011001 view, that modification is easy too.

import math
# make a long test string of random bases
import numpy as np
s=''.join((str(x) for x in np.random.randint(4,size=33)))\
    .replace('0','A').replace('1','T').replace('2','G').replace('3','C')

def store_(s):
    size=len(s) # the data is padded to a multiple of 8, so remember the true length and store it with the data
    s2=s.replace('A','0').replace('T','0').replace('G','1').replace('C','1')\
        .ljust( int(math.ceil(size/8.)*8),'0') # pad with '0' on the right to a multiple of 8
    a=(hex( int(s2[i*8:i*8+8],2) )[2:].rjust(2,'0') for i in range(len(s2)//8))
    return ''.join(a),size

yourDataAsHexInText,sizeToStore=store_(s)
print(yourDataAsHexInText,sizeToStore)


def restore_(s,size=None):
    if size is None: size=len(s)*4 # no stored length: keep all bits (4 per hex digit)
    a=( bin(int(s[i*2:i*2+2],16))[2:].rjust(8,'0') for i in range(len(s)//2))
    # you lose information, remember? So it's only A or G
    return (''.join(a).replace('1','G').replace('0','A') )[:size]

restore_(yourDataAsHexInText,sizeToStore)


print("so check it")
print(s ,"(input)")
print(store_(s))
print(s.replace('C','G').replace('T','A') ,"to compare with information loss")
print(restore_(*store_(s)),"restored")
print(s.replace('C','G').replace('T','A') == restore_(*store_(s)))

result in my test:

63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
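
If compactness is the goal, the same idea can write raw bytes instead of hex text. The following is only a sketch in Python 3 under that assumption; `pack_dna` and `unpack_dna` are hypothetical names, and the mapping is still the lossy A/T = 0, G/C = 1 one from the question.

```python
def pack_dna(s):
    """Pack a DNA string into bytes, one bit per base (A/T -> 0, G/C -> 1)."""
    bits = s.translate(str.maketrans("ATGC", "0011"))
    padded = bits.ljust(-(-len(bits) // 8) * 8, "0")  # pad right to whole bytes
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, len(s)  # keep the true length so the padding can be dropped


def unpack_dna(packed, size):
    """Rebuild the (lossy) sequence: only A and G can come back."""
    bits = "".join(format(byte, "08b") for byte in packed)
    return bits[:size].translate(str.maketrans("01", "AG"))


packed, size = pack_dna("AGCAATGCCGATGTTCATCGTATACTTTGACTA")
print(packed.hex(), size)        # 63c9308a00 33
print(unpack_dna(packed, size))  # AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA
```

The `packed` bytes can be written with `open(path, 'wb')`; the 33 bases of the test string then take 5 bytes instead of 10 hex characters.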

7 Comments

Thanks a lot for your answer! But I am a bit lost; could you please explain your code a bit more? Also, what is the input supposed to be here? It seems that your input consists only of A, T, G and C.
Maybe I did not understand correctly, but yes, this sample expects a string of "A, T, G and C". If you want to go from binary to DNA as in your sample, you could use just a modified restore_ (change it to read bin instead of hex). To explain: store_ splits the input string into groups of 8, interprets each group as a binary integer value and stores it (in hex, but you can store it in any format you want). restore_ interprets every 2-symbol fragment as an integer (here you can change to any format you want) and converts it back.
Oh. So to be concise: I want to convert a file or a text to DNA. As in the example, I enter 'test' and it returns the 'equivalent' in DNA. That is what I want for files. So I would need to have the file turned into binary to be able to convert it into DNA.
It's meaningless from my point of view, but it's easy if you treat every char of the input string as binary. // s="test" (''.join((bin(ord(x))[2:].rjust(8,'0') for x in s)).replace('1','G').replace('0','A') ) // You can use that on its own or as in the sample; it is a modified version of the second line of restore_.