
I worked today on a simple script to checksum files with any of the available hashlib algorithms (md5, sha1, ...). I wrote and debugged it with Python 2, but when I decided to port it to Python 3 it just won't work. The funny thing is that it works for small files, but not for big files. I thought there was a problem with the way I was buffering the file, but the error message makes me think it is something related to the way I am doing the hexdigest. Here is a copy of my entire script, so feel free to copy it, use it, and help me figure out what the problem is. The error I get when checksumming a 250 MB file is:

"'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte"

I googled it, but couldn't find anything that fixes it. Also, if you see better ways to optimize it, please let me know. My main goal is to make it work 100% in Python 3. Thanks.

#!/usr/local/bin/python33
import hashlib
import argparse

def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    algorithmType = getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks   
    for path in filepaths:
        try:
            with open(path) as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk.encode())
                yield algorithmType.hexdigest()
        except Exception as e:
            print (e)

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specifies the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5", 
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):

Here is the code that works in Python 2; I will put it here in case you want to use it without having to modify the one above.

#!/usr/bin/python
import hashlib
import argparse

def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    '''
    Hashes a file. In order to reduce the amount of memory used by the script, it hashes the file in chunks instead of loading
    the whole file into memory
    ''' 
    algorithmType = hashlib.new(algorithm)  #getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks   
    for path in filepaths:
        try:
            with open(path, mode = 'rb') as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk)
                yield algorithmType.hexdigest()
        except Exception as e:
            print e

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specifies the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5", 
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    #Call generator function to yield hash value
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print hashValue
    else:
        print "Algorithm {0} is not available in this script".format(algorithm)

if __name__ == "__main__":
    main()

1 Answer


I haven't tried it in Python 3, but I get the same error in Python 2.7.5 for binary files (the only difference is that mine is with the ascii codec). Instead of encoding the data chunks, open the file directly in binary mode. In Python 3, a file opened in text mode is decoded (as UTF-8 here) while you read it, so arbitrary binary data raises exactly this UnicodeDecodeError; a small file may pass by luck if its bytes happen to form valid UTF-8:

with open(path, 'rb') as f:
    while True:
        dataChunk = f.read(blockSize)
        if not dataChunk:
            break
        algorithmType.update(dataChunk)
    yield algorithmType.hexdigest()

Apart from that, I'd use the method hashlib.new instead of getattr, and hashlib.algorithms_available to check if the argument is valid.
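A minimal sketch of that validation (hashlib.algorithms_available was added in Python 3.2, so this assumes a recent interpreter; the helper name makeHasher is just illustrative):

import hashlib

def makeHasher(algorithm="md5"):
    # Reject unknown names up front: algorithms_available lists every
    # algorithm this interpreter's hashlib can actually construct.
    name = algorithm.lower()
    if name not in hashlib.algorithms_available:
        raise ValueError("Algorithm {0} is not available".format(algorithm))
    return hashlib.new(name)

Since hashlib.new accepts the algorithm as a string, the getattr lookup and the hard-coded tuple of names in main() both go away.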


4 Comments

Thanks, I will avoid hashlib.algorithms_available since it is only available after 3.2 and not available in Python 2, but hashlib.new sure looks cleaner.
BTW, I got that error you mention, but I solved it by removing the .encode() part when updating the dataChunk. The second script I added should work for you.
@JuanCarlos It does not give any error, but it returns an incorrect md5sum unless you open the file in binary mode.
That's weird; on my system it works either way. I did the checksum of 10 files with the script and with the md5sum tool in Linux and I get the exact same result, same thing with sha1. Anyway, it works on my system either way, but just to be on the safe side I will open it in binary mode. Thanks.
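Following up on the portability concern in the comments: hashlib.algorithms_available only exists on newer interpreters, but hashlib.new itself raises ValueError for unsupported names on both Python 2 and 3, so a version-portable check could look like this sketch (the helper name tryNewHasher is hypothetical):

import hashlib

def tryNewHasher(algorithm="md5"):
    # hashlib.new raises ValueError for unknown algorithm names on
    # both Python 2 and 3, so no version-specific attribute is needed.
    try:
        return hashlib.new(algorithm.lower())
    except ValueError:
        return None

Returning None lets the caller keep the script's existing pattern of printing a message and skipping the work when the algorithm is not supported.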
