Diffing Binary Files In Python

Question

I've got two binary files. They look something like this, but the data is more random:

File A:

FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...

File B:

41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...

What I'd like is to call something like:

>>> someDiffLib.diff(file_a_data, file_b_data)

And receive something like:

[Match(pos=4, length=4)]

Indicating that in both files the bytes at position 4 are the same for 4 bytes. The sequence 44 43 42 41 would not match because they're not in the same positions in each file.

Is there a library that will do the diff for me? Or should I just write the loops to do the comparison?

docs.python.org/2/library/difflib.html - first result in google for "diff in python" — Andrey
– Andrey, Commented Apr 3, 2013 at 21:26
possible duplicate of difference between two strings in python/php — Andrey
– Andrey, Commented Apr 3, 2013 at 21:27
@Andrey thanks, I tried that, but it appears that get_matching_blocks() doesn't check if the bytes are in the same spot in each files, just that the sequence exists in each file. Otherwise, yeah, that's pretty much what I want. — omghai2u
– omghai2u, Commented Apr 3, 2013 at 21:28
So you want to get a list of every position where a match starts and the length of that match, and you don't care about sections of the file that would match if they were lined up properly? — Kyle Strand
– Kyle Strand, Commented Apr 3, 2013 at 21:30
@KyleStrand yes, I think so. Although I'm not sure what "lined up properly" would mean in this case. In my example above, I do not want the 44 43 42 41 to match because they're in different positions; if that's what you mean. — omghai2u
– omghai2u, Commented Apr 3, 2013 at 21:31

Andrew Clark · Accepted Answer · 2013-04-03 21:49:54Z

11

You can use itertools.groupby() for this, here is an example:

from itertools import groupby

# this just sets up some byte strings to use, Python 2.x version is below
# instead of this you would use f1 = open('some_file', 'rb').read()
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

matches = []
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:
        pos = next(g)
        length = len(list(g)) + 1
        matches.append((pos, length))

Or the same thing as above using a list comprehension:

matches = [(next(g), len(list(g))+1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
               if k]

Here is the setup for the example if you are using Python 2.x:

f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

edited Apr 3, 2013 at 21:49

answered Apr 3, 2013 at 21:43

Andrew Clark

210k36 gold badges285 silver badges310 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

omghai2u Over a year ago

Hot. I'm loving what you're doing there. I was hoping for a beautiful answer like this.

CDuv Nov 18 at 0:51

Would itertools.groupby() be able to detect identical sequence at different (increasing) position? With "abcdef" and "abcZZef", it only detects the first sequence ([(0, 3)]). I would like it to also get the "ef" part: [(0, 3), (4, 2)]. As a PoC, I've changed the key function to lambda i: any([f1[i] == f2[i], f1[i] == f2[i+1]]) which works, but not sure if it's the right path.

Community · Accepted Answer · 2017-05-23 12:01:31Z

3

The provided itertools.groupby solution works fine, but it's pretty slow.

I wrote a pretty naive attempt using numpy and tested it versus the other solution on a particular 16MB file I happened to have, and it was about 42x faster on my machine. Someone familiar with numpy could likely improve this significantly.

import numpy as np

def compare(path1, path2):
    x,y = np.fromfile(path1, np.int8), np.fromfile(path2, np.int8)
    length = min(x.size, y.size)
    x,y = x[:length], y[:length]

    z = np.where(x == y)[0]
    if(z.size == 0) : return z

    borders = np.append(np.insert(np.where(np.diff(z) != 1)[0] + 1, 0, 0), len(z))
    lengths = borders[1:] - borders[:-1]
    starts = z[borders[:-1]]
    return np.array([starts, lengths]).T

edited May 23, 2017 at 12:01

CommunityBot

11 silver badge

answered Jun 16, 2015 at 19:10

Kevin

2,0684 gold badges22 silver badges24 bronze badges

Collectives™ on Stack Overflow

Diffing Binary Files In Python

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related