2

I've got two binary files. They look something like this, but the data is more random:

File A:

FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...

File B:

41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...

What I'd like is to call something like:

>>> someDiffLib.diff(file_a_data, file_b_data)

And receive something like:

[Match(pos=4, length=4)]

Indicating that in both files the bytes at position 4 are the same for 4 bytes. The sequence 44 43 42 41 would not match because they're not in the same positions in each file.

Is there a library that will do the diff for me? Or should I just write the loops to do the comparison?

7
  • docs.python.org/2/library/difflib.html - first result in google for "diff in python" Commented Apr 3, 2013 at 21:26
  • possible duplicate of difference between two strings in python/php Commented Apr 3, 2013 at 21:27
  • @Andrey thanks, I tried that, but it appears that get_matching_blocks() doesn't check if the bytes are in the same spot in each files, just that the sequence exists in each file. Otherwise, yeah, that's pretty much what I want. Commented Apr 3, 2013 at 21:28
  • So you want to get a list of every position where a match starts and the length of that match, and you don't care about sections of the file that would match if they were lined up properly? Commented Apr 3, 2013 at 21:30
  • 2
    @KyleStrand yes, I think so. Although I'm not sure what "lined up properly" would mean in this case. In my example above, I do not want the 44 43 42 41 to match because they're in different positions; if that's what you mean. Commented Apr 3, 2013 at 21:31

2 Answers 2

11

You can use itertools.groupby() for this, here is an example:

from itertools import groupby

# this just sets up some byte strings to use, Python 2.x version is below
# instead of this you would use f1 = open('some_file', 'rb').read()
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

matches = []
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:
        pos = next(g)
        length = len(list(g)) + 1
        matches.append((pos, length))

Or the same thing as above using a list comprehension:

matches = [(next(g), len(list(g))+1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
               if k]

Here is the setup for the example if you are using Python 2.x:

f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
Sign up to request clarification or add additional context in comments.

2 Comments

Hot. I'm loving what you're doing there. I was hoping for a beautiful answer like this.
Would itertools.groupby() be able to detect identical sequence at different (increasing) position? With "abcdef" and "abcZZef", it only detects the first sequence ([(0, 3)]). I would like it to also get the "ef" part: [(0, 3), (4, 2)]. As a PoC, I've changed the key function to lambda i: any([f1[i] == f2[i], f1[i] == f2[i+1]]) which works, but not sure if it's the right path.
3

The provided itertools.groupby solution works fine, but it's pretty slow.

I wrote a pretty naive attempt using numpy and tested it versus the other solution on a particular 16MB file I happened to have, and it was about 42x faster on my machine. Someone familiar with numpy could likely improve this significantly.

import numpy as np

def compare(path1, path2):
    x,y = np.fromfile(path1, np.int8), np.fromfile(path2, np.int8)
    length = min(x.size, y.size)
    x,y = x[:length], y[:length]

    z = np.where(x == y)[0]
    if(z.size == 0) : return z

    borders = np.append(np.insert(np.where(np.diff(z) != 1)[0] + 1, 0, 0), len(z))
    lengths = borders[1:] - borders[:-1]
    starts = z[borders[:-1]]
    return np.array([starts, lengths]).T

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.