14

I need a (preferably simple and fast) image hashing algorithm. The hash value is used in a lookup table, not for cryptography.

Some of the images are "computer graphic" - i.e. solid-color filled rects, rasterized texts and etc., whereas there are also "photographic" images - containing rich color spectrum, mostly smooth, with reasonable noise amplitude.

I'd also like the hashing algorithm to be able to be applied to specific image parts. I mean, the image can be divided into a grid cells, and the hash function of each cell should depend only on the contents of this cell. So that one may spot quickly if two images have common areas (in case they're aligned appropriately).

Note: I only need to know if two images (or their parts) are identical. That is, I don't need to match similar images, there's no need in feature recognition, correlation, and other DSP techniques.

I wonder what is the preferred hashing algorithm.

For "photographic" images just XOR-ing all the pixels within a grid cell is ok more-or-less. The probability of the same hash value for different images is pretty low, especially because the presence of the (nearly white) noise breaks all the potential symmetries. Plus the spectrum of such a hash function looks good (any value is possible with nearly the same probability).

But such a naive algorithm may not be used with "artificial" graphics. Identical pixels, repeating patterns, geometrical offset invariance are very common for such images. XOR-ing all the pixels will give 0 for any image with even number of identical pixels.

Using something like CRT-32 looks somewhat promising, but I'd like to figure-out something faster. I thought about iterative formula, each new pixel mutates the current hash value, like this:

hashValue = (hashValue * /*something*/ | newPixelValue) % /* huge prime */

Doing modulo prime number should probably give a good dispersion, so that I'm leaning toward this option. But I'd like to know if there are better varians.

Thanks in advance.

4
  • why don't you use some plain hashing algorithm like md5? Commented Jul 6, 2012 at 7:18
  • @Karoly Horvath: Good question. Indeed this is what I need more-or-less. However MD5 is (presumably) CPU-hungry, it's designed to be a one-way hash function. OTOH I need something much simpler, since I have no security considerations. I though about CRC-32. But I'd like to figure-out something even simpler Commented Jul 6, 2012 at 14:06
  • If you do this on a lot of images, the bottleneck is going to be your disk speed.. Commented Jul 6, 2012 at 14:09
  • @Karoly Horvath: Who said it'll be on the disk? To be precise I'll give you the usage scenario: There'll be typically up to 100-200 images stored in memory (of various sizes, "typical" for a desktop computer applications). Whenever I "see" a new image - I want to know if it coincides with one I saw earlier. Commented Jul 6, 2012 at 18:42

2 Answers 2

8

Have a look at this tutorial on the phash algorithm http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html which is used to find closely matching images.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for your attention, but this is not what I want IMHO. The described algorithm is good for finding "similar" images, it's also scale-invariant. My problem is much simpler, and I want a much more efficient solution
7

If you want to make it very fast, you should consider taking a random subset of the pixels to avoid reading the entire image. Next, compute a hash function on the sequence of values at those pixels. The random subset should be selected by a deterministic pseudo-random number generator with fixed seed so that identical images produce identical subsets and consequently identical hash values.

This should work reasonably well even for artificial images. However, if you have images which differ from each other by a small number of pixels, this is going to give hash collisions. More iterations give better reliability. If that is the case, for instance, if your images set is likely to have pairs with one different pixel, you must read every pixel to compute the hash value. Taking a simple linear combination with pseudo-random coefficients would be good enough even for artificial images.

pseudo-code of a simple algorithm

Random generator = new generator(2847)  // Initialized with fixed seed
int num_iterations = 100

int hash(Image image) {
    generator.reset()   //To ensure consistency on each evaluation
    int value = 0
    for num_iteration steps {
        int nextValue = image.getPixel(generator.nextInt()%image.getSize()).getValue()
        value = value + nextValue*generator.nextInt()
    }
    return value
}

2 Comments

Thanks for the answer. I have no problem to read the entire grid cell. My grid cells are pretty small (8x8 or 16x16). Moreover, when the hash values of two images are equal - I ensure nevertheless that the images are equal. The only parameter missing is the hash function itself. What should it be?
If you don't require cryptographic security, and only worried about artificial images, then a simple linear combination of the pixel-values with random coefficients should suffice, as I described. The problem is analogous to finding the hash of an integer array such as v1 = [34,2,4,92,3], v2 = [10,3,5,20,3]. Your goal is to find hashes of them to see which ones are equal. Pick a randomly chosen fixed vector m = [72,37,1,4,34] initially. For every input vector, the hash value of v1 is v1*m = 34*72 + 2*37 + 4*1 + 92*4 + 3*34. You can calculate this number modulo any prime too, if you like.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.