implementing custom hashing method

Question

I am trying to implement a custom hashing method (multiplication method):

def h(k, M):
    y = hash(k)
    T = (math.sqrt(5)-1)/2
    return(int(M*((y*T) - int(y * T))))

It always returns zero. I tested it and (y*T) -> returns floating value (for example 10,666666). int(y * T) -> returns integer value (for example 10). But if i do (y*T) - int(y * T) it always returns 0.0. My target is to call something like h('test', 10) and get a number as return, but it always returns 0.0. Why is it so?

When I run this code and do h('test',10) it outputs 7, not 0. — hankd
– hankd, Commented Oct 29, 2013 at 21:26
wierd, is use Python 3.3.2 and doing h('test',10) returns always 0. Hank Ditton, which version did you use? can it be python version related? — MoneyPot
– MoneyPot, Commented Oct 29, 2013 at 21:28

Tim Peters · Accepted Answer · 2013-10-29 21:41:35Z

4

Are you running on a 64-bit system? If so, y will be a 64-bit integer, and T is about 0.6, so, for example,

>>> import random
>>> y = random.randrange(2**64) # some 64-bit int
>>> y
17364376918466400468
>>> yt = y * 0.6
>>> yt
1.041862615107984e+19
>>> yt - int(yt)
0.0

A float only has 53 bits of precision, so the odds greatly favor that there will be no bits "after the decimal point" when converting a 64-bit int to a float.

On a 32-bit system, hash() returns 32-bit ints, so this problem won't appear.

If this is the problem, then you could try various workarounds, like adding:

y = abs(y)
y = (y >> 32) ^ (y & 0xffffffff)  # collapse to 32 bits

answered Oct 29, 2013 at 21:41

Tim Peters

71.4k14 gold badges133 silver badges140 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

MoneyPot Over a year ago

yes, it was the problem! Adding the workaround fixed it. Big thanks!

Farmer Joe · Accepted Answer · 2013-10-29 22:10:43Z

This problem stems down to the way floating point numbers are stored in a computer.

In short: They are stored as a limited amount of significant figures, a base, and an exponent. The computer then knows to scale the significant figures by the base raised to the exponent in order to obtain the value. The amount of data is machine specific: for a 32-bit machine, 23-bits are used for the significant figure, 8 for the exponent, and 1 for the base, a 64-bit machine will have 53 bits for the sig figs, 8 for the exponent, and 1 for the base.

Addition and Subtraction are then done by adding/subtracting the difference between the significant digits and the exponent.

You are generating very large integers for hash(k) and trying to take the difference between the rounded down int(y*T) and the floating point y*T. When the Python interpreter tries to take the difference between a float and an int, it converts the int to a floating point y*T will store a certain number of significant figures. The problem arises when you try to get a low order of magnitude difference from two high order magnitude numbers, or generally anytime the order of magnitude for the difference differs greatly from the numbers involved. The low order significant digits will then be lost in calculation.

Here is the a version I edited to test out your method. The added argument c is a constant which I suspect will help normalize your results.

import math

def h(k,M,c):
    y = hash(k)
    print "hash = ", y
    T = (math.sqrt(5)-1)/(2*c)
    print "y*T = ", y*T
    print "int(y*T) = ", int(y*T)
    print "(y*T) - int(y * T) = ",(y*T) - int(y * T)
    print "M*((y*T) - int(y * T)) = ", M*((y*T) - int(y * T))
    return(int(M*((y*T) - int(y * T))))

print(h('test',2,c))

As you increase c, essentially making the difference in the two numbers occur at a closer and closer order of magnitude, you begin to see that the value for (y*T) - int(y * T) moves away from 0. Sample output is below:

>>>h('test',2,10)
hash =  2314058222102390712
y*T =  1.43016663321e+17
int(y*T) =  143016663320543088
(y*T) - int(y * T) =  0.0
M*((y*T) - int(y * T)) =  0.0
h(test,2,10) =  0
>>>h('test',2,1000)
hash =  2314058222102390712
y*T =  1.43016663321e+15
int(y*T) =  1430166633205430
(y*T) - int(y * T) =  0.75
M*((y*T) - int(y * T)) =  1.5
h(test,2,1000) =  1

>>>h('test',2,10000000)
hash =  2314058222102390712
y*T =  1.43016663321e+11
int(y*T) =  143016663320
(y*T) - int(y * T) =  0.543090820312
M*((y*T) - int(y * T)) =  1.08618164062
h(test,2,10000000) =  1

>>>h('test',2,10000000000000)
hash =  2314058222102390712
y*T =  143016.663321
int(y*T) =  143016
(y*T) - int(y * T) =  0.66332054307
M*((y*T) - int(y * T)) =  1.32664108614
h(test,2,10000000000000) =  1

As an added example of the phenomenon I am talking about:

y = hash('test')
print y
y = float(y)
print y
y = int(y)
print y

outputs:

2314058222102390712
2.3140582221e+18
2314058222102390784

Just by simply switching to a float and back to an int the last two digits are no longer reliable, so it can be seen that anything below this will also be lost.

Collectives™ on Stack Overflow

implementing custom hashing method

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related