improve hashing using genetic programming/algorithm

Question

I'm writing a program which can significantly lessen the number of collisions that occur while using hash functions like 'key mod table_size'. For this I would like to use Genetic Programming/Algorithm. But I don't know much about it. Even after reading many articles and examples I don't know that in my case (as in program definition) what would be the fitness function, target (target is usually the required result), what would pose as the population/individuals and parents, etc.

Please help me in identifying the above and with a few codes/pseudo-codes snippets if possible as this is my project.

Its not necessary to be using genetic programming/algorithm, it can be anything using evolutionary programming/algorithm.

thanks..

Kris · Accepted Answer · 2014-11-04 11:48:06Z

My advice would be: don't do this that way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly. If you need a hash function to use, there is plenty to choose from.

However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then as you noticed there will be complex issues of getting fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.

You may look up e.g. 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.

Another question is how do you want to represent your function? Just a boolean expression? Something built from word operations like AND, XOR, NOT, ROT ? Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.

Persixty · Accepted Answer · 2014-11-04 14:36:58Z

Broadly fitness is clearly minimize the number of collisions in your 'hash modulo table-size' model. The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.

Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).

So what that boils down to is what table-sizes to try and what niceness measure to apply. Niceness is context dependent. You might measure 'fullest bucket' as a measure of 'worst case' insert/find time. You might measure sum of squares of bucket sizes as a measure of 'average' insert/find time based on uniform distribution of amongst the keys look-up.

Finally you would need to decide what table-size (or sizes) to test at. Conventional wisdom often uses primes because hash modulo prime tends to be nicely volatile to all the bits in hash where as something like hash modulo 2^n only involves the lower n-1 bits. To keep computation down you might consider the series of next prime larger than each power of two. 5(>2^2) 11 (>2^3), 17 (>2^4) , etc. up to and including the first power of 2 greater than your 'sample' size.

There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.

If your 'space' of potential hash functions don't all have the same execution time you should also factor in 'cost'. It's fairly easy to define very good hash functions but execution time can be a significant factor.

Collectives™ on Stack Overflow

improve hashing using genetic programming/algorithm

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related