You asked "Or, am I missing something? Please enlighten me."
Yes, you are missing something.
The HashMap implementation itself protects against poor-quality hash functions:
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
So your resulting hashcodes in your example are:
k1 - Before: 3366 After: 3566
k2 - Before: 3367 After: 3567
k3 - Before: 3368 After: 3552
So even in your small sample of three elements, one of them was knocked out of sequence by the rehash. Now, this doesn't guard against aggressively evil hashCodes (return randomInt(); or return 4; simply can't be protected against), but it does guard against poorly written hashCodes.
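If you want to reproduce these numbers yourself, a small standalone sketch like the following will do it (the hash method is copied from the JDK source quoted above; the rest is mine):

public class SupplementalHashDemo {
    // Copied from the JDK HashMap source quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        // Swap in the "...longer" keys used below to reproduce the second list.
        for (String key : new String[] { "k1", "k2", "k3" }) {
            int before = key.hashCode();
            System.out.println(key + " - Before: " + before + " After: " + hash(before));
        }
    }
}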
I should also point out that you can change things a great deal by using non-trivial inputs. Consider, for example, the following strings:
k1longer - Before: 1237990607 After: 1304548342
k2longer - Before: 2125494288 After: 2040627866
k3longer - Before: -1281969327 After: -1178377711
Notice how different the lower bits are: those are the only bits that matter for choosing a bucket, because the size of the backing table is always a power of two. It's actually documented that way in the code:
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry[] table;
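Because the length is always a power of two, the bucket index is just the hash ANDed with (length - 1), so only the low bits survive. A minimal sketch of that step (the JDK 6/7 source has an equivalent one-line helper named indexFor):

static int indexFor(int h, int length) {
    // length is a power of two, so (length - 1) is a mask of its low bits;
    // e.g. length 16 -> mask 0b1111, so only bits 0-3 of h pick the bucket.
    return h & (length - 1);
}

With the default 16-bucket table, your three short keys would land in buckets 3566 & 15 = 14, 3567 & 15 = 15, and 3552 & 15 = 0.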
The rehashing does a fairly decent job of making sure that the upper bits (which are otherwise ignored by the hash table) still have an impact on the lower bits. Here's the mapping of original hashCode bit positions and the bits they affect:
00: 00000000000000000000000000000001
01: 00000000000000000000000000000010
02: 00000000000000000000000000000100
03: 00000000000000000000000000001000
04: 00000000000000000000000000010001
05: 00000000000000000000000000100010
06: 00000000000000000000000001000100
07: 00000000000000000000000010001001
08: 00000000000000000000000100010010
09: 00000000000000000000001000100100
10: 00000000000000000000010001001000
11: 00000000000000000000100010010000
12: 00000000000000000001000100100001
13: 00000000000000000010001001000010
14: 00000000000000000100010010000100
15: 00000000000000001000100100001000
16: 00000000000000010001001000010001
17: 00000000000000100010010000100010
18: 00000000000001000100100001000100
19: 00000000000010001001000010001001
20: 00000000000100010010000100010011
21: 00000000001000100100001000100110
22: 00000000010001001000010001001100
23: 00000000100010010000100010011000 # means a 1 in the 23rd bit position will
24: 00000001000100100001000100110001 # cause positions 4, 5, 8, 12, and 20 to
25: 00000010001001000010001001100010 # also be altered
26: 00000100010010000100010011000100
27: 00001000100100001000100110001001
28: 00010001001000010001001100010010
29: 00100010010000100010011000100100
30: 01000100100001000100110001001000
31: 10001001000010001001100010010000
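You can regenerate that table by feeding hashCodes with a single bit set through hash() and printing the results in binary; a sketch (my code, not from the JDK):

public class BitInfluenceTable {
    // Copied from the JDK HashMap source quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (int bit = 0; bit < 32; bit++) {
            int result = hash(1 << bit);   // a hashCode with only this one bit set
            // Left-pad the binary representation to 32 characters.
            String bits = String.format("%32s", Integer.toBinaryString(result)).replace(' ', '0');
            System.out.printf("%02d: %s%n", bit, bits);
        }
    }
}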
So, your worries about "degrading big-O lookup time from O(1) to be...who knows how bad...maybe worse than O(log n)" and "Wouldn't Sun provide a default hashCode function that will work well for large data sets?" may be put to rest - they have safeguards in place to prevent that from happening.
If it gives you any peace of mind at all, here are the author tags for this class. They're literally all stars in the Java world. (Comments with # are mine.)
* @author Doug Lea # Formerly a Java Community Process Executive Committee member
* @author Josh Bloch # Chief Java architect at Google, amongst other things
* @author Arthur van Hoff # Done too many hardcore Java things to list...
* @author Neal Gafter # Now a lead on the C# team at Microsoft, used to be team lead on javac