Does anyone know how Java implements its hash tables (HashSet or HashMap)? Given the various types of objects that one may want to put in a hash table, it seems very difficult to come up with a hash function that would work well for all cases.
-
13If only there was some way of looking at the Java source code, goddamit!oxbow_lakes– oxbow_lakes2009-10-29 23:59:08 +00:00Commented Oct 29, 2009 at 23:59
-
3Not everyone realises it's available, unfortunately. That's why this site is here. If you have the link, post it (as JG did below)Brian Agnew– Brian Agnew2009-10-30 00:09:36 +00:00Commented Oct 30, 2009 at 0:09
5 Answers
HashMap and HashSet are very similar. In fact, the second contains an instance of the first.
A HashMap contains an array of buckets in order to contain its entries. Array size is always powers of 2. If you don't specify another value, initially there are 16 buckets.
When you put an entry (key and value) in it, it decides the bucket where the entry will be inserted calculating it from its key's hashcode (hashcode is not its memory address, and the the hash is not a modulus). Different entries can collide in the same bucket, so they'll be put in a list.
Entries will be inserted until they reach the load factor. This factor is 0.75 by default, and is not recommended to change it if you are not very sure of what you're doing. 0.75 as load factor means that a HashMap of 16 buckets can only contain 12 entries (16*0.75). Then, an array of buckets will be created, doubling the size of the previous. All entries will be put again in the new array. This process is known as rehashing, and can be expensive.
Therefore, a best practice, if you know how many entries will be inserted, is to construct a HashMap specifying its final size:
new HashMap(finalSize);
2 Comments
HashMap takes the capacity, so you should divide by the load factor and round up, but I think specifying the capacity is almost always more effort than it is worth.)Java depends on each class' implementation of the hashCode() method to distribute the objects evenly. Obviously, a bad hashCode() method will result in performance problems for large hash tables. If a class does not provide a hashCode() method, the default in the current implementation is to return some function (i.e. a hash) of the the object's address in memory. Quoting from the API doc:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
6 Comments
Object.hashCode you should see that it clearly does not return the memory address. (On reasonable implementations, the value is stored in the object header. Sun's implementation initialises the value on first use, using a slight rehash of the memory address at the time of initialisation.)There are two general ways to implement a HashMap. The difference is how one deals with collisions.
The first method, which is the one Java users, makes every bucket in a the HashMap contain a singly linked list. To accomplish this, each bucket contains an Entry type, which caches the hashCode, has a pointer to the key, pointer to the value, and a pointer to the next entry. When a collision occurs in Java, another entry is added to the list.
The other method for handling collisions, is to simply put the item into the next empty bucket. The advantage of this method is it requires less space, however, it complicates removals, as if the bucket following the removed item is not empty, one has to check to see if that item is in the right or wrong bucket, and shift the item if it originally has collided with the item being removed.
1 Comment
IdentityHashMap and the hash map used in ThreadLocal use the probing algorithm (in Sun's JRE).In my own words:
An Entry object is created to hold the reference of the Key and Value.
The HashMap has an array of Entry's.
The index for the given entry is the hash returned by key.hashCode()
If there is a collision ( two keys gave the same index ) , the entry is stored in the .next attribute of the existing entry.
That's how two objects with the same hash could be stored into the collection.
From this answer we get:
public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}
Let me know if I got something wrong.