HashMap Internal Implementation Analysis in Java
Everything you need to know about Java's HashMap.
Join the DZone community and get the full member experience.
Join For FreeNote: Based on the response from my technical blog I am posting this article here so that it would be visible to a wider audience.
java.util.HashMap.java
/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
*/
static final int MAXIMUM_CAPACITY = 1 << 30;
java.util.HashMap.java
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 16;
/**
* The load factor used when none specified in constructor.
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;
This code block defines the default size of an array as 16 (always a power of 2) and the load factor as 0.75, so that the HashMap's capacity will double in size by recomputing the hashcodes of the existing data structure elements any time the HashMap reaches 75% (in this case 12) of its current size (16). To avoid rehashing the data structure as elements grow, it is best practice to explicitly give the size of the HashMap when creating it.
Do you foresee any problem with this resizing of hashmap in java?
Since java is multi-threaded, it is very possible that more than one thread might be using the same HashmMap, and then they both realize the need for resizing the HashMap at the same time, which leads to a race condition.
What is a race condition with respect to hashmaps?
When two or more threads see the need for resizing the same HashMap, they might end up adding the elements of the old bucket to the new bucket simultaneously. As a result, this might lead to infinite loops. In case of collision, i.e, when there are different keys with the same same hashcode, internally, we use a single linked list to store elements. We store every new element at the head of the linked list to avoid tail traversing. At the time of resizing, the entire sequence of objects in the linked list gets reversed, during which there are chances of infinite loops.
For example, let's assume there are three keys with the same hashcode stored in a linked list inside a bucket:
[below format is in object_value(current_address, next_address) ]
Initial structure: 1(100, 200) –> 2(200, 300) –> 3(300, null)
After resizing by thread-1: 3(300, 200) –> 2(200, 100) –> 1(100, null)
When thread-2 starts resizing, its again starts with 1st element by placing it at the head:
1(100, 300) –> 3(300, 200) –> 2(200, 100) ==> which becomes a infinite loop for next insertion and thread hangs here.
java.util.HashMap.java
/**
* Associates the specified value with the specified key in this map.
* If the map previously contained a mapping for the key, the old
* value is replaced.
*
* @param key key with which the specified value is to be associated
* @param value value to be associated with the specified key
* @return the previous value associated with <tt>key</tt>, or
* <tt>null</tt> if there was no mapping for <tt>key</tt>.
* (A <tt>null</tt> return can also indicate that the map
* previously associated <tt>null</tt> with <tt>key</tt>.)
*/
public V put(K key, V value) {
if (key == null)
return putForNullKey(value);
int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
addEntry(hash, key, value, i);
return null;
}
Here's the code:
regenerates the hashcode using the hash(int h) method by passing user defined hashcode as an argument.
generates an index based on the regenerated hashcode and length of the data structure.
over-rides the element if the key exists; otherwise, it will create a new entry in the HashMap at the index generated in the previous step.
Step three is straight forward, but let us dive into the methods in steps one and two, as they are slightly more complicated.
Note: These two methods are very important in order to understand the internal working functionality of HashMap in OpenJDK.
java.util.HashMap.java
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
/**
* Returns index for hash code h.
*/
static int indexFor(int h, int length) {
return h & (length-1);
}
Here, "h" is hashcode(because of its int data type, it is 32 bit) and "length" is DEFAULT_INITIAL_CAPACITY (because of its int data type, it is 32 bit).
The comments in the code block above mean that if in case the algorithm we wrote for hashcode generation does not distribute/mix lower bits evenly, it will lead to more collisions. For example, we have hashcode logic of “empId*deptId,” and if deptId is even, it would generate even hashcodes because any number multiplied by an even number is always even. If we directly depend on these hashcodes to compute the index and store our objects into hashmap then:
odd places in the hashmap are always empty.
because of #1, it would leave us to use only even places and hence double the number of collisions
For example,
I am considering some hash codes which our code might generate, which are very valid as they are different, but we will prove these to be useless soon
1111110111011010101101010111110
1100110111011010101011010111110
1100000111011010101110010111110
I am considering these sequences directly (without using hash function) and pass it for indexFor method, where we do AND operation between 'hashcode' and 'length-1(which will always give sequence of 1's as length is always power of 2)'
As we are considering the length as default length, i.e, 16, binary representation of 16-1 is 1111
this is what happens inside indexFor method
1111110111011010101101010111110 & 0000000000000000000000000001111 = 1110
1100110111011010101011010111110 & 0000000000000000000000000001111 = 1110
1100000111011010101110010111110 & 0000000000000000000000000001111 = 1110
What is a bucket and what can be the maximum number of buckets in HashMap?
A bucket is an instance of a linked list (Entry Inner Class in my previous post) , and we can have as many buckets as the length of the HashMap at maximum. For example, in a HashMap of length eight, there can be a maximum of 8 buckets; each is an instance of a linked list.
From this, we understand that all the objects with these different hashcodes would have the same index, which means they would all go into the same bucket, which is a BIG-FAIL, as it leads to an arraylist complexity O(n) instead of O(1).
Comment from above source code says…
that otherwise encounter collisions for hashCodes that do not differ in lower bits.
Notice this sequence of 0-15 (2-power-4), its the default size of Hashtable
0000 - 0 1000 - 8
0001 - 1 1001 - 9
0010 - 2 1010 - 10
0011 - 3 1011 - 11
0100 - 4 1100 - 12
0101 - 5 1101 - 13
0110 - 6 1110 - 14
0111 - 7 1111 - 15
If we notice here, the HashMap with power-of-two length 16(2^4), only the last four digits matter in the allocation of buckets, and these are the four binary lower bit digit variations that play a prominent role in identifying the right bucket.
Keeping the above sequence in mind, we regenerated the hashcode from hash(int h) by passing the existing hascode, which makes sure there is enough variation in the lower bits of the hashcode. We then pass it to the indexFor() method; this will ensure the lower bits of hashcode are used to identify the bucket, and the other higher bits are ignored.
For example, take the same hashcode sequences from the above example:
this is what happens inside indexFor method
1111110111011010101101010111110 (our hash code) when regenerated with hash(int h) method, it generates new hashcode 1111001111110011100110111011010
1100110111011010101011010111110 ==> 1100000010010000101101110011001
1100000111011010101110010111110 ==> 1100110001001000011011110001011
passing these set of new hashcodes to indexFor() method
1111001111110011100110111011010 & 0000000000000000000000000001111 = 1010
1100000010010000101101110011001 & 0000000000000000000000000001111 = 1001
1100110001001000011011110001011 & 0000000000000000000000000001111 = 1011
Here, it is clear that because of the regenerated hashcode, the lower bits are distributed/mixed, leading to a unique index, which leads to different buckets avoiding collisions.
Why only these magic numbers 20, 12, 7 and 4? It is explained in the book: “The Art of Computer Programming” by Donald Knuth.
Here, we are XORing the most significant bits of the number into the least significant bits (20, 12, 7, 4). The main purpose of this operation is to make the hashcode differences visible in the least significant bits, so that the HashMap elements can be distributed evenly across the buckets.
java.util.HashMap.java
/**
* Associates the specified value with the specified key in this map.
* If the map previously contained a mapping for the key, the old
* value is replaced.
*
* @param key key with which the specified value is to be associated
* @param value value to be associated with the specified key
* @return the previous value associated with <tt>key</tt>, or
* <tt>null</tt> if there was no mapping for <tt>key</tt>.
* (A <tt>null</tt> return can also indicate that the map
* previously associated <tt>null</tt> with <tt>key</tt>.)
*/
public V put(K key, V value) {
if (key == null)
return putForNullKey(value);
int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
addEntry(hash, key, value, i);
return null;
}
Going back to previous steps:
regenerates the hashcode using the hash(int h) method by passing user-defined hashcode as an argument.
generates an index based on the re-generated hashcode and length of the data structure.
overrides the element if the key exists; otherwise, it will create a new entry in the HashMap at the index generated in step two.
In step three, we can ask...
What happens when two different keys have same hascode?
if the keys are equal, i.e, to-be-inserted key and already-inserted key’s hashcodes are same and the keys are same (via reference or via equals() method), then override the previous key-value pair with the current key-value pair.
if keys are not equal, then store the key-value pair in the same bucket as that of the existing keys.
When does a collision happen in a HashMap?
It happens in the second case in the above question.
How do you retrieve the correct value object when two keys with same hashcode are stored in a HashMap?
Using the hashcode, we go to the right bucket, and using equals, we find the right element in the bucket and then return it.
How do different keys with the same HashCode stored in hashmap?
Usually, the is in the bucket, but technically they are all stored in a single linked list. Little difference is that insertion of new element to the linked list is made at the head instead of tail to avoid tail traversal.
java.util.HashMap.java
/**
* Returns the value to which the specified key is mapped,
* or {@code null} if this map contains no mapping for the key.
*
* <p>More formally, if this map contains a mapping from a key
* {@code k} to a value {@code v} such that {@code (key==null ? k==null :
* key.equals(k))}, then this method returns {@code v}; otherwise
* it returns {@code null}. (There can be at most one such mapping.)
*
* <p>A return value of {@code null} does not <i>necessarily</i>
* indicate that the map contains no mapping for the key; it's also
* possible that the map explicitly maps the key to {@code null}.
* The {@link #containsKey containsKey} operation may be used to
* distinguish these two cases.
*
* @see #put(Object, Object)
*/
public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}
re-generates the hashcode using hash(int h) method by passing user defined hashcode as an argument
generates index based on the re-generated hashcode and length of the data structure.
point to the right bucket, i.e, table[i], and traverse through the linked list, which is constructed based on Entry inner class
when keys are equal and their hashcodes are equal then return the value mapped to that key
Originally published on 7/1/15
Opinions expressed by DZone contributors are their own.
Comments