What's Wrong With Hashcode in java.lang.String?
This overview of strings and hashcode covers how hashcode works, how collisions happen, and the danger of improperly using strings as keys.
Join the DZone community and get the full member experience.
Join For FreeOne of the most significant criteria of every hash function is the tendency for collisions. Hash functions inside the JDK are not the exception. The main idea of a collision attack is finding two different messages, m1 and m2, such that hash(m1) = hash(m2). In this article, I would like to show that this problem is reproducible in every program written in Java and how to get around it.
Firstly, let's consider the internals of the hashcode function in java.lang.String:
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
As you see hashcode in java.lang.String class is computed as:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
This is nothing more than a polynomial accumulation method. The multiplier of this function is 31 and it has its own rationale: Joshua Bloch in Effective Java explained this solution: "The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically." (from Chapter 3, Item 9: Always override hashcode when you override equals, page 48).
The collisions in hashcode are generated trivially:
If String a and String b have a common prefix and the same length — if n and the statement 31*(b[n-2] - a[n-2]) == (a[n-1] - b[n-1]) is true — it means that first and second strings have the same hashcode.
Let's consider this example:
String a = "Aa";
String b = "BB";
System.out.println(a.hashCode());
System.out.println(b.hashCode());
System.out.println(31 * ('C' - 'D') == ('B' - 'a'));
System.out.println(31 * ('B' - 'A') == ('a' - 'B'));
System.out.println("common_prefixDB".hashCode());
System.out.println("common_prefixCa".hashCode());
If you run this example, you will see the same hashcode for "Aa" and "BB" — 2112 — and for the two last strings with "common_prefix" — 1027514780. You can generate your own collisions in a simple loop statement.
With this fact, it follows that the usage of a string like a key in HashMap is the mistake — for instance, if the programmer used a String key in a HashMap for storing HTTP headers, the attacker can use it for DoS attacks.
As a conclusion, be careful with hashcode functions.
Opinions expressed by DZone contributors are their own.
Comments