
I have a huge number of large lists of objects. Each object has a unique id. It looks something like this:

List a = {obj1, obj2, obj3}
List b = {obj3, obj4, obj5}
List c = {obj1, obj2, obj3}
// up to 100 million of them

Now I'd like to remove "List c" since it has the same content as "List a" in order to save memory.

For this purpose I simply add them all to a HashMap and check whether the key already exists. The objects are actually references in a large network graph, and if even one of them is wrong the whole application crashes. Because it is very important that different lists never end up with the same key, I don't use the default

List.hashCode()

method but do this instead:

StringBuilder sb = new StringBuilder();
for (Obj obj : list)             // Obj stands for the element type (obj1, obj2, ... above)
    sb.append(obj.getId());      // concatenate the unique ids of the list's objects
return Hashing.sha256().hashString(sb.toString(), Charsets.US_ASCII).toString();

This works perfectly fine; it is just very slow. Is there any way to achieve the same result in less time?
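Put together, the dedup step looks roughly like this (Node stands in for my graph objects and dedupByContentHash is just an illustrative name; this is a sketch, not the exact production code):

import com.google.common.base.Charsets;
import com.google.common.hash.Hashing;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Node {                              // stand-in for the graph objects; each has a unique id
    private final long id;
    Node(long id) { this.id = id; }
    long getId() { return id; }
}

class Dedup {
    // Keep only the first list seen for each content hash.
    static List<List<Node>> dedupByContentHash(List<List<Node>> allLists) {
        Map<String, List<Node>> seen = new HashMap<>();
        for (List<Node> list : allLists) {
            StringBuilder sb = new StringBuilder();
            for (Node node : list) {
                sb.append(node.getId()).append(',');  // separator so ids 1,23 differ from 12,3
            }
            String key = Hashing.sha256()
                                .hashString(sb.toString(), Charsets.US_ASCII)
                                .toString();
            seen.putIfAbsent(key, list);
        }
        return new ArrayList<>(seen.values());
    }
}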

  • Did you try the default hashCode of your list? java.util.AbstractList computes a hash from each object in the list. toString is a slow operation and it is not needed. If the default hashCode of the list is too slow you should have a look at the hashCode of the objects in the list. Commented Aug 5, 2016 at 13:50
  • I'm not following why you think that List's hashCode() implementation does not serve your purpose. Commented Aug 5, 2016 at 13:50
  • Because it is very important that there will never be the same key for different objects: Why is that so important to you? Obviously a SHA256 hash will be very slow :) Commented Aug 5, 2016 at 13:50
  • Do you mean that lists whose elements differ must be guaranteed to have different hash codes (i.e. you want a perfect hash)? That cannot be guaranteed at the level of abstraction of your question. In particular, your existing implementation does not guarantee it. Commented Aug 5, 2016 at 13:53
  • In fact, if you are computing int hash codes for 100M distinct objects, then you are consuming around 2% of all the available hash codes. The technique you describe has a reasonably high probability of producing a few hash collisions in that case. Commented Aug 5, 2016 at 13:56
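A rough back-of-the-envelope birthday estimate of that collision risk, assuming uniformly distributed 32-bit hash codes over n = 10^8 distinct lists (the numbers below are an illustration, not from the thread):

    expected colliding pairs ≈ n(n-1)/2 / 2^32
                             ≈ (10^8)^2 / (2 × 4.29 × 10^9)
                             ≈ 1.2 million

So with 100 million lists, 32-bit collisions are effectively certain, and equals() has to back up the hash code in any case.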

1 Answer


Use a HashSet and the regular hashCode() and equals() methods of List to remove duplicates. Their implementations are similar to your idea.

So:

Set<List<String>> uniques =
    new HashSet<>(Arrays.asList(a, b, c));  // {a, b} — c is dropped as a duplicate of a
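This relies on the list elements implementing hashCode() and equals() consistently. A minimal sketch of what that could look like for the graph objects, assuming they are identified by their unique numeric id (the Node class and its field are illustrative placeholders, not taken from the question):

class Node {
    private final long id;               // the unique id mentioned in the question

    Node(long id) { this.id = id; }
    long getId() { return id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Node && ((Node) o).id == this.id;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }
}

AbstractList.hashCode() then combines the element hash codes (hash = 31 * hash + elementHash for each element), and HashSet only falls back to equals() for lists whose hash codes collide.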

5 Comments

Sorry, I don't get it. When I use the default hashCode method of List I get an int. With 100 million objects the probability of a collision is very high, since the range of int is only around 4 billion. It is crucial to avoid collisions.
That's when equals comes into play: if two lists end up with the same hash code, equality is checked.
Yes! And remember that it is efficient because the equals method is only invoked when there is a collision.
The problem is that, out of my 100 million lists, about 90% will need to be removed. So the equals method will be called almost all the time, and performance slows down even further.
Yes, but whatever hash function you use, you still have to check equality for those 90%. At most 10% of the time will the hash alone tell you that the list is unique. So you are trying to optimize something that is not your bottleneck. Try it first.
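If the goal is to actually free memory by having the graph share one instance per distinct list, a small sketch of how that canonicalization could look, using List's own equals/hashCode as suggested in the answer (Canonicalizer, canonicalize and pool are illustrative names, not from the thread):

import java.util.List;
import java.util.Map;

class Canonicalizer {
    // Return the pooled instance for this list content, inserting it if it is new.
    static <T> List<T> canonicalize(Map<List<T>, List<T>> pool, List<T> list) {
        List<T> existing = pool.putIfAbsent(list, list);  // List.equals/hashCode decide duplicates
        return existing != null ? existing : list;
    }
}

Every place in the graph that holds a list reference would then store canonicalize(pool, list) instead of list, so later duplicates such as c become unreachable and can be garbage-collected.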
