
I have a huge number of large lists of objects. Each object has a unique id. It looks something like this:

List a = {obj1, obj2, obj3}
List b = {obj3, obj4, obj5}
List c = {obj1, obj2, obj3}
// up to 100 million of them

Now I'd like to remove "List c" since it has the same content as "List a" in order to save memory.

For this purpose I simply add them all to a HashMap and check whether the key already exists. The objects are actually references in a large network graph, and if even one of them is wrong the whole application crashes. Because it is very important that different lists never end up with the same key, I don't use the default

List.hashCode()

method but do this instead:

StringBuilder sb = new StringBuilder();
for (Obj obj : list)             // Obj stands for the element type (obj1, obj2, ... above)
    sb.append(obj.getId());      // concatenate the unique ids of the list's objects
return Hashing.sha256().hashString(sb.toString(), Charsets.US_ASCII).toString();

This works perfectly fine; it is just very slow. Is there any way to achieve the same result in less time?
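Put together, the dedup step looks roughly like this (Node stands in for my graph objects and dedupByContentHash is just an illustrative name; this is a sketch, not the exact production code):

import com.google.common.base.Charsets;
import com.google.common.hash.Hashing;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Node {                              // stand-in for the graph objects; each has a unique id
    private final long id;
    Node(long id) { this.id = id; }
    long getId() { return id; }
}

class Dedup {
    // Keep only the first list seen for each content hash.
    static List<List<Node>> dedupByContentHash(List<List<Node>> allLists) {
        Map<String, List<Node>> seen = new HashMap<>();
        for (List<Node> list : allLists) {
            StringBuilder sb = new StringBuilder();
            for (Node node : list) {
                sb.append(node.getId()).append(',');  // separator so ids 1,23 differ from 12,3
            }
            String key = Hashing.sha256()
                                .hashString(sb.toString(), Charsets.US_ASCII)
                                .toString();
            seen.putIfAbsent(key, list);
        }
        return new ArrayList<>(seen.values());
    }
}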

  • Did you try the default hashCode of your list? java.util.AbstractList computes a hash from each object in the list. toString is a slow operation and it is not needed. If the default hashCode of the list is too slow you should have a look at the hashCode of the objects in the list. Commented Aug 5, 2016 at 13:50
  • I'm not following why you think that List's hashCode() implementation does not serve your purpose. Commented Aug 5, 2016 at 13:50
  • Because it is very important that there will never be the same key for different objects: Why is that so important to you? Obviously a SHA256 hash will be very slow :) Commented Aug 5, 2016 at 13:50
  • Do you mean that lists whose elements differ must be guaranteed to have different hash codes (i.e. you want a perfect hash)? That cannot be guaranteed at the level of abstraction of your question. In particular, your existing implementation does not guarantee it. Commented Aug 5, 2016 at 13:53
  • In fact, if you are computing int hash codes for 100M distinct objects, then you are consuming around 2% of all the available hash codes. The technique you describe has a reasonably high probability of producing a few hash collisions in that case. Commented Aug 5, 2016 at 13:56
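A rough back-of-the-envelope birthday estimate of that collision risk, assuming uniformly distributed 32-bit hash codes over n = 10^8 distinct lists (the numbers below are an illustration, not from the thread):

    expected colliding pairs ≈ n(n-1)/2 / 2^32
                             ≈ (10^8)^2 / (2 × 4.29 × 10^9)
                             ≈ 1.2 million

So with 100 million lists, 32-bit collisions are effectively certain, and equals() has to back up the hash code in any case.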

1 Answer


Use a HashSet and the regular hashCode() and equals() methods of List to remove duplicates. Their implementations are similar to your idea.

So:

Set<List<String>> uniques =
    new HashSet<>(Arrays.asList(a, b, c));  // {a, b} — c is dropped as a duplicate of a
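This relies on the list elements implementing hashCode() and equals() consistently. A minimal sketch of what that could look like for the graph objects, assuming they are identified by their unique numeric id (the Node class and its field are illustrative placeholders, not taken from the question):

class Node {
    private final long id;               // the unique id mentioned in the question

    Node(long id) { this.id = id; }
    long getId() { return id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Node && ((Node) o).id == this.id;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }
}

AbstractList.hashCode() then combines the element hash codes (hash = 31 * hash + elementHash for each element), and HashSet only falls back to equals() for lists whose hash codes collide.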

5 Comments

Sorry, I don't get it. When I use the default hashCode method of List I get an int. With 100 million objects the probability of a collision is very high, since the range of int is only around 4 billion. It is crucial to avoid collisions.
That's when equals comes into play: if two lists end up with the same hash code, equality is checked.
Yes! And remember that it is efficient because the equals method is only invoked when there is a collision.
The problem is that, out of my 100 million lists, about 90% will need to be removed. So the equals method will be called almost all the time, and performance slows down even further.
Yes, but whatever hash function you use, you still have to check equality for those 90%. At most 10% of the time will the hash alone tell you that the list is unique. So you are trying to optimize something that is not your bottleneck. Try it first.
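If the goal is to actually free memory by having the graph share one instance per distinct list, a small sketch of how that canonicalization could look, using List's own equals/hashCode as suggested in the answer (Canonicalizer, canonicalize and pool are illustrative names, not from the thread):

import java.util.List;
import java.util.Map;

class Canonicalizer {
    // Return the pooled instance for this list content, inserting it if it is new.
    static <T> List<T> canonicalize(Map<List<T>, List<T>> pool, List<T> list) {
        List<T> existing = pool.putIfAbsent(list, list);  // List.equals/hashCode decide duplicates
        return existing != null ? existing : list;
    }
}

Every place in the graph that holds a list reference would then store canonicalize(pool, list) instead of list, so later duplicates such as c become unreachable and can be garbage-collected.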
