5

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings - the array contains millions or tens of millions of strings. The array is already prepopulated, so the optimization goal is only removing dups, not preventing dups from initially populating it.

I was thinking along the lines of doing a sort and then binary search to get a log(n) search instead of an n (linear) search. This would give me n log n + n searches, which, although better than an unsorted (n^2) search, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)

Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!

2 Comments
  • 2
    Why don't you want to use the collections API? Commented Apr 6, 2012 at 15:30
  • 1
    All problems of massive something with time and space efficiency seem to be solved by hashing these days. If they did not want you to use collections API, I suspect they want you to describe a hashing function on your own. Commented Apr 6, 2012 at 23:59

7 Answers

7

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:

HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));

If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
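If the Collections API really is off limits, a hand-rolled hash set is not much code. Here is a minimal open-addressing sketch (the class name, sizing policy, and load factor are my own illustrative choices, not from any standard API):

```java
// Sketch of a minimal open-addressing hash set for deduplicating strings
// without java.util collections. Sizing and probing choices are illustrative.
final class StringHashSet {
    private String[] table;
    private int size;

    StringHashSet(int expected) {
        // pick a power-of-two capacity at least twice the expected size
        int cap = Integer.highestOneBit(Math.max(16, expected * 2) - 1) << 1;
        table = new String[cap];
    }

    /** Returns true if s was not present before. */
    boolean add(String s) {
        if (size * 2 >= table.length) grow();
        int mask = table.length - 1;
        int i = (s.hashCode() & 0x7fffffff) & mask;
        while (table[i] != null) {
            if (table[i].equals(s)) return false; // already present
            i = (i + 1) & mask;                   // linear probing
        }
        table[i] = s;
        size++;
        return true;
    }

    private void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        size = 0;
        for (String s : old) if (s != null) add(s);
    }

    int size() { return size; }
}
```

Deduplicating is then one pass over the array, keeping each string for which `add` returns true.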


7 Comments

Good question - this was an interview question that I had been asked. I had proposed the quicksort + adjacent compare, but that wasn't good enough for them. I'm pretty sure they're right - I was hoping to get input from the folks here on what would be even better than n log n + n.
@PreatorDarmatheon: Building a hash set would probably be O(n) assuming a reasonable implementation and low collisions. But please give the context in the future.
I see- by reasonable - what pitfalls are you suggesting if the implementation strategy is flawed? Any good resource for building such a hashset for the criteria I'm facing?
@PreatorDarmatheon: All kinds of things could go wrong if you implement it badly, of course. I'd look up hash tables on Wikipedia if I were you. But it's unlikely that you'd ever want to actually implement it yourself these days - you'd use someone else's implementation. The important point is to know that it's the right approach.
The whole point of saying no collections API is because they don't want hashing because it's way too expensive here.
5

ANALYSIS

Let's perform some analysis:

  1. Using HashSet. Time complexity - O(n). Space complexity O(n). Note that it requires about 8 * array size bytes (8-16 bytes per entry - a reference plus a new node object).

  2. Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).

  3. Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)

  4. Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).

In the case of Heap Sort you can throw away duplicates on the fly, so you save a final pass after sorting.

CONCLUSION

  1. If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet - this solution seems to be optimal.

  2. If space is a concern - then QuickSort + one pass.

  3. If space is a big concern - implement a Heap that throws away duplicates on the fly. It's still O(n * log n), but without additional space.
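The heap idea in point 3 can be sketched as follows: repeatedly extract the maximum and write it into the tail of the same array only when it differs from the previously extracted value. This is my own illustrative implementation of the approach, not code from the answer; it returns a compacted copy at the end for convenience:

```java
// Sketch: in-place heap sort that drops duplicates while extracting,
// so no separate dedup pass is needed. Helper names are mine.
static String[] heapDedupe(String[] a) {
    int h = a.length;                           // current heap size
    for (int i = h / 2 - 1; i >= 0; i--) siftDown(a, i, h); // build max-heap
    int w = a.length;                           // start of sorted, unique tail
    while (h > 0) {
        String max = a[0];
        a[0] = a[--h];                          // remove the root
        siftDown(a, 0, h);
        if (w == a.length || !a[w].equals(max))
            a[--w] = max;                       // keep only the first occurrence
    }
    String[] out = new String[a.length - w];
    System.arraycopy(a, w, out, 0, out.length);
    return out;                                 // ascending, duplicate-free
}

static void siftDown(String[] a, int i, int n) {
    while (true) {
        int l = 2 * i + 1, r = l + 1, big = i;
        if (l < n && a[l].compareTo(a[big]) > 0) big = l;
        if (r < n && a[r].compareTo(a[big]) > 0) big = r;
        if (big == i) return;
        String t = a[i]; a[i] = a[big]; a[big] = t;
        i = big;
    }
}
```

The unique tail always stays behind the shrinking heap, so the two regions never overlap.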

2 Comments

Good, except for the heap idea. Throwing away duplicates on the fly - well, really?
When the heap is built and you are taking the largest from the top, if it equals the previous largest, do not prepend it to the result array.
2

I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).

For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort
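A sketch of such a dedup-merge (not in-place, for clarity; the method name and recursion structure are mine, not from the answer):

```java
// Sketch: mergesort whose merge step skips values equal to the last one
// emitted, so the result comes back sorted and duplicate-free.
static String[] sortDedupe(String[] a, int lo, int hi) {   // range [lo, hi)
    if (hi - lo <= 1) {
        String[] one = new String[hi - lo];
        if (hi > lo) one[0] = a[lo];
        return one;
    }
    int mid = (lo + hi) >>> 1;
    String[] left = sortDedupe(a, lo, mid);
    String[] right = sortDedupe(a, mid, hi);
    String[] merged = new String[left.length + right.length];
    int i = 0, j = 0, k = 0;
    while (i < left.length || j < right.length) {
        String next;
        if (j == right.length
                || (i < left.length && left[i].compareTo(right[j]) <= 0))
            next = left[i++];
        else
            next = right[j++];
        if (k == 0 || !merged[k - 1].equals(next))
            merged[k++] = next;        // drop duplicates during the merge
    }
    return k == merged.length ? merged : java.util.Arrays.copyOf(merged, k);
}
```

Because both runs are already sorted and deduplicated, comparing against only the last emitted value is enough to catch every duplicate.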

Comments

1

Creating a hashset to handle this task is way too expensive. Demonstrably, in fact, the whole point of them telling you not to use the Collections API is that they don't want to hear the word hash. So that leaves the following code.

Note that you offered them binary search AFTER sorting the array: that makes no sense, which may be the reason your proposal was rejected.

OPTION 1:

public static void removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    for(int i=1; i<input.length; i++){
        // == compares references and only works if equal strings are the
        // same object (e.g. interned); see Option 3 for an equals() version
        if(input[i-1] == input[i])
            input[i-1]=null;
    }
}

OPTION 2:

public static String[] removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort here: n log n
    int size = 1;
    for(int i=1; i<input.length; i++){
        // as in Option 1, != is reference comparison; equals() is safer
        if(input[i-1] != input[i])
            size++;
    }
    String output[] = new String[size];
    output[0]=input[0];
    int n=1;
    for(int i=1;i<input.length;i++)
        if(input[i-1]!=input[i])
            output[n++]=input[i];
    //final step: either return output or copy output into input;
    //here I just return output
    return output;
}

OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array; if that is unacceptable, you must make a copy.

public static String[] removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    int outputLength = 1; // the first element of a run always survives
    for(int i=1; i<input.length; i++){
        // I think equals is safer, but are nulls allowed in the input???
        if(input[i-1].equals(input[i]))
            input[i-1]=null;
        else
           outputLength++;
    }

    // check if there were zero duplicates
    if (outputLength == input.length)
       return input;

    String[] output = new String[outputLength];
    int idx = 0;
    for ( int i=0; i<input.length; i++)   // start at 0 so input[0] isn't lost
       if (input[i] != null)
          output[idx++] = input[i];

    return output;
}

8 Comments

I like this general approach, though, for safety, I'd use equals() instead of ==. See edited Option 3.
Of course! I first wrote it with int[] because it was easier to test. Will edit
check my edited Option 3, which is based upon your Option 1/2 but only does the comparison loop once.
One idea for a speedup - do the quicksort based upon the hashcode of the string, much faster than the actual String. But then the loop to compare adjacent elements is much much trickier.
Unless you were one of the interviewers, can you say what makes you believe so strongly that they "don't want to hear the word hash"?
0

Hi, do you need to keep them in an array? It would be faster to use a hash-based collection such as a Set, where each value is stored only once.

If you put all entries into a set collection type, you can use the

 HashSet(int initialCapacity) 

constructor to prevent memory expansion at run time.

  Set<T> mySet = new HashSet<T>(Arrays.asList(someArray));

Building the set this way is O(n), as long as the backing table does not have to be expanded.
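Note that to actually avoid rehashing, the initial capacity has to account for HashSet's default load factor of 0.75. A short sketch (the data is illustrative):

```java
// Sketch: size the HashSet up front so n entries stay below the default
// 0.75 load factor, avoiding rehashes while it fills.
String[] someArray = {"a", "b", "a", "c"};              // illustrative data
int initialCapacity = (int) (someArray.length / 0.75f) + 1;
java.util.Set<String> mySet = new java.util.HashSet<>(initialCapacity);
for (String s : someArray) mySet.add(s);
// mySet now holds only the distinct strings
```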

Comments

0

Since this is an interview question, I think they want you to come up with your own implementation instead of using the set api.

Instead of sorting it first and comparing afterwards, you can build a binary tree and create an empty array to store the result.

The first element in the array will be the root.

  1. If the next element is equal to the node, return. -> this removes the duplicate elements

  2. If the next element is less than the node, compare it to the left, else compare it to the right.

Keep doing the above 2 steps until you reach the end of the tree; then you can create a new node, knowing it has no duplicate yet. Insert this new node's value into the array.

After traversing all elements of the original array, you get a new array with no duplicates, in the original order.

Traversing takes O(n) and searching the binary tree takes O(log n) (insertion should only take O(1) since you are just attaching the node and not re-allocating/balancing the tree), so the total should be O(n log n).
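The approach above might be sketched like this (an unbalanced BST; the class and method names are my own, and in the worst case of pre-sorted input the tree degenerates to O(n) per lookup):

```java
// Sketch: deduplicate via a BST while preserving the original order.
// A string is appended to the output only when its BST insert succeeds.
final class OrderedDedup {
    private static final class Node {
        final String value;
        Node left, right;
        Node(String v) { value = v; }
    }

    /** Returns the distinct strings of input, in first-occurrence order. */
    static String[] dedupe(String[] input) {
        Node root = null;
        String[] tmp = new String[input.length];
        int n = 0;
        for (String s : input) {
            if (root == null) { root = new Node(s); tmp[n++] = s; continue; }
            Node cur = root;
            boolean inserted = false;
            while (true) {
                int cmp = s.compareTo(cur.value);
                if (cmp == 0) break;                  // duplicate: skip it
                if (cmp < 0) {
                    if (cur.left == null) { cur.left = new Node(s); inserted = true; break; }
                    cur = cur.left;
                } else {
                    if (cur.right == null) { cur.right = new Node(s); inserted = true; break; }
                    cur = cur.right;
                }
            }
            if (inserted) tmp[n++] = s;               // first occurrence kept in place
        }
        return java.util.Arrays.copyOf(tmp, n);
    }
}
```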

2 Comments

insertion should only take O(1) in what world?! I am NOT down voting this. But think about it.
Yes, in a binary search tree, average insertion takes O(log n). That O(log n) is really the cost of the search that precedes it. My suggestion was that the O(log n) search has already taken place to find the right node, so the actual insertion is just attaching the new node to the left or right of that node. Isn't that just O(1)?
0

O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.

  1. Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.

  2. Loop again through the array, with another BitSet. True means the String passes. If the hashcode for the String does not exist in the BitSet, you can just mark it as true. Else mark it as false - a possible duplicate. While you are at it, count how many possible duplicates there are.

  3. Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.

  4. Now go through the possible duplicates in the original array and binary-search in possibleDuplicates. If present, well, you are still stuck, because you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...
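Steps 1 and 2 might look roughly like this (whether java.util.BitSet counts as "Collections API" is debatable; the method name is illustrative). As the answer notes, a repeated bit only signals a POSSIBLE duplicate, since different strings can share a hashcode:

```java
// Sketch of steps 1-2: two BitSets (one for non-negative hashcodes, one for
// negative) flag strings whose hashcode has been seen before.
static boolean[] possibleDuplicates(String[] input) {
    java.util.BitSet pos = new java.util.BitSet();
    java.util.BitSet neg = new java.util.BitSet();
    boolean[] maybeDup = new boolean[input.length];
    for (int i = 0; i < input.length; i++) {
        int h = input[i].hashCode();
        java.util.BitSet bits = h >= 0 ? pos : neg;
        int idx = h >= 0 ? h : ~h;        // map negative hashcodes to [0, 2^31)
        if (bits.get(idx))
            maybeDup[i] = true;           // hashcode seen before: possible dup
        else
            bits.set(idx);
    }
    return maybeDup;
}
```

Everything not flagged here is definitely unique, so only the flagged minority needs the more expensive sort-and-search of steps 3 and 4.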

Comments
