5

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings - the array contains millions or tens of millions of strings. The array is already prepopulated, so the optimization goal is only removing dups, not preventing dups from initially populating it.

I was thinking along the lines of doing a sort and then binary search to get a log(n) search instead of an n (linear) search. This would give me n log n + n searches, which, although better than an unsorted (n^2) search, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)

Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!

2 Comments
  • 2
    Why don't you want to use the collections API? Commented Apr 6, 2012 at 15:30
  • 1
    All problems of massive something with time and space efficiency seem to be solved by hashing these days. If they did not want you to use collections API, I suspect they want you to describe a hashing function on your own. Commented Apr 6, 2012 at 23:59

7 Answers

7

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:

HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));

If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
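If the Collections API really is off limits, a hand-rolled hash set is not much code. Here is a minimal open-addressing sketch (the class name, sizing policy, and load factor are my own illustrative choices, not from any standard API):

```java
// Sketch of a minimal open-addressing hash set for deduplicating strings
// without java.util collections. Sizing and probing choices are illustrative.
final class StringHashSet {
    private String[] table;
    private int size;

    StringHashSet(int expected) {
        // pick a power-of-two capacity at least twice the expected size
        int cap = Integer.highestOneBit(Math.max(16, expected * 2) - 1) << 1;
        table = new String[cap];
    }

    /** Returns true if s was not present before. */
    boolean add(String s) {
        if (size * 2 >= table.length) grow();
        int mask = table.length - 1;
        int i = (s.hashCode() & 0x7fffffff) & mask;
        while (table[i] != null) {
            if (table[i].equals(s)) return false; // already present
            i = (i + 1) & mask;                   // linear probing
        }
        table[i] = s;
        size++;
        return true;
    }

    private void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        size = 0;
        for (String s : old) if (s != null) add(s);
    }

    int size() { return size; }
}
```

Deduplicating is then one pass over the array, keeping each string for which `add` returns true.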


7 Comments

Good question - this was an interview question that I had been asked. I had proposed the quicksort + adjacent compare, but that wasn't good enough for them. I'm pretty sure they're right - I was hoping to get input from the folks here on what would be even better than n log n + n.
@PreatorDarmatheon: Building a hash set would probably be O(n) assuming a reasonable implementation and low collisions. But please give the context in the future.
I see- by reasonable - what pitfalls are you suggesting if the implementation strategy is flawed? Any good resource for building such a hashset for the criteria I'm facing?
@PreatorDarmatheon: All kinds of things could go wrong if you implement it badly, of course. I'd look up hash tables on Wikipedia if I were you. But it's unlikely that you'd ever want to actually implement it yourself these days - you'd use someone else's implementation. The important point is to know that it's the right approach.
The whole point of saying no collections API is because they don't want hashing because it's way too expensive here.
5

ANALYSIS

Let's perform some analysis:

  1. Using HashSet. Time complexity - O(n). Space complexity O(n). Note that it requires about 8 * array size bytes (8-16 bytes per entry - a reference plus a new node object).

  2. Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).

  3. Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)

  4. Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).

In the case of Heap Sort you can throw away duplicates on the fly, so you save a final pass after sorting.

CONCLUSION

  1. If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet - this solution seems to be optimal.

  2. If space is a concern - then QuickSort + one pass.

  3. If space is a big concern - implement a Heap that throws away duplicates on the fly. It's still O(n * log n), but without additional space.
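The heap idea in point 3 can be sketched as follows: repeatedly extract the maximum and write it into the tail of the same array only when it differs from the previously extracted value. This is my own illustrative implementation of the approach, not code from the answer; it returns a compacted copy at the end for convenience:

```java
// Sketch: in-place heap sort that drops duplicates while extracting,
// so no separate dedup pass is needed. Helper names are mine.
static String[] heapDedupe(String[] a) {
    int h = a.length;                           // current heap size
    for (int i = h / 2 - 1; i >= 0; i--) siftDown(a, i, h); // build max-heap
    int w = a.length;                           // start of sorted, unique tail
    while (h > 0) {
        String max = a[0];
        a[0] = a[--h];                          // remove the root
        siftDown(a, 0, h);
        if (w == a.length || !a[w].equals(max))
            a[--w] = max;                       // keep only the first occurrence
    }
    String[] out = new String[a.length - w];
    System.arraycopy(a, w, out, 0, out.length);
    return out;                                 // ascending, duplicate-free
}

static void siftDown(String[] a, int i, int n) {
    while (true) {
        int l = 2 * i + 1, r = l + 1, big = i;
        if (l < n && a[l].compareTo(a[big]) > 0) big = l;
        if (r < n && a[r].compareTo(a[big]) > 0) big = r;
        if (big == i) return;
        String t = a[i]; a[i] = a[big]; a[big] = t;
        i = big;
    }
}
```

The unique tail always stays behind the shrinking heap, so the two regions never overlap.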

2 Comments

Good, except for the heap idea. Throwing away duplicates on the fly - well, really?
When the heap is built and you are taking the largest from the top, if it equals the previous largest, do not prepend it to the result array.
2

I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).

For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort
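A sketch of such a dedup-merge (not in-place, for clarity; the method name and recursion structure are mine, not from the answer):

```java
// Sketch: mergesort whose merge step skips values equal to the last one
// emitted, so the result comes back sorted and duplicate-free.
static String[] sortDedupe(String[] a, int lo, int hi) {   // range [lo, hi)
    if (hi - lo <= 1) {
        String[] one = new String[hi - lo];
        if (hi > lo) one[0] = a[lo];
        return one;
    }
    int mid = (lo + hi) >>> 1;
    String[] left = sortDedupe(a, lo, mid);
    String[] right = sortDedupe(a, mid, hi);
    String[] merged = new String[left.length + right.length];
    int i = 0, j = 0, k = 0;
    while (i < left.length || j < right.length) {
        String next;
        if (j == right.length
                || (i < left.length && left[i].compareTo(right[j]) <= 0))
            next = left[i++];
        else
            next = right[j++];
        if (k == 0 || !merged[k - 1].equals(next))
            merged[k++] = next;        // drop duplicates during the merge
    }
    return k == merged.length ? merged : java.util.Arrays.copyOf(merged, k);
}
```

Because both runs are already sorted and deduplicated, comparing against only the last emitted value is enough to catch every duplicate.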

Comments

1

Creating a hashset to handle this task is way too expensive. Demonstrably, in fact, the whole point of them telling you not to use the Collections API is that they don't want to hear the word hash. So that leaves the following code.

Note that you offered them binary search AFTER sorting the array: that makes no sense, which may be the reason your proposal was rejected.

OPTION 1:

public static void removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    for(int i=1; i<input.length; i++){
        // == compares references and only works if equal strings are the
        // same object (e.g. interned); see Option 3 for an equals() version
        if(input[i-1] == input[i])
            input[i-1]=null;
    }
}

OPTION 2:

public static String[] removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort here: n log n
    int size = 1;
    for(int i=1; i<input.length; i++){
        // as in Option 1, != is reference comparison; equals() is safer
        if(input[i-1] != input[i])
            size++;
    }
    String output[] = new String[size];
    output[0]=input[0];
    int n=1;
    for(int i=1;i<input.length;i++)
        if(input[i-1]!=input[i])
            output[n++]=input[i];
    //final step: either return output or copy output into input;
    //here I just return output
    return output;
}

OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array; if that is unacceptable, you must make a copy.

public static String[] removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    int outputLength = 1; // the first element of a run always survives
    for(int i=1; i<input.length; i++){
        // I think equals is safer, but are nulls allowed in the input???
        if(input[i-1].equals(input[i]))
            input[i-1]=null;
        else
           outputLength++;
    }

    // check if there were zero duplicates
    if (outputLength == input.length)
       return input;

    String[] output = new String[outputLength];
    int idx = 0;
    for ( int i=0; i<input.length; i++)   // start at 0 so input[0] isn't lost
       if (input[i] != null)
          output[idx++] = input[i];

    return output;
}

8 Comments

I like this general approach, though, for safety, I'd use equals() instead of ==. See edited Option 3.
Of course! I first wrote it with int[] because it was easier to test. Will edit
check my edited Option 3, which is based upon your Option 1/2 but only does the comparison loop once.
One idea for a speedup - do the quicksort based upon the hashcode of the string, much faster than the actual String. But then the loop to compare adjacent elements is much much trickier.
Unless you were one of the interviewers, can you say what makes you believe so strongly that they "don't want to hear the word hash"?
0

Hi, do you need to keep them in an array? It would be faster to use a hash-based collection such as a Set, where each value is stored only once.

If you put all entries into a set collection type, you can use the

 HashSet(int initialCapacity) 

constructor to prevent memory expansion at run time.

  Set<T> mySet = new HashSet<T>(Arrays.asList(someArray));

Building the set this way is O(n), as long as the backing table does not have to be expanded.
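Note that to actually avoid rehashing, the initial capacity has to account for HashSet's default load factor of 0.75. A short sketch (the data is illustrative):

```java
// Sketch: size the HashSet up front so n entries stay below the default
// 0.75 load factor, avoiding rehashes while it fills.
String[] someArray = {"a", "b", "a", "c"};              // illustrative data
int initialCapacity = (int) (someArray.length / 0.75f) + 1;
java.util.Set<String> mySet = new java.util.HashSet<>(initialCapacity);
for (String s : someArray) mySet.add(s);
// mySet now holds only the distinct strings
```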

Comments

0

Since this is an interview question, I think they want you to come up with your own implementation instead of using the set api.

Instead of sorting it first and comparing afterwards, you can build a binary tree and create an empty array to store the result.

The first element in the array will be the root.

  1. If the next element is equal to the node, return. -> this removes the duplicate elements

  2. If the next element is less than the node, compare it to the left, else compare it to the right.

Keep doing the above 2 steps until you reach the end of the tree; then you can create a new node, knowing it has no duplicate yet. Insert this new node's value into the array.

After traversing all elements of the original array, you get a new array with no duplicates, in the original order.

Traversing takes O(n) and searching the binary tree takes O(log n) (insertion should only take O(1) since you are just attaching the node and not re-allocating/balancing the tree), so the total should be O(n log n).
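The approach above might be sketched like this (an unbalanced BST; the class and method names are my own, and in the worst case of pre-sorted input the tree degenerates to O(n) per lookup):

```java
// Sketch: deduplicate via a BST while preserving the original order.
// A string is appended to the output only when its BST insert succeeds.
final class OrderedDedup {
    private static final class Node {
        final String value;
        Node left, right;
        Node(String v) { value = v; }
    }

    /** Returns the distinct strings of input, in first-occurrence order. */
    static String[] dedupe(String[] input) {
        Node root = null;
        String[] tmp = new String[input.length];
        int n = 0;
        for (String s : input) {
            if (root == null) { root = new Node(s); tmp[n++] = s; continue; }
            Node cur = root;
            boolean inserted = false;
            while (true) {
                int cmp = s.compareTo(cur.value);
                if (cmp == 0) break;                  // duplicate: skip it
                if (cmp < 0) {
                    if (cur.left == null) { cur.left = new Node(s); inserted = true; break; }
                    cur = cur.left;
                } else {
                    if (cur.right == null) { cur.right = new Node(s); inserted = true; break; }
                    cur = cur.right;
                }
            }
            if (inserted) tmp[n++] = s;               // first occurrence kept in place
        }
        return java.util.Arrays.copyOf(tmp, n);
    }
}
```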

2 Comments

insertion should only take O(1) in what world?! I am NOT down voting this. But think about it.
Yes, in a binary search tree, average insertion takes O(log n). That O(log n) is really the cost of the search that precedes it. My suggestion was that the O(log n) search has already taken place to find the right node, so the actual insertion is just attaching the new node to the left or right of that node. Isn't that just O(1)?
0

O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.

  1. Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.

  2. Loop again through the array, with another BitSet. True means the String passes. If the hashcode for the String does not exist in the BitSet, you can just mark it as true. Else mark it as false - a possible duplicate. While you are at it, count how many possible duplicates there are.

  3. Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.

  4. Now go through the possible duplicates in the original array and binary-search in possibleDuplicates. If present, well, you are still stuck, because you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...
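Steps 1 and 2 might look roughly like this (whether java.util.BitSet counts as "Collections API" is debatable; the method name is illustrative). As the answer notes, a repeated bit only signals a POSSIBLE duplicate, since different strings can share a hashcode:

```java
// Sketch of steps 1-2: two BitSets (one for non-negative hashcodes, one for
// negative) flag strings whose hashcode has been seen before.
static boolean[] possibleDuplicates(String[] input) {
    java.util.BitSet pos = new java.util.BitSet();
    java.util.BitSet neg = new java.util.BitSet();
    boolean[] maybeDup = new boolean[input.length];
    for (int i = 0; i < input.length; i++) {
        int h = input[i].hashCode();
        java.util.BitSet bits = h >= 0 ? pos : neg;
        int idx = h >= 0 ? h : ~h;        // map negative hashcodes to [0, 2^31)
        if (bits.get(idx))
            maybeDup[i] = true;           // hashcode seen before: possible dup
        else
            bits.set(idx);
    }
    return maybeDup;
}
```

Everything not flagged here is definitely unique, so only the flagged minority needs the more expensive sort-and-search of steps 3 and 4.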

Comments
