7

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.

Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Please note that this is not strictly theoretical. It will be tested on a machine (an Intel i7 with 8 GB RAM), so I do have to take the cost of a string comparison etc. into consideration. That is also why I'm wondering if it could be faster to split the strings in two, first compare the integer part (as an int comparison will be quicker), and then the string part. Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
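
Something like this is what I have in mind (just a sketch, assuming every line really has the XXXNNN layout with the digits in positions 3-5):

#include <string.h>

/* Compare two XXXNNN keys: the NNN halves as small ints first (one cheap
   integer compare), and the XXX prefixes only when the numbers match. */
static int keys_equal(const char *a, const char *b) {
    int na = (a[3] - '0') * 100 + (a[4] - '0') * 10 + (a[5] - '0');
    int nb = (b[3] - '0') * 100 + (b[4] - '0') * 10 + (b[5] - '0');
    if (na != nb)
        return 0;
    return memcmp(a, b, 3) == 0;
}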

27
  • 1
    If the data isn't sorted, you'll have to search through it from start to end and compare each item, no way around it. Depending on how much you are going to use the data, it may or may not be worth sorting it first. Commented Oct 14, 2016 at 13:47
  • 3
    You don't need to sort to find duplicates. You could insert each string in a hash table and check for duplicates on collisions. Expected complexity for hash lookup and insertion is O(1) and therefore you can have a duplicate-detection algorithm that runs in expected O(N). In fact, I'd be very surprised if such an approach is not featured in some other question. Commented Oct 14, 2016 at 13:51
  • 1
    You can, in a single pass, assign each string an integer value between 0 and 255 (a kind of signature, like a checksum mod 256), create an array of 256 lists, and append to each list the line numbers sharing that signature. This helps group which lines may be identical (a sketch of this idea follows the comments). Commented Oct 14, 2016 at 14:02
  • 1
    Yes, sorry, 3 digits, base 10. @chux 6 million, known Commented Oct 14, 2016 at 14:49
  • 2
    Hmmmm if 26 different characters and 10 different digits make 26*26*26*10*10*10 combinations, why not use a 17,576,000 bit table (2,197,000 bytes)? Commented Oct 14, 2016 at 14:56
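
A minimal C sketch of the signature/bucket comment above, assuming the whole file is already in memory as lines[0..nlines-1] (those names are illustrative, not from the comment):

#include <string.h>

/* One-byte signature: a checksum of the line, mod 256. */
static unsigned char signature(const char *s) {
    unsigned sum = 0;
    while (*s && *s != '\n')
        sum += (unsigned char)*s++;
    return (unsigned char)(sum % 256);
}

/* Lines with a unique signature cannot be duplicates; only lines that share
   a signature need to be compared directly. Stops at the first duplicate. */
int has_duplicate_by_signature(char **lines, size_t nlines) {
    size_t counts[256] = { 0 };
    for (size_t i = 0; i < nlines; i++)
        counts[signature(lines[i])]++;               /* single grouping pass */

    for (size_t i = 0; i < nlines; i++) {
        unsigned char sig = signature(lines[i]);
        if (counts[sig] < 2)
            continue;                                /* no other line can equal this one */
        for (size_t j = i + 1; j < nlines; j++)
            if (signature(lines[j]) == sig && strcmp(lines[i], lines[j]) == 0)
                return 1;
    }
    return 0;
}

With only 256 buckets and millions of lines every bucket will be crowded, so in practice you would want far more buckets; the sketch only illustrates the grouping idea.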

6 Answers

7

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).

If X ∈ {A..Z} and N ∈ {0..9}, that gives 26³ × 10³ = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.


Here you go: a Python script to generate all 17,576,000 possible keys:

import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print "%s%03d" % (''.join(prefix), numeric)   

and a simple C bitset filter:

#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);

    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;

    return byte & mask;
}

which for your purposes you'd use like so:

#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha   = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}

#include <stdio.h>
int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}

This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:

useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt

real    0m0.474s
user    0m0.436s
sys 0m0.036s

useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000

real    0m0.467s
user    0m0.452s
sys 0m0.016s

Admittedly, the input file is cached in memory for these runs.


2 Comments

Nice linear-time solution to the problem, using a 2 MB table to avoid complex data structures or algorithms. The performance looks like O(n), since for a given input the Bloom filter needs to be checked only once per element, with a constant-time lookup.
Yep, absolutely. I charted a quick experiment over 1..7 million input rows, to check nothing untoward was happening, and up to measurement jitter it was straight as an arrow.
4

The reasonable answer is to keep the algorithm with the smallest complexity. I encourage you to use a Hashtable to keep track of inserted elements; the overall complexity is O(n), because lookup and insertion in a hash table are expected O(1). In your case I suggest running the check while reading the file.

public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}

3 Comments

Is hashTable.Add() linear time? I suspect it would need to re-size itself from time to time if the number of items was not known prior to making the hash table.
It's amortized average-case constant time. Some operations will be linear time, but so long as those happen with inverse-linear frequency, it amortizes out. And, some hash/key combinations will cause lots of collisions and degrade from constant to linear time again, but on average we hope that doesn't happen too often.
Agree about amortized average-case constant time, etc. Note: OP later commented the file size was known at the beginning.
3

A fast but memory-inefficient solution would use

// Entries are XXXNNN; each line is read as one base-36 number
static char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */];  // zero-initialized; or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {  // istream: the open input file
  unsigned long index = strtoul(buffer, NULL, 36);
  if (found[index]++) {  // non-zero means this key was already seen
    Dupe_found();
    break;
  }
}

The trouble with the post is that it wants the "fastest algorithm", but does not detail memory concerns and their relative importance to speed. So speed must be king, and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.

Comments

2

Depending on how many distinct values there can be, you have some options:

  • Sort the whole array and then scan for a repeated adjacent element; complexity O(n log n), but it can be done in place, so memory is O(1) (see the sketch after this list).
  • Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it costs some extra memory.
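
A minimal C sketch of the first option, assuming the file has already been read into an in-memory array of strings (lines and nlines are illustrative names):

#include <stdlib.h>
#include <string.h>

static int cmp_str(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort, then any duplicate must sit next to its twin. */
int has_duplicate_sorted(char **lines, size_t nlines) {
    qsort(lines, nlines, sizeof *lines, cmp_str);    /* O(n log n) */
    for (size_t i = 1; i < nlines; i++)
        if (strcmp(lines[i - 1], lines[i]) == 0)
            return 1;                                /* adjacent equal => duplicate */
    return 0;
}

The check after sorting is a single cache-friendly pass over adjacent elements, so the qsort call dominates the cost.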

Comments

2

The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).

The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.

However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(max - min) * N / CPUs" to "(max - min) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.

The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD; which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string, and to pack more into each cache line.
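
A rough pthreads sketch of that idea, assuming the strings have already been converted to integer keys in an array (keys, nkeys, NTHREADS and any_duplicate are illustrative names, not from the answer):

#include <pthread.h>
#include <stdlib.h>

#define NKEYS    (26 * 26 * 26 * 1000)   /* 17,576,000 possible XXXNNN keys */
#define NTHREADS 4

static const int *keys;                  /* shared, read-only */
static size_t nkeys;
static volatile int dup_found;           /* early-exit hint only */

struct range { int lo, hi; unsigned char *bits; };

static void *scan_range(void *arg) {
    struct range *r = arg;
    for (size_t i = 0; i < nkeys && !dup_found; i++) {
        int k = keys[i];
        if (k < r->lo || k >= r->hi)
            continue;                    /* another thread's key range */
        size_t bit = (size_t)(k - r->lo);
        unsigned char mask = (unsigned char)(1u << (bit % 8));
        if (r->bits[bit / 8] & mask) {   /* already seen: duplicate */
            dup_found = 1;
            break;
        }
        r->bits[bit / 8] |= mask;        /* private bitmap, no atomics needed */
    }
    return NULL;
}

int any_duplicate(const int *k, size_t n) {
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    keys = k;
    nkeys = n;
    dup_found = 0;
    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = (int)((long long)NKEYS * t / NTHREADS);
        r[t].hi = (int)((long long)NKEYS * (t + 1) / NTHREADS);
        r[t].bits = calloc(((size_t)(r[t].hi - r[t].lo) + 7) / 8, 1);
        pthread_create(&tid[t], NULL, scan_range, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    for (int t = 0; t < NTHREADS; t++)
        free(r[t].bits);
    return dup_found;
}

Compile with -pthread. Each thread writes only to its own bitmap, so no atomic instructions are needed; the shared dup_found flag is just a best-effort early-exit signal.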

8 Comments

I'm not sure I see how this would work with multiple CPUs... Firstly, I'm wasting cycles just splitting it up and starting processes. Now, you then state CPU N doing all integers in a given range. However, would that not require sorted input data?
@Phroggyy: Starting threads is a "once only" cost. For huge arrays that cost is insignificant (dwarfed by the cost of processing a huge number of entries). For tiny arrays the cost of starting threads would be significant, but you wouldn't care about performance for tiny arrays in the first place. If you're expecting to have to handle everything from tiny arrays to huge arrays you can adjust the number of threads (e.g. 1 thread only if array is tiny, 2 threads if array is medium sized, .... up to the number of CPUs you have).
Right @Brendan but is 6M really that huge? I have my current program running in 0.36s, and the slowest line of code is filter[byteIndex] = byte | mask, which is why I'm wondering if a second, or even third, thread is worth the cost
@Phroggyy: It doesn't need the input to be sorted - if a CPU/thread is only handling integers within a certain range then it'd read from the array sequentially (a nice "cache friendly"/pre-fetchable access pattern) and simply skip over any integers that are outside that range.
Right gotcha, so each process would be dealing with its own subset
2

Since you have several million entries, I think the best algorithm would be counting sort. Counting sort does essentially what you asked: it sorts an array by counting how many times every element occurs. So you could write a function that applies the counting idea to the array:

#include <stdlib.h>

/* Early-exit counting pass: returns 1 as soon as some value occurs twice, 0 otherwise. */
int counting_sort(int a[], int n, int max)
{
    int *count = calloc(max + 1, sizeof *count);  /* a VLA cannot be initialized, so allocate and zero */
    if (count == NULL)
        return -1;                                /* allocation failure */

    for (int i = 0; i < n; ++i) {
        if (++count[a[i]] >= 2) {
            free(count);
            return 1;
        }
    }
    free(count);
    return 0;
}

Where you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the maximum value found in the array. So, because you have several million entries, if M is also on the order of a few million this works in O(n) (the early exit can make the counting pass cheaper, but finding M already costs O(n)). If you also know that M can never be greater than a few million, then you can be sure this gives O(n) and not just O(max(n, M)).

You can see counting sort visualization to understand it better, here: https://www.cs.usfca.edu/~galles/visualization/CountingSort.html

Note that the above function doesn't implement counting sort exactly; it stops as soon as it finds a duplicate, which is even more efficient, since you only want to know whether a duplicate exists.
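
For completeness, a small usage sketch of the max-finding step described above (contains_duplicate is an illustrative wrapper; a[] holds whatever integer keys you derived from the strings):

int contains_duplicate(int a[], int n) {
    int max = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] > max)
            max = a[i];                  /* O(n) scan for M */
    return counting_sort(a, n, max);     /* early-exits on the first repeat */
}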

Comments
