How do I remove duplicate strings from an array in C?

Question

I have an array of strings in C and an integer indicating how many strings are in the array.

char *strarray[MAX];  
int strcount;

In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.

I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

For example:

strarray[0] = "Line 1"; 
strarray[1] = "Line 2"; 
strarray[2] = "Line 3"; 
strarray[3] = "Line 2"; 
strarray[4] = "Line 4";

would become:

strarray[0] = "Line 1"; 
strarray[1] = "Line 3"; 
strarray[2] = "Line 2"; 
strarray[3] = "Line 4";

Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

The algorithm presented below would be run every time a new string is added to the strarray.

For the interest of showing that I am trying, I will include my proposed algorithm below:

1. Search entire strarray for match to str
2. If no match, do nothing
3. If match found, put str in strarray
4. Now we have a strarray with a max of 1 duplicate entry
5. Add highest index strarray string to lowest index of temporary string array
6. Continue downwards into strarray and check each element
7. If duplicate found, skip it
8. If not, add it to the next highest index of the temporary string array
9. Reverse temporary string array and copy to strarray

Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.

The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

Thank you!

Billy ONeal · Accepted Answer · 2010-08-01 06:45:09Z

3

The typical efficient unique function is to:

Sort the given array.
Collapse each run of consecutive equal items so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.
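A minimal sketch of that first step, assuming the question's strarray of char * pointers. The names compare_strings and sort_strings are made up for illustration. The one trap is that qsort passes the comparator pointers to the array elements, so each argument is really a pointer to a char *, not the string itself.

```c
#include <stdlib.h>
#include <string.h>

/* qsort hands the comparator pointers to the array elements, so for an
   array of char * each argument has to be dereferenced once before it
   can be given to strcmp. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sorts an array of string pointers in strcmp order, in place.
   Only the pointers move; the strings themselves are untouched. */
void sort_strings(char *strarray[], size_t strcount)
{
    qsort(strarray, strcount, sizeof strarray[0], compare_strings);
}
```

Calling sort_strings(strarray, strcount) on the question's example puts the two "Line 2" entries next to each other, which is what the second step relies on. On its own this loses the original order.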

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.erase(std::unique(src.begin(), src.end()), src.end());

I know you can't use C++, but the implementation should essentially be the same.

Because you need to save the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

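Do your first sort with respect to string, remove the duplicates from the sorted set, then re-sort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order. Because qsort is not guaranteed to be stable and the unique pass keeps the first entry of each run, break ties in the first sort by descending originalPosition so that the highest-index duplicate is the one that survives.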

EDIT2: Simple C implementation example of std::unique:

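tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  tempUniqueEntry *result=first;

  /* Empty range: nothing to do (and ++first below would run past last). */
  if (first == last)
    return last;

  while (++first != last)
  {
    /* strcmp returns nonzero when the strings differ, so this keeps
       the first entry of every run of equal strings. */
    if (strcmp(result->string,first->string))
      *(++result)=*first;
  }
  /* One past the last kept entry, i.e. the new end of the range. */
  return ++result;
}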
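Putting the pieces together might look like the sketch below. It is one possible reading of the steps described here, not a tested drop-in: the function name dedupe_keep_last and the two comparators are made up for illustration, the unique pass is inlined rather than called, and ties in the first sort are broken by descending position so that the most recent duplicate is the one kept.

```c
#include <stdlib.h>
#include <string.h>

typedef struct
{
    int originalPosition;
    char *string;
} tempUniqueEntry;

/* Order by string; among equal strings put the highest original position
   first, so the unique pass keeps the most recent duplicate. */
static int by_string(const void *a, const void *b)
{
    const tempUniqueEntry *ea = a, *eb = b;
    int c = strcmp(ea->string, eb->string);
    if (c != 0)
        return c;
    return (eb->originalPosition > ea->originalPosition)
         - (eb->originalPosition < ea->originalPosition);
}

/* Order by original position, ascending. */
static int by_position(const void *a, const void *b)
{
    const tempUniqueEntry *ea = a, *eb = b;
    return (ea->originalPosition > eb->originalPosition)
         - (ea->originalPosition < eb->originalPosition);
}

/* Removes all but the highest-index copy of each string and collapses the
   array. Returns the new count, or -1 if the temporary array cannot be
   allocated. Dropped pointers are not freed; that is left to the caller. */
int dedupe_keep_last(char *strarray[], int strcount)
{
    tempUniqueEntry *tmp;
    int i, kept;

    if (strcount < 2)
        return strcount;

    tmp = malloc((size_t)strcount * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Tag every string with the index it came from. */
    for (i = 0; i < strcount; i++) {
        tmp[i].originalPosition = i;
        tmp[i].string = strarray[i];
    }

    qsort(tmp, (size_t)strcount, sizeof *tmp, by_string);

    /* Linear unique pass: keep the first entry of each run of equal strings. */
    kept = 1;
    for (i = 1; i < strcount; i++) {
        if (strcmp(tmp[kept - 1].string, tmp[i].string) != 0)
            tmp[kept++] = tmp[i];
    }

    /* Restore the original relative order of the survivors. */
    qsort(tmp, (size_t)kept, sizeof *tmp, by_position);

    for (i = 0; i < kept; i++)
        strarray[i] = tmp[i].string;

    free(tmp);
    return kept;
}
```

With the question's five-element example, strcount = dedupe_keep_last(strarray, strcount) should leave "Line 1", "Line 3", "Line 2", "Line 4" in the first four slots and return 4.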

edited Aug 1, 2010 at 6:45

answered Aug 1, 2010 at 6:00

Billy ONeal


3 Comments

Jerry Smith Over a year ago

wouldn't sorting lose the order of the elements?

Jerry Smith Over a year ago

Thank you for your edit! I am a little rusty on sorting but I can take it from here. I am going to try your idea and see how well it works. From what I understand, I need to iterate the strarray making a temp array of tempUniqueEntry in the process. Sort tempArray by string, remove duplicates, sort tempArray again by position, then reconstruct the strarray. Correct?

Billy ONeal Over a year ago

@Jerry: Yes, that's correct. You don't need to implement your own sorting algorithm; the standard library's qsort can do that. You'll just have to define comparison functions. As for remove, you may wish to follow std::unique (like this one: cplusplus.com/reference/algorithm/unique ) because it can make the entire array unique in linear time -- no constant resizing of the array will be required as things are removed. (Just reimplement std::unique yourself, replacing the predicate function with a function pointer that tests for equality, and the iterators with pointers.)

jamesdlin · Accepted Answer · 2010-08-01 06:35:12Z

1

I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

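unsigned int i;
/* Walk from the newest entry (highest index) down to the oldest. */
for (i = strcount; i > 0; i--)
{
    unsigned int j;

    /* Already flagged as a duplicate of a newer entry. */
    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    /* Flag every older copy of strarray[i - 1]. */
    for (j = i - 1; j > 0; j--)
    {
        /* Skip entries flagged earlier; strcmp must not see NULL. */
        if (strarray[j - 1] != NULL
            && strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}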

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).
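For completeness, the filtering step could be sketched like this. The function name compact_nulls is made up for illustration, and the sketch assumes that NULL is used only as the "removed" flag and never as a legitimate entry.

```c
#include <stddef.h>

/* Slides the non-NULL entries down over the NULL ones, preserving their
   relative order. Returns the number of entries kept. */
size_t compact_nulls(char *strarray[], size_t strcount)
{
    size_t kept = 0;
    size_t i;

    for (i = 0; i < strcount; i++) {
        if (strarray[i] != NULL)
            strarray[kept++] = strarray[i];
    }

    /* Clear the vacated tail so no stale pointers linger there. */
    for (i = kept; i < strcount; i++)
        strarray[i] = NULL;

    return kept;
}
```

After the flagging loop, strcount = compact_nulls(strarray, strcount) collapses the array in a single linear pass.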

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).
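One way to try the tree idea without writing a tree by hand is the POSIX tsearch family from <search.h>. This is an assumption about the platform: these functions are not part of ISO C, and POSIX does not require the tree to be balanced, although some implementations do balance it. The function name flag_earlier_duplicates is made up for illustration.

```c
#include <search.h>   /* tsearch, tfind, tdelete: POSIX, not ISO C */
#include <stddef.h>
#include <string.h>

/* The tree stores the string pointers themselves as keys, so the
   comparator receives two strings directly. */
static int compare_keys(const void *a, const void *b)
{
    return strcmp(a, b);
}

/* Walks from the highest index down and sets every string that was
   already seen at a higher index to NULL.
   Returns 0 on success, -1 if a tree insertion failed. */
int flag_earlier_duplicates(char *strarray[], size_t strcount)
{
    void *root = NULL;
    int status = 0;
    size_t i;

    for (i = strcount; i > 0; i--) {
        char *s = strarray[i - 1];

        if (s == NULL)
            continue;

        if (tfind(s, &root, compare_keys) != NULL) {
            strarray[i - 1] = NULL;       /* seen at a higher index */
        } else if (tsearch(s, &root, compare_keys) == NULL) {
            status = -1;                  /* could not insert */
            break;
        }
    }

    /* Tear the tree down node by node; the strings are left alone. */
    while (root != NULL)
        tdelete(*(void **)root, &root, compare_keys);

    return status;
}
```

The flagged entries still have to be filtered out afterwards, exactly as with the nested-loop version.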

edited Aug 1, 2010 at 6:35

answered Aug 1, 2010 at 6:28

jamesdlin


1 Comment

Jerry Smith Over a year ago

What I meant in step 5 is simply: // where 0 is the lowest index and 9 is the largest index available temparray[0] = strarray[9];

jdmichal · Accepted Answer · 2010-08-01 06:55:03Z

1

Can you control the input as it is going into the array? If so, just do something like this:

#include <stdlib.h>
#include <string.h>

// Appends a copy of toadd, removes any older duplicate, and returns the new count.
// Assumes every stored string was allocated by this function and that
// strarray has room for strcount + 1 entries.
int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const size_t toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    if (strarray[strcount] == NULL)
        return strcount; // Allocation failed; array is unchanged.
    strncpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strncmp(strarray[i], toadd, toaddlen + 1) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[k - 1] = strarray[k];

            strarray[strcount] = NULL;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}

You can always use the above function looping over the elements of an existing array, building a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure and use a linked list instead. Linked lists are much more efficient for removing elements from a location other than the end.

edited Aug 1, 2010 at 6:55

answered Aug 1, 2010 at 6:07

jdmichal


11 Comments

Billy ONeal Over a year ago

This works well; it's better than the OP's original solution. +1. But unfortunately the performance is still order n-squared :(

Jerry Smith Over a year ago

As I understand your solution, if it is in strarray already it does nothing. If it is not, it adds it. If I am correct in my understanding, this will not work. I can control the input as it is entering the array but this method would not produce the result I gave in my post. I need the surviving duplicate to be in the highest, not the lowest, index. If toadd already exists in strarray[1] it would not be added to strarray[N] where N > 1

jdmichal Over a year ago

@Jerry Smith Your example is wrong then. It should read 1, 3, 2, 4. I'll correct my solution shortly... But that is a much more expensive operation, because it will require compacting the array each time.

jdmichal Over a year ago

@Jerry Smith Added removal and compacting. Please note my PS at the end.

jdmichal Over a year ago

@Jerry Smith Thank you for understanding that some code typed into a webform in a time-pressured scenario may not compile without minor modifications, and instead understanding that the idea presented is the important part.


ninazzo · Accepted Answer · 2016-04-26 12:46:31Z

0

Sort the array with a function like qsort (run man 3 qsort in the terminal to see how it should be used), then use strcmp to compare adjacent strings and find duplicates.

If you want to maintain the original order, you could use an O(N^2) algorithm with two nested for loops: the outer loop picks each element in turn, and the inner loop scans the rest of the array to check whether the chosen element has a duplicate.

edited Apr 26, 2016 at 12:46

answered Apr 26, 2016 at 12:41

ninazzo

