2

I have an array of strings in C and an integer indicating how many strings are in the array.

char *strarray[MAX];  
int strcount;

In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.

I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

For example:

strarray[0] = "Line 1"; 
strarray[1] = "Line 2"; 
strarray[2] = "Line 3"; 
strarray[3] = "Line 2"; 
strarray[4] = "Line 4";

would become:

strarray[0] = "Line 1"; 
strarray[1] = "Line 3"; 
strarray[2] = "Line 2"; 
strarray[3] = "Line 4";

Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

The algorithm presented below would be ran every time a new string is added to the strarray.

For the interest of showing that I am trying, I will include my proposed algorithm below:

  1. Search entire strarray for match to str
  2. If no match, do nothing
  3. If match found, put str in strarray
  4. Now we have a strarray with a max of 1 duplicate entry
  5. Add highest index strarray string to lowest index of temporary string array
  6. Continue downwards into strarray and check each element
  7. If duplicate found, skip it
  8. If not, add it to the next highest index of the temporary string array
  9. Reverse temporary string array and copy to strarray

Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.

The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

Thank you!

4 Answers 4

3

The typical efficient unique function is to:

  1. Sort the given array.
  2. Verify that consecutive runs of the same item are setup so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.remove(std::unique(src.begin(), src.end()), src.end);

I know you can't use C++, but the implementation should essentially be the same.

Because you need to save the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

Do your first sort with respect to string, remove unique sets of elements on the sorted set, then resort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order.

EDIT2: Simple C implementation example of std::unique:

tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  tempUniqueEntry *result=first;
  while (++first != last)
  {
    if (strcmp(result->string,first->string))
      *(++result)=*first;
  }
  return ++result;
}
Sign up to request clarification or add additional context in comments.

3 Comments

wouldn't sorting lose the order of the elements?
Thank you for your edit! I am a little rusty on sorting but I can take it from here. I am going to try your idea and see how well it works. From what I understand, I need to iterate the strarray making a temp array of tempUniqueEntry in the process. Sort tempArray by string, remove duplicates, sort tempArray again by position, then reconstruct the strarray. Correct?
@Jerry: Yes, that's correct. You don't need to implement your own sorting algorithm; the standard library's qsort can do that. You'll just have to define comparison functions. As for remove, you may wish to follow std::unique, (like this one: cplusplus.com/reference/algorithm/unique ) because it can make the entire array unique in linear time -- no constant resizing of the array will be required as things are removed. (Just re implement std::unique yourself, replacing the predicate function with a function pointer returning equals, and the iterators with pointers)
1

I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

unsigned int i;
for (i = n; i > 0; i--)
{
    unsigned int j;

    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    for (j = i - 1; j > 0; j--)
    {
        if (strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).

1 Comment

What I meant in step 5 is simply: // where 0 is the lowest index and 9 is the largest index available temparray[0] = strarray[9];
1

Can you control the input as it is going into the array? If so, just do something like this:

int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const int toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    strncpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strncmp(strarray[i], toaddlen + 1) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.  
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[i] = strarray[k];

            strarray[strcount] = null;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}

You can always use the above function looping over the elements of an existing array, building a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure, and used a linked list instead. They are much more efficient for removing elements from a location other than the end.

11 Comments

This works well; it's better than the OP's original solution. +1. But unfortunately the performance is still order n-squared :(
As I understand your solution, if it is in strarray already it does nothing. If it is not, it adds it. If I am correct in my understanding, this will not work. I can control the input as it is entering the array but this method would not produce the result I gave in my post. I need the surviving duplicate to be in the highest, not the lowest, index. If toadd already exists in strarray[1] it would not be added to strarray[N] where N > 1
@Jerry Smith Your example is wrong then. It should read 1, 3, 2, 4. I'll correct my solution shortly... But that is a much more expensive operation, because it will require compacting the array each time.
@Jerry Smith Added removal and compacting. Please note my PS at the end.
@Jerry Smith Thank you for understanding that some code typed into a webform in a time-pressured scenario may not compile without minor modifications, and instead understanding that the idea presented is the important part.
|
0

Sort the array with an algorithm like qsort (man 3 qsort in the terminal to see how it should be used) and then use the function strcmp to compare the strings and find duplicates

If you want to mantain the original order you could use a O(N^2) complexity algorithm nesting two for, the first each time pick an element to compare to the other and the second for will be used to scan the rest of the array to find if the chosen element is a duplicate.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.