3

I have a large array list of sentences and another array list of words.

My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.

The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works for when there are not many sentences, in cases where their are, the time it takes to finish this operation is ridiculously long.

for (int i = 0; i < SENTENCES.size(); i++) {

        for (int k = 0; k < WORDS.size(); k++) {

            if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {

                //Do something
            }
        }
    }

Is there a more efficient way of doing this then a nested for loop?

6
  • How long is your list of words? Are words allowed to contain special characters? Commented Jul 17, 2015 at 14:50
  • My list of words can vary depending on external factors. But in my last few times going through it, I ended up with around 200-300 words. Commented Jul 17, 2015 at 14:51
  • " " + WORDS.get(k) + " " won't find words in the beginning or ending of sentences Commented Jul 17, 2015 at 14:52
  • It really depends on what your are trying to do in the inner if clause Commented Jul 17, 2015 at 14:52
  • 1
    if you could use a set, instead of a list, that's O(1) lookup time, instead of O(n) Commented Jul 17, 2015 at 14:53

9 Answers 9

6

There's a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.

That said, there are couple of things to try.

First, make WORDS a HashSet, the contains method will be far quicker than for an ArrayList because it's doing a hash look-up to get the value.

Second, switch the logic about a bit like this:

Iterator<String> sentenceIterator = SENTENCES.iterator();

sentenceLoop:
while (sentenceIterator.hasNext())
{
  String sentence = sentenceIterator.next();

  for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
  {
    if (WORDS.contains(word))
    {
      sentenceIterator.remove();
      continue sentenceLoop;
    }
  }      
}    

This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex) both of which should be quicker.

But bear in mind, as with all things performance you'll need to test these changes to see they improve the situation.

Sign up to request clarification or add additional context in comments.

3 Comments

Yes. The internal loop is O(sentence length) rather than O(word list length).
While this improves one part of the problem, the algorithm as a whole will still be O(N) since the main problem is the removal of the sentence from that list.
this solution really worked for me. With regards to removal, I just changed how the program worked and recorded the words to be removed to a .csv file. It works out for me all the same anyway. Thanks again.
4

I̶ ̶w̶o̶u̶l̶d̶ ̶s̶a̶y̶ ̶n̶o̶,̶ ̶b̶u̶t̶ what you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:

The sentences array list can be very large (...). While this works for when there are not many sentences, in cases where their are, the time it takes to finish this operation is ridiculously long.

The cause of this is that removal time in ArrayList takes O(N), and since you're doing this inside a loop, then it will take at least O(N^2).

I recommend using LinkedList rather than ArrayList to store the sentences, and use Iterator rather than your naive List#get since it already offers Iterator#remove in time O(1) for LinkedList.

In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and in the end replace the contents of your original List with this new List, thus saving lot of time.

Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).

Comments

1

What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.

HashSet<String> wordSet = new HashSet();
for (String word : WORDS) {
    wordSet.add(word);
}

Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.

for (String sentence : SENTENCES) {
    String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
    for (String word : sentenceWords) {
        if (wordSet.contains(word)) {
            // The sentence contains one of the special words.
            // DO SOMETHING
            break;
        }
    }
}

Comments

1

I will create a set of words from second ArrayList:

Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");

I will then iterate over the set and the first ArrayList and use Contains:

for (String word : listOfWords) {
     for(String sentence : Sentences) {
           if (sentence.contains(word)) {
                // do something
           }
     }
 }

Also, if you are free to use any open source jar, check this out:

searching string in another string

4 Comments

And how did making the words into a set benefit you in this case?
Set contains unique elements. Arraylist may contain duplicates. And what cahen mentioned.
I would assume that uniqueness is not the OP's problem.
@Codebender: What about the time taken to divide every sentence in arrayList of sentences into words ?
1

First, your program has a bug: it would not count words at the beginning and at the end of a sentence.

Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.

If words is relatively small (a few hundred items or so) you could use regex to speed things up considerably: construct a pattern like this, and use it in a loop:

StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
    if (!first) {
        regex.append('|');
    } else {
        first = false;
    }
    regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
    if (p.matcher(SENTENCES.get(i)).find()) {
        // Do something
    }
}

Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.

3 Comments

could you elaborate more after the Pattern p part? I don't see how matching any of the words in WORDS is achieved. I could look the API up, but it would save some time for future readers.
@vefthym I simplified the expression to "\b(?:quick|brown|fox)\b" to make it about as basic as it gets. Even cursory familiarity with regex library should be sufficient to figure out how it works.
forgive my ignorance, perhaps not everyone has a cursory familiarity with regex. thanks for the edit.
0

If you have enough memory you can tokenize SENTENCES and put them in a Set. Then it would be better in performance and also more correct than current implementation.

Comments

0

Well, looking at your code I would suggest two things that will improve the performance from each iteration:

  1. Remove " == true". The contains operation already returns a boolean, so it is enough for the if, comparing it with true adds one extra operation for each iteration that is not needed.
  2. Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " ") as it is a quite expensive operation because + operator creates new objects. Better use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.

Besides that, for this case I do not know any other approach, maybe you can use regular expressions if you can abstract a pattern out of those words you want to remove and have then only one loop.

Hope it helps!

2 Comments

Not comparing to true is more a matter of good style than efficiency. It doesn't add anything to the order of magnitude of the operation. As for your second advice, since Java 1.6, the + operator is implemented with StringBuilder internally so there is no difference.
I guess with == true the compiler might just optimise it in the end, but with StringBuffer it will definitely help, as per each + it will create a new object. Creating only one outside the loops and clearing it after iterations should remove the instantiation overhead.
0

If you concern about the efficiency, I think that the most effective way to do this is to use Aho-Corasick's algorithm. While you have 2 nested loops here and a contains() method (that I think takes at the best length of sentence + length of word time), Aho-Corasick gives you one loop over sentences and for checking of containing words it takes length of sentence, which is length of word times faster (+ a preprocessing time for creation of finite state machine, which is relatively small).

Comments

0

I'll approach this in more theoretical view.. If you don't have memory limitation, you can try to mimic the logic in counting sort

say M1 = sentences.size, M2 = number of word per sentences, and N = word.size
Assume all sentences has the same number of words just for simplicity
your current approach's complexity is O(M1.M2.N)

We can create a mapping of words - position in sentences. Loop through your arraylist of sentences, and change them into two dimensional jagged array of words. Loop through the new array, create a HashMap where key,value = words, arraylist of word position (say with length X).
That's O(2M1.M2.X) = O(M1.M2.X)

Then loop through your words arraylist, access your word hashmap, loop through the list of word position. remove each one. That's O(N.X)

Say you're need to give the result in arraylist of string, we need another loop and concat everything. That's O(M1.M2)

Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2)
assumming X is way smaller than N, you'll probably get better performance

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.