5

I'm having some issues with making the following regex work. I would like the following string:

"Please enter your name here"

to result in an array with the following elements:

'Please enter', 'enter your', 'your name', 'name here'

Currently, I'm using the following pattern, and then creating a matcher and iterating in the following way:

Pattern word = Pattern.compile("[\\w]+ [\\w]+");
Matcher m = word.matcher("Please enter your name here");

while (m.find()) {
    wordList.add(m.group());
}

But the result I'm getting is:

'Please enter', 'your name'

What am I doing wrong? (P.S. I checked the same regex on regexpal.com and had the same problem.) It seems like the same word won't be matched twice. What can I do to achieve the result I want?

Thanks.

---------------------------------

EDIT: Thanks for all the suggestions! I ended up doing this (because it makes it easy to specify the number of "n-grams"):

Integer nGrams = 2;
String patternTpl = "\\b[\\w']+\\b";
String concatString = "what is your age? please enter your name.";
for (int i = 0; i < nGrams; i++) {
    // Create pattern.
    String pattern = patternTpl;
    for (int j = 0; j < i; j++) {
        pattern = pattern + " " + patternTpl;
    }
    pattern = "(?=(" + pattern + "))";
    Pattern word = Pattern.compile(pattern);
    Matcher m = word.matcher(concatString);

    // Iterate over all words and populate wordList
    while (m.find()) {
        wordList.add(m.group(1));
    }
}

This results in:

Pattern: 
(?=(\b[\w']+\b)) // In the first iteration
(?=(\b[\w']+\b \b[\w']+\b)) // In the second iteration

Array:
[what, is, your, age, please, enter, your, name, what is, is your, your age, please enter, enter your, your name]

Note: Got the pattern from the following top answer: Java regex skipping matches

1
  • Split the string on spaces to get an array of words; then you can pair element[i] with element[i+1]. Of course, be careful about going out of bounds on the last element. Commented Sep 11, 2013 at 21:25

4 Answers

9

The matches can't overlap, which explains your result. Here's a potential workaround, making use of capturing groups with a positive lookahead:

Pattern word = Pattern.compile("(\\w+)(?=(\\s\\w+))");
Matcher m = word.matcher("Please enter your name here");

while (m.find()) {
    System.out.println(m.group(1) + m.group(2));
}

Output:

Please enter
enter your
your name
name here

2 Comments

Thanks, this was the closest to what I wanted to do, and this might be more efficient than the edit I did in the end. I'll look into perhaps using this instead of what I ended up doing.
How would I make this work with any number of "n-grams"? So if I would also like to match "Please enter your", "enter your name", "your name here"? EDIT: Figured it out, I can just add more (?=(\\s\\w+)), based on the number of n-grams needed.
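One way the n-gram question above might be handled, as a rough sketch: keep the answer's pattern, but repeat the "space plus word" part inside a single lookahead group instead of adding more lookaheads. If extra (?=(\\s\\w+)) groups are simply placed side by side, each one starts from the same position and captures the same next word. The class name LookaheadNGrams and the method nGrams are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadNGrams {
    // Hypothetical helper (not from the answer): overlapping n-grams for n >= 1.
    static List<String> nGrams(String text, int n) {
        // Consume one word, then look ahead (without consuming) for the
        // next n-1 words, all captured together in group 2.
        String regex = "(\\w+)(?=((?:\\s\\w+){" + (n - 1) + "}))";
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            // group(1) is the consumed word, group(2) the looked-ahead tail
            result.add(m.group(1) + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("Please enter your name here", 3));
    }
}
```

Because only the first word of each n-gram is consumed, the next call to find() starts at the following word, which is what lets the results overlap.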
1

If you want to avoid using such a specific regex, perhaps you should try a simpler solution:

import java.util.Arrays;

// Builds every pair of adjacent words from a space-separated string.
public static String[] array(final String string){
    final String[] words = string.split(" ");
    // n words produce n-1 adjacent pairs
    final String[] array = new String[words.length-1];
    for(int i = 0; i < words.length-1; i++)
        array[i] = String.format("%s %s", words[i], words[i+1]);
    return array;
}

public static void main(String args[]){
    final String[] array = array("Please enter your name here");
    System.out.println(Arrays.toString(array));
}

The output is:

[Please enter, enter your, your name, name here]
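The same split-based idea could probably be extended to n-grams of any size. The sketch below is one possible way to do it, not part of the answer above: the names SplitNGrams and nGrams are hypothetical, and it assumes words are separated by whitespace, so punctuation stays attached to the neighbouring word (unlike the \w-based regexes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitNGrams {
    // Hypothetical helper: every run of n consecutive whitespace-separated words.
    static List<String> nGrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || n < 1) {
            return result; // nothing to pair up
        }
        String[] words = trimmed.split("\\s+");
        // A window of n words slides one word at a time; the loop
        // condition keeps i + n within the bounds of the array.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("Please enter your name here", 2));
        System.out.println(nGrams("Please enter your name here", 3));
    }
}
```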

1 Comment

I might look into using this solution as well; it is certainly more efficient than looping through all the words as I'm doing now for nGrams > 1.
0

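You're not doing anything wrong. It's just the way regex matching works: each match consumes the characters it covers, and the next search resumes after them, so the matches returned by find() never overlap. (If every overlapping match had to be reported, the number of results could grow as O(n^2) in the length of the input.)

In this case you could simply search for \w+ and then post-process the matches, pairing each word with the one that follows it.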
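A minimal sketch of that post-processing idea, assuming plain \w+ words and a single space as the joiner (the class name WordPairs and the method bigrams are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // Hypothetical helper: match single words, then join each with its successor.
    static List<String> bigrams(String text) {
        Matcher m = Pattern.compile("\\w+").matcher(text);
        List<String> result = new ArrayList<>();
        String previous = null;
        while (m.find()) {
            String current = m.group();
            if (previous != null) {
                result.add(previous + " " + current);
            }
            previous = current; // remember this word for the next pair
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("Please enter your name here"));
    }
}
```

One caveat with this version: since only the words themselves are matched, words separated by punctuation (for example "age? please") are still paired, which may or may not be what you want.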


0

Something like:

Pattern word = Pattern.compile("(\\w+) ?");
Matcher m = word.matcher("Please enter your name here");

String previous = null;
while (m.find()) {
    if (previous != null)
        wordList.add(previous + m.group(1));
    previous = m.group();
}

The pattern ends with an optional space (which matches whenever the word is followed by a space, i.e. for every word except the last). m.group() returns the entire match, including the space; m.group(1) returns just the word, without the space.
