1

I need to split a string up into all possible ways without changing the order of the characters. I understand this task can be seen as tokenization or lemmatization in NLP but i am attempting it from a pure string search perspective which is simpler and more robust. Given,

dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"

Task 1: How do i generate all possible substrings such that i get:

all_possible_substrings = [['f','iretrainstation'],
['fo','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']

Task 2: Then from all_possible_substring, how could i check through to see and say that the substring set that contains all elements from the dictionary is the correct output. The desired output would be a list of substrings from the dictionary that matches the most number of characters from left to right. Desired output is where:

"".join(desire_substring_list) == str1 and \
[i for i desire_substring_list if in dictionary] == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language 
#just written without spaces)

Desired output:

'fire','train','station'

What have I done?

For task 1, i have done this but i know it wouldn't give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1))

I've done this but this only does task 2:

import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fo','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for i in all_possible_string:
  x = pattern.findall(" ".join(i))
  if "".join(x) == str1 and len([i for i in x if i in seed]) == len(x):
    print " ".join(x)
10
  • Note that your dictionary is actually a list. Commented Mar 19, 2013 at 1:24
  • Also, I'm pretty sure you need to do a little more explanation. Why is `'foo','bar','bar','str' the desired output? Commented Mar 19, 2013 at 1:25
  • updated on the desired output. Commented Mar 19, 2013 at 1:35
  • is it clearer in this context? Commented Mar 19, 2013 at 1:53
  • How do you get str1 from dictionary? And I might be misunderstanding, but wouldn't the "list of substrings from the dictionary that matches the most number of characters from left to right" always be str1 minus the last letter? (Assuming you don't want the whole string.) Commented Mar 19, 2013 at 1:55

1 Answer 1

3

For the first part you could write a recursive generator similar to this:

>>> def all_substr(string):
    for i in range(len(string)):

        if i == len(string) - 1:
            yield string

        first_part = string[0:i+1]
        second_part = string[i+1:]

        for j in all_substr(second_part):
            yield ','.join([first_part, j])


>>> for x in all_substr('apple'):
    print(x)


a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.