How to split a string and match its substrings to a list of substrings? - Python

Question

I need to split a string up into all possible ways without changing the order of the characters. I understand this task can be seen as tokenization or lemmatization in NLP but i am attempting it from a pure string search perspective which is simpler and more robust. Given,

dictionary = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"

Task 1: How do i generate all possible substrings such that i get:

all_possible_substrings = [['f','iretrainstation'],
['fo','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']

Task 2: Then from all_possible_substring, how could i check through to see and say that the substring set that contains all elements from the dictionary is the correct output. The desired output would be a list of substrings from the dictionary that matches the most number of characters from left to right. Desired output is where:

"".join(desire_substring_list) == str1 and \
[i for i desire_substring_list if in dictionary] == len(desire_substring_list)
#(let's assume, the above condition can be true for any input string since my english
#language dictionary is very big and all my strings are human language 
#just written without spaces)

Desired output:

'fire','train','station'

What have I done?

For task 1, i have done this but i know it wouldn't give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1))

I've done this but this only does task 2:

import re
seed = ['train','station', 'fire', 'a','trainer','in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fo','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for i in all_possible_string:
  x = pattern.findall(" ".join(i))
  if "".join(x) == str1 and len([i for i in x if i in seed]) == len(x):
    print " ".join(x)

Also, I'm pretty sure you need to do a little more explanation. Why is `'foo','bar','bar','str' the desired output? — mgilson
– mgilson, Commented Mar 19, 2013 at 1:25
How do you get str1 from dictionary? And I might be misunderstanding, but wouldn't the "list of substrings from the dictionary that matches the most number of characters from left to right" always be str1 minus the last letter? (Assuming you don't want the whole string.) — toxotes
– toxotes, Commented Mar 19, 2013 at 1:55

jurgenreza · Accepted Answer · 2013-03-19 04:17:31Z

3

For the first part you could write a recursive generator similar to this:

>>> def all_substr(string):
    for i in range(len(string)):

        if i == len(string) - 1:
            yield string

        first_part = string[0:i+1]
        second_part = string[i+1:]

        for j in all_substr(second_part):
            yield ','.join([first_part, j])


>>> for x in all_substr('apple'):
    print(x)


a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple

answered Mar 19, 2013 at 4:17

jurgenreza

6,1042 gold badges30 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

How to split a string and match its substrings to a list of substrings? - Python

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related