Obtaining substring from string based on substring matching and string index

Question

I have several strings that are guaranteed to contain myWord (multiple times in some cases; only the first occurrence should be handled), but the strings differ in length. Some of them contain hundreds of words, some of them only a few.

I would like to find a way to extract a snippet from the text. The rule is the following: the snippet should contain myWord plus the X words before and after it.

Something like this:

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."

myWord = "sentence"

Let's say I would like to get the word 'sentence' plus/minus 3 words, like this:

"example lorem ipsum sentence for a Stackoverflow"

I have a working solution, but it cuts the snippet by the number of characters instead of the number of words before/after myWord. So my question is: is there a more suitable solution, maybe a built-in Python function, to achieve my goal?

The current solution I use:

myWord = "mollis"
rawText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sit amet arcu vulputate, sodales arcu non, finibus odio. Aliquam sed tincidunt nisi, eu scelerisque lectus. Curabitur in nibh enim. Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida condimentum posuere. Sed et arcu finibus felis auctor mollis et id risus. Nam urna tellus, ultricies a aliquam at, euismod et erat. Cras pretium venenatis ornare. Donec pulvinar dui eu dui facilisis commodo. Vivamus eget ultrices turpis, vel egestas lacus."

# The index where the word is located
wordIndexNumber = rawText.lower().find(myWord.lower())

# The total length of the text (in chars)
textLength = len(rawText)

textPart2 = len(rawText)-wordIndexNumber

if wordIndexNumber < 80:
    textIndex1 = 0
else:
    textIndex1 = wordIndexNumber - 80

if textPart2 < 80:
    textIndex2 = textLength
else:
    textIndex2 = wordIndexNumber + 80

snippet = rawText[textIndex1:textIndex2]

print(snippet)

Use split() on your string, then apply your character-based solution to the resulting list.
– alexis, Commented Jun 26, 2018 at 13:15
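
A minimal sketch of what this comment suggests, assuming the text is split on whitespace and the word matches a token exactly. The helper name snippet_by_words and the parameter window are hypothetical and not part of the original post. It keeps the clamping idea from the question's code, but counts list items instead of characters.

```python
def snippet_by_words(text, word, window):
    """Return `word` plus up to `window` whitespace-separated tokens on each side."""
    tokens = text.split()
    # list.index() finds the first occurrence and raises ValueError if absent
    hit = tokens.index(word)
    # Clamp the start at 0, as the question does for characters; a negative
    # start would otherwise be counted from the end of the list.
    start = max(0, hit - window)
    # A stop past the end of the list is safe: slicing simply truncates.
    stop = hit + window + 1
    return " ".join(tokens[start:stop])


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(snippet_by_words(raw_text, "sentence", 3))
# example lorem ipsum sentence for a Stackoverflow
print(snippet_by_words(raw_text, "is", 3))
# This is an example lorem
```

Only the lower bound needs explicit clamping here, because Python slices tolerate an upper bound beyond the end of the list.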

Rakesh · Accepted Answer · 2018-06-26 13:14:09Z

1

This is one approach using string slicing.

Demo:

rawText = "This is an example lorem ipsum sentence for a Stackoverflow question."
myWord = "sentence"
rawTextList = rawText.split()
# Look the word up once; index() returns the first occurrence
wordIndex = rawTextList.index(myWord)
# max() keeps the start from going negative when the word is near the beginning
frontVal = " ".join(rawTextList[max(0, wordIndex - 3):wordIndex])
backVal = " ".join(rawTextList[wordIndex:wordIndex + 4])

print("{} {}".format(frontVal, backVal))

Output:

example lorem ipsum sentence for a Stackoverflow

donnyyy · Accepted Answer · 2018-06-26 13:17:02Z

1

Here is a solution using list slicing:

def get_context_around(text, word, accuracy):
    words = text.split()
    first_hit = words.index(word)  # first occurrence; raises ValueError if absent
    # Clamp the start so a hit near the beginning does not give a negative index
    start = max(0, first_hit - accuracy)

    return ' '.join(words[start:first_hit + accuracy + 1])


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
my_word = "sentence"
print(get_context_around(raw_text, my_word, accuracy=3))  # example lorem ipsum sentence for a Stackoverflow
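
One thing to watch with split(): tokens keep their punctuation, so "question." will not match "question", and the comparison is case-sensitive. The question's own code lowercases the text, so a case-insensitive match may be wanted. The following is only a sketch under that assumption. The function name context_ignoring_punctuation is hypothetical, and returning None when nothing matches is an illustrative choice rather than something the original post specifies.

```python
import string


def context_ignoring_punctuation(text, word, window):
    """Split-based context lookup that compares tokens case-insensitively
    and without leading/trailing punctuation. Returns None if no token matches."""
    tokens = text.split()
    target = word.lower()
    for i, token in enumerate(tokens):
        # "question." -> "question", "ante," -> "ante"
        core = token.strip(string.punctuation).lower()
        if core == target:
            start = max(0, i - window)  # avoid a negative slice start
            return " ".join(tokens[start:i + window + 1])
    return None


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(context_ignoring_punctuation(raw_text, "Question", 3))
# for a Stackoverflow question.
print(context_ignoring_punctuation(raw_text, "missing", 3))
# None
```

The returned snippet keeps the original tokens, punctuation included; only the comparison is normalized.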
