I have several strings that each certainly contain myWord (multiple times in some cases; only the first occurrence should be handled), but the strings differ in length. Some of them contain hundreds of words, others only a few.

I would like to extract a snippet from the text. The rule is the following: the snippet should contain myWord and the X words before and after it.

Something like this:

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."

myWord = "sentence"

Let's say I would like to get the word 'sentence' plus/minus 3 words, like this:

"example lorem ipsum sentence for a Stackoverflow"

I have a working solution, but it cuts the snippet by the number of characters instead of the number of words before/after myWord. So my question is: is there a more suitable solution, maybe a built-in Python function, to achieve my goal?

The current solution I use:

myWord = "mollis"
rawText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sit amet arcu vulputate, sodales arcu non, finibus odio. Aliquam sed tincidunt nisi, eu scelerisque lectus. Curabitur in nibh enim. Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida condimentum posuere. Sed et arcu finibus felis auctor mollis et id risus. Nam urna tellus, ultricies a aliquam at, euismod et erat. Cras pretium venenatis ornare. Donec pulvinar dui eu dui facilisis commodo. Vivamus eget ultrices turpis, vel egestas lacus."

# The character index of the first occurrence (case-insensitive; -1 if absent)
wordIndexNumber = rawText.lower().find(myWord.lower())

# The total length of the text (in chars)
textLength = len(rawText)

textPart2 = len(rawText)-wordIndexNumber

if wordIndexNumber < 80:
    textIndex1 = 0
else:
    textIndex1 = wordIndexNumber - 80

if textPart2 < 80:
    textIndex2 = textLength
else:
    textIndex2 = wordIndexNumber + 80

snippet = rawText[textIndex1:textIndex2]

print (snippet)
  • Use split() on your string, then apply your character-based solution to the resulting list. Commented Jun 26, 2018 at 13:15
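
The comment's suggestion could be sketched roughly as follows: split the text into words, then apply the same clamping logic as the character-based version to word indices. The function name `snippet_by_words` and its `window` parameter are made up for illustration, and the sketch assumes the word appears as a standalone whitespace-separated token.

```python
def snippet_by_words(text, word, window):
    """Return `word` plus up to `window` words on each side (hypothetical helper)."""
    words = text.split()
    lowered = [w.lower() for w in words]
    # list.index returns the first occurrence and raises ValueError if absent
    idx = lowered.index(word.lower())
    # Clamp both ends, mirroring the textIndex1/textIndex2 checks in the question
    start = max(0, idx - window)
    end = min(len(words), idx + window + 1)
    return " ".join(words[start:end])


rawText = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(snippet_by_words(rawText, "sentence", 3))
```

Clamping the end index is not strictly required, since Python slices tolerate an end past the list length, but it keeps the logic parallel to the original code.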

2 Answers

This is one approach: split the text into words and slice the resulting list.

Demo:

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."
myWord = "sentence"
rawTextList = rawText.split()
wordIndex = rawTextList.index(myWord)  # index of the first occurrence
# max() keeps the start from going negative when myWord is near the beginning
frontVal = " ".join(rawTextList[max(0, wordIndex - 3):wordIndex])
backVal = " ".join(rawTextList[wordIndex:wordIndex + 4])

print("{} {}".format(frontVal, backVal))

Output:

example lorem ipsum sentence for a Stackoverflow
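
One limitation of matching with list.index is that a token such as "sentence," or "Sentence" is not equal to "sentence", so the lookup raises ValueError. A minimal sketch of a more tolerant lookup, assuming it is acceptable to ignore case and surrounding punctuation when comparing (the helper name `snippet_around` is hypothetical):

```python
import re


def snippet_around(text, word, window=3):
    """Return the first match of `word` with `window` words of context, or None."""
    tokens = text.split()
    target = word.lower()
    for i, token in enumerate(tokens):
        # Strip leading/trailing non-word characters before comparing
        if re.sub(r"^\W+|\W+$", "", token).lower() == target:
            start = max(0, i - window)
            return " ".join(tokens[start:i + window + 1])
    return None  # word not found


print(snippet_around("Is this the Sentence, or another sentence entirely?", "sentence", 2))
```

The original tokens are joined unchanged, so punctuation attached to the words is preserved in the snippet.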
Here is a solution using list slicing:

def get_context_around(text, word, accuracy):
    words = text.split()
    first_hit = words.index(word)  # raises ValueError if word is absent
    start = max(0, first_hit - accuracy)  # avoid a negative start index

    return ' '.join(words[start:first_hit + accuracy + 1])


raw_text= "This is an example lorem ipsum sentence for a Stackoverflow question."
my_word = "sentence"
print(get_context_around(raw_text, my_word, accuracy=3)) # example lorem ipsum sentence for a Stackoverflow
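
As an alternative to splitting, the same window can be expressed as a single regular expression. This is only a sketch under a few assumptions: words are whitespace-delimited runs of non-space characters, matching is case-insensitive, and the name `regex_snippet` is invented for illustration.

```python
import re


def regex_snippet(text, word, window=3):
    """Match up to `window` words before and after the first `word` (hypothetical helper)."""
    pattern = (
        r"(?:\S+\s+){0,%d}" % window                # up to `window` preceding words
        + r"\b" + re.escape(word) + r"\b\S*"        # the word plus attached punctuation
        + r"(?:\s+\S+){0,%d}" % window              # up to `window` following words
    )
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(0) if match else None


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(regex_snippet(raw_text, "sentence"))
```

Unlike the split-and-join versions, this returns the matched span verbatim, so the original spacing between words is kept. The word boundary check is looser than a token comparison (it can match inside a hyphenated token), so the list-based approach may be preferable when exact tokens matter.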
