
The words in "wordslist" and the text I'm searching are in Cyrillic. The text is encoded in UTF-8 (as set in Notepad++). I need Python to match a word in the text and return everything after that word up to a full stop followed by a newline.

EDIT

with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line) 

wordslist = map(str.strip, wordslist)

/EDIT

for i in wordslist:
    print i #so far, so good, I get Cyrillic
    wantedtext = re.findall(i+".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext

"wantedtext" is printed and saved as escaped bytes such as "\xd0\xb2" (etc.).

What I tried:

The question "Convert bytes to a python string" is different, because no variable is involved there. Also, the solution from its accepted answer

wantedtext.decode('utf-8')

didn't work; the result was the same. The solution from here didn't help either.

EDIT: Revised code, returning "[]".

with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 

for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
        print my_file_test #works, prints cyrillic characters, but...


        wantedtext = re.findall(i+".*\.\r\n", my_file_test)
        wantedtext = str(wantedtext)

        print wantedtext #returns []

(Added after a comment below: This code works if you erase \r from the regular expression.)

  • Where's the bit of code where you load the wordlist and save it? Commented Feb 13, 2017 at 17:50

1 Answer


Python 2.x only

Your search is probably failing because you're mixing byte strings (str) with Unicode strings (unicode), or byte strings that use different encodings. If you don't know the difference between a Unicode string and a str, see: https://stackoverflow.com/a/35444608/1554386
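To make the distinction concrete, here is a minimal sketch. The sample bytes and variable names are illustrative assumptions, not taken from the question's files. It suggests why the output looked like "\xd0\xb2": the bytes were never decoded, and calling str() on a list of matches gives the escaped repr of each element rather than readable text.

```python
# -*- coding: utf-8 -*-
import re

# Illustrative sample: UTF-8 bytes, as returned by open(..., 'rb').read()
raw = b'\xd0\xb2 \xd0\xb4.'

# Decoding turns the six bytes into four characters of Unicode text
text = raw.decode('utf-8')

# Searching bytes returns bytes; str() of the list shows escaped bytes
byte_matches = re.findall(b'\xd0\xb2.*', raw)
escaped = str(byte_matches)

# Searching Unicode text returns Unicode text; join the matches instead
# of calling str() on the list to keep them readable
text_matches = re.findall(u'\u0432.*', text)
readable = u'\n'.join(text_matches)
```

Joining the matches, rather than converting the whole list with str(), keeps the result as text that can be printed or written back with io.open(..., 'w', encoding='utf-8').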

Don't start decoding stuff unless you know what you're doing. It's not voodoo :)

You need to get all your text into Unicode objects first.

  1. Do the file read on its own line, separate from the regex call; it's easier to follow
  2. Decode your text file. Use io.open(), which supports Python 3-style decoding. I'm going to assume your text file is UTF-8 (we'll soon find out if it's not):

    with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
    

    my_file_test is now a Unicode str

  3. Now you can do:

    # finds lines beginning with i and ending in "."
    # re.escape() keeps any regex metacharacters in i literal
    regex = u'^{i}.*?\.$'.format(i=re.escape(i))
    wantedtext = re.findall(regex, my_file_test, re.M)
    
  4. Look at wordslist. You don't say how you build it, but you need to make sure its entries are Unicode strings too. If you read it from a file, use the same io.open() as above.
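The matching step from the list above can be sketched on an in-memory string. This is an illustration under assumptions: the helper name find_lines and the sample text are hypothetical, and re.escape() is added so that a word containing regex metacharacters is matched literally.

```python
# -*- coding: utf-8 -*-
import re


def find_lines(word, text):
    """Return every line of text that starts with word and ends with a full stop."""
    # With re.M, ^ and $ match at the start and end of each line,
    # so no explicit \r\n is needed in the pattern.
    pattern = u'^{w}.*?\\.$'.format(w=re.escape(word))
    return re.findall(pattern, text, re.M)


# Hypothetical sample: only the first and third lines should match
sample = u'вода течёт.\nземля\nвода.\n'
lines = find_lines(u'вода', sample)
```

Because "." does not match a newline by default, each match stays within a single line.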

Edit:

For wordslist, you can decode and read the file into a list while removing line feeds in one go:

with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 
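
Putting the pieces together, the following is a rough end-to-end sketch, not a drop-in replacement. The file names words.txt and text.txt, the helper names, and the sample contents are hypothetical stand-ins for the elided paths. The files are written with \r\n line endings to show that io.open() in text mode translates them to \n on read, which is likely why a pattern ending in \r\n finds nothing.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import shutil
import tempfile


def read_text(path):
    """Read a UTF-8 file as Unicode text (line endings become \\n)."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


def find_after_words(words_path, text_path):
    """Map each word to the lines that start with it and end with a full stop."""
    words = read_text(words_path).splitlines()
    text = read_text(text_path)  # read once, outside the loop
    results = {}
    for word in words:
        if not word:
            continue  # skip blank lines in the word list
        pattern = u'^{w}.*?\\.$'.format(w=re.escape(word))
        results[word] = re.findall(pattern, text, re.M)
    return results


# Hypothetical sample files with Windows line endings
tmp_dir = tempfile.mkdtemp()
try:
    words_path = os.path.join(tmp_dir, 'words.txt')
    text_path = os.path.join(tmp_dir, 'text.txt')
    with open(words_path, 'wb') as f:
        f.write(u'вода\r\nогонь\r\n'.encode('utf-8'))
    with open(text_path, 'wb') as f:
        f.write(u'вода течёт.\r\nземля\r\nогонь горит.\r\n'.encode('utf-8'))
    text = read_text(text_path)
    found = find_after_words(words_path, text_path)
finally:
    shutil.rmtree(tmp_dir)
```

Reading the text file once before the loop avoids reopening it for every word in the list.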

6 Comments

I get NameError: name 'io' is not defined. Do I need a package to use io.open?
Yes. On Stack Overflow the import is often implied when you see a module used without one. Add: import io
The program runs and prints a word from the list, but then shows no result: [].
Sorry, I messed up the wordlist logic during my edit. Please see the update with io.open(). If it still doesn't work, you'll have to provide more information.
It looks like you're trying to find all lines that start with a word and end with a full stop, right? Because you're now using a file reader that supports universal newlines, you need to use proper regex line anchors.