Regex search with variable in Python 2.7 returns bytes instead of decoded text

Question

The words in "wordslist" and the text I'm searching are in Cyrillic. The text is encoded in UTF-8 (as set in Notepad++). I need Python to match a word in the text and return everything after the word up to a full stop followed by a newline.

EDIT

with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line) 

wordslist = map(str.strip, wordslist)

/EDIT

for i in wordslist:
    print i #so far, so good, I get Cyrillic
    wantedtext = re.findall(i+".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext

"Wantedtext" shows and saves as "\xd0\xb2" (etc.).

What I tried:

This question is different, because there is no variable involved: Convert bytes to a python string. Also, the solution from the chosen answer

wantedtext.decode('utf-8')

didn't work, the result was the same. The solution from here didn't help either.

EDIT: Revised code, returning "[]".

with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 

for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
        print my_file_test #works, prints cyrillic characters, but...


        wantedtext = re.findall(i+".*\.\r\n", my_file_test)
        wantedtext = str(wantedtext)

        print wantedtext #returns []

(Added after a comment below: This code works if you erase \r from the regular expression.)

Where's the bit of code where you load the wordlist and save it?
– Alastair McCormack, Commented Feb 13, 2017 at 17:50

Community · Accepted Answer · 2017-05-23 12:01:35Z

0

Python 2.x only

Your find is probably not working because you're mixing strs and Unicode strs, or strs containing different encodings. If you don't know the difference between a Unicode str and a str, see: https://stackoverflow.com/a/35444608/1554386
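To make the str versus Unicode distinction concrete, here is a minimal sketch. The sample word is made up for illustration, and the snippet is written to run on both Python 2.7 and 3. It also hints at why the question's output showed "\xd0\xb2": calling str() on a list renders each element with its escaped representation, so the bytes are never decoded.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u'вода'                 # Unicode text: 4 characters (sample word, assumed)
raw = text.encode('utf-8')     # the same word as UTF-8 bytes: 8 bytes

# Each Cyrillic letter takes two bytes in UTF-8.
print(len(text), len(raw))

# Decoding the bytes gives back the original Unicode text.
print(raw.decode('utf-8') == text)

# str() of a list shows the escaped form of each element,
# which is where output such as \xd0\xb2 comes from.
as_list_text = str([raw])
print('\\xd0' in as_list_text)
```

Printing each match individually, rather than converting the whole findall() list with str(), avoids the escaped output.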

Don't start decoding stuff unless you know what you're doing. It's not voodoo :)

You need to get all your text into Unicode objects first.

Split your read into a separate line - it's easier to read.
Decode your text file. Use io.open(), which supports Python 3-style decoding. I'm going to assume your text file is UTF-8 (we'll soon find out if it's not):
```
with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
    my_file_test = my_file.read()
```
my_file_test is now a Unicode str

Now you can do:

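# finds lines beginning with i and ending in "."
# (".*?" matches the rest of the line; without the "." the "*?" would only repeat the last character of i)
regex = u'^{i}.*?\.$'.format(i=i)
wantedtext = re.findall(regex, my_file_test, re.M)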
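The following sketch illustrates why a pattern ending in "\r\n" returns [] once the file is read through io.open() in text mode: with the default newline handling, "\r\n" is translated to "\n" on read. The sample text and the temporary file are assumptions for illustration only, and re.escape() is an added precaution, not part of the original answer.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import re
import tempfile

# Hypothetical sample: two lines with Windows line endings, stored as UTF-8.
sample = u'вода течёт быстро.\r\nогонь горит.\r\n'.encode('utf-8')

fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(sample)
    # Text mode with the default newline=None turns '\r\n' into '\n'.
    with io.open(path, 'r', encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

word = u'вода'

# The pattern from the question still expects '\r\n', so it finds nothing.
old_style = re.findall(word + u'.*\\.\r\n', text)

# Line anchors with re.M do not depend on the line-ending style.
new_style = re.findall(u'^{0}.*?\\.$'.format(re.escape(word)), text, re.M)

print(u'\r' in text)                    # whether any '\r' survived the read
print(len(old_style), len(new_style))   # number of matches per pattern
```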

Look at wordslist. You don't say what you do with it but you need to make sure it's a Unicode str too. If you read from a file, use the same io.open from above.

Edit:

For wordslist, you can decode and read the file into a list while removing line feeds in one go:

with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
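Putting the steps together, a rough sketch might look like this. The helper name find_lines and the in-memory sample strings are hypothetical stand-ins for the decoded contents of the two files; re.escape() is an added assumption so that words containing regex metacharacters are matched literally.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re


def find_lines(words, text):
    """Map each word to the lines that start with it and end in a full stop."""
    results = {}
    for word in words:
        pattern = u'^{0}.*?\\.$'.format(re.escape(word))
        results[word] = re.findall(pattern, text, re.M)
    return results


# Stand-ins for what io.open(..., encoding='utf-8').read() would return.
words = u'вода\nогонь\n'.splitlines()
text = u'вода течёт.\nземля\nогонь горит.\n'

found = find_lines(words, text)
print(len(found[u'вода']), len(found[u'огонь']))
```

Keeping the matches as a list of Unicode strings, instead of passing the list through str(), means each match can later be printed or written out as readable Cyrillic.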

edited May 23, 2017 at 12:01

answered Feb 13, 2017 at 18:05

Alastair McCormack

6 Comments

Tag Over a year ago

I get NameError: name 'io' is not defined. Do I need a package to use io.open?

Alastair McCormack Over a year ago

Yes, on Stack Overflow the import is usually implied when you see an unqualified module name. Add import io

Tag Over a year ago

The program is running, it prints a word from a list, and then no result - [].

Alastair McCormack Over a year ago

sorry, I screwed the wordlist logic during my edit. Please see update with io.open(). If it still doesn't work, then you'll have to provide more information

Alastair McCormack Over a year ago

It looks like you're trying to find all lines which start with a word and end with a ".". Right? Because you're now using a file reader that supports universal newlines, you need to use proper regex line markers
