I have a text in Polish from which I want to filter out all non-Polish letters, but the problem is that the Polish-specific letters disappear as well:

# coding: utf-8
import re

_NOT_LETTERS = re.compile('[^a-ząćęłóńśżź]+')

text = u'dzień dobry i wszystkiego najlepszego życzę'

data = _NOT_LETTERS.sub(' ', text)

print data

and the result is

 dzie dobry i wszystkiego najlepszego ycz 

instead of the expected

dzień dobry i wszystkiego najlepszego życzę

How can I fix this? I receive the variable text from a third-party library.

  • The pattern must be a unicode string too: re.compile(u'[^a-ząćęłóńśżź]+'). Otherwise multibyte characters are seen as separate bytes (i.e. one byte, one char). Commented May 24, 2016 at 22:53
  • Great, it works. If you want, add an answer and I'll accept it. Commented May 24, 2016 at 22:59

1 Answer

Accented letters are outside the ASCII range and take several bytes when encoded in UTF-8. For example, the character:

U+0144  ń       LATIN SMALL LETTER N WITH ACUTE

is encoded as two bytes: c5 84
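As a quick check of the byte count above, here is a minimal sketch that encodes ń and prints the resulting bytes in hex. The variable names are only illustrative, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import binascii

n_acute = u'\u0144'  # LATIN SMALL LETTER N WITH ACUTE
encoded = n_acute.encode('utf-8')

code_points = len(n_acute)   # length of the unicode string
byte_count = len(encoded)    # length of the UTF-8 byte string
hex_bytes = binascii.hexlify(encoded).decode('ascii')

print('%d code point, %d bytes: %s' % (code_points, byte_count, hex_bytes))
```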

When you write a string literal without marking it as unicode, each byte is treated as a separate character: you get the character \xc5 and the character \x84, not the single character ń (U+0144), which is therefore never recognized.

In Python 2.7 you need to mark the string literal as unicode with the u prefix; otherwise each multibyte character is seen as a sequence of single bytes. You can check this yourself:

>>> text = u'dzień'
>>> [c for c in text]
[u'd', u'z', u'i', u'e', u'\u0144']

>>> text = 'dzień'
>>> [c for c in text]
['d', 'z', 'i', 'e', '\xc5', '\x84']
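To see how this byte-by-byte view affects the regex, the sketch below rebuilds what the non-unicode pattern effectively contains. It decodes the UTF-8 bytes of the Polish letters as Latin-1, so every byte becomes its own character. This is an illustration of Python 2.7's behaviour, under the assumption that each byte of a byte-string pattern is compared as a code point 0-255. The names byte_chars and broken are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

polish = u'ąćęłóńśżź'

# Each UTF-8 byte turned into one character, e.g. u'\xc5' and u'\x84' for ń.
byte_chars = polish.encode('utf-8').decode('latin-1')

# The character class now holds the individual bytes, not the letters.
broken = re.compile(u'[^a-z' + byte_chars + u']+')

text = u'dzień dobry i wszystkiego najlepszego życzę'
broken_result = broken.sub(u' ', text)
print(broken_result)
```

None of the Polish letters in the subject is in that class, so ń, ż and ę are treated as non-letters and replaced along with the spaces.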

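The Polish letters are not recognized because your pattern, unlike your subject string, is not a unicode string, so its character class contains the individual bytes and not the letters themselves. You need to write: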

re.compile(u'[^a-ząćęłóńśżź]+')
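Put together, the corrected script might look like the sketch below. The to_unicode helper is hypothetical and not part of the question: it assumes that, if the third-party library ever returns a byte string, that string is UTF-8 encoded. The code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: each Polish letter is a single character in the class.
_NOT_LETTERS = re.compile(u'[^a-ząćęłóńśżź]+')


def to_unicode(value, encoding='utf-8'):
    # Hypothetical helper: decode byte strings (encoding is an assumption),
    # and return unicode strings unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


text = u'dzień dobry i wszystkiego najlepszego życzę'
data = _NOT_LETTERS.sub(u' ', to_unicode(text))
print(data)

# A byte string, as a library might return it, is decoded first.
raw = u'dzień, dobry!'.encode('utf-8')
cleaned = _NOT_LETTERS.sub(u' ', to_unicode(raw))
print(cleaned)
```

Decoding at the boundary keeps both the pattern and the subject as unicode strings, which is the condition the fix depends on.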