Converting from ascii to utf-8 with Python

Question

I have an XMPP bot written in Python. One of its plugins can execute OS commands and send the output to the user. As far as I know, the output has to be unicode to be sent over the XMPP protocol, so I tried to handle it this way:

output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output, 'utf-8', 'ignore')
bot.send(xmpp.Message(mess.getFrom(), output))

But when Russian characters appear in the output, they are not converted correctly.

sys.getdefaultencoding()

says that the default command prompt encoding is 'ascii', but when I try to do

output.decode('ascii')

in the Python console, I get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1: ordinal not in range(128)

OS: Win XP, Python 2.5.4

PS: Sorry for my English :(

Changing to output.decode('866') helped me. But locale.getpreferredencoding(do_setlocale=True) returned cp1251. Is there any other way to determine the right encoding? This bot should work on Linux as well.
– colriot, Commented Feb 14, 2010 at 22:10
Erm... ASCII is already a perfect subset of UTF-8! Any ASCII text is, by definition, a UTF-8 text. Is the other way around intended here, or is colriot asking to convert some other encoding to UTF-8?
– Arafangion, Commented Mar 2, 2011 at 14:06

Douglas Leeder · Answer · 2010-02-14 21:38:07Z

3

sys.getdefaultencoding() returns Python's default encoding, which is ASCII unless you have changed it. ASCII doesn't support Russian characters.

You need to work out what encoding the actual text is, either manually, or using the locale module.

Typically something like:

import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
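
A minimal sketch of how that value might be used to decode command output. The helper name decode_output is hypothetical, and the 'replace' error handler is an assumption made here so that undecodable bytes show up as U+FFFD instead of being silently dropped, as 'ignore' would do. The preferred encoding is only a guess at what the command actually wrote, so the caller can override it.

```python
import locale


def decode_output(raw, encoding=None):
    """Decode raw command output (bytes) into unicode text.

    decode_output is a hypothetical helper, not part of any library.
    If no encoding is given, fall back to the locale's preferred one.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    # 'replace' marks bytes that cannot be decoded with U+FFFD
    # rather than dropping them.
    return raw.decode(encoding, 'replace')


# The byte 0x92 from the traceback is not valid ASCII, so it is replaced.
marked = decode_output(b'\x92', 'ascii')
# The same kind of byte decodes cleanly once the right code page is named.
letter = decode_output(b'\x8f', 'cp866')
```

Passing the encoding explicitly, for example decode_output(raw, 'cp866'), covers the case where the locale's answer is not the one the console program used.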

answered Feb 14, 2010 at 21:38

1 Comment

John Machin Over a year ago

On Windows, that will give cp1251 in the OP's (Russian) setup even when Python is run at an MS-DOS-emulating command prompt; the OP needs cp866.

John Knoeller · Accepted Answer · 2010-02-15 00:23:10Z

2

ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.

See http://en.wikipedia.org/wiki/Code_page

edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
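
To illustrate that last point, here is a small sketch that encodes a Russian word with code page 866 and then decodes the same bytes with three different codecs. The sample word and variable names are chosen only for this example; the codec names are standard Python ones.

```python
# "Привет" written with escapes so the snippet does not depend on
# the encoding of the source file.
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'

# Pretend these bytes came from a console program writing in cp866.
raw = text.encode('cp866')

as_cp866 = raw.decode('cp866')    # same code page: round-trips
as_cp1251 = raw.decode('cp1251')  # wrong code page: decodes, but to other letters
as_koi8r = raw.decode('koi8_r')   # wrong code page: yet another result

print(as_cp866 == text)
print(as_cp1251 == text)
print(as_koi8r == text)
```

None of the wrong decodings raises an error here, which is why a mismatched code page tends to show up as garbled text rather than as an exception.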

edited Feb 15, 2010 at 0:23

answered Feb 14, 2010 at 21:35

4 Comments

Glenn Maynard Over a year ago

Please use the real name, "KOI8-R", not the opaque Windows name "CP866".

colriot Over a year ago

But results of a.decode('cp866') and a.decode('koi8-r') are different

John Knoeller Over a year ago

If there is a portable identifier for the Cyrillic code page, it would be best to use it. Glenn, do you have a reference for KOI8-R ?

bobince Over a year ago

Code page 866 is nothing like KOI8-R at all, or indeed any other Russian encoding. As a DOS code page you don't generally meet it much any more. See en.wikipedia.org/wiki/Code_page_866 vs en.wikipedia.org/wiki/KOI8-R vs the more usual en.wikipedia.org/wiki/Windows-1251.

John Machin · Answer · 2010-02-18 22:01:17Z

1

You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""

sys.getdefaultencoding says NOTHING about the "command prompt" encoding.

On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.

Update: You say that you still need cp866 in IDLE. Note this:

IDLE 2.6.4      
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>

So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
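
The startup check described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, subprocess is used in place of os.popen so the snippet also runs on current Python versions, and a regular expression picks the last number in the output instead of split()[-1], on the assumption that some localized messages may end with punctuation.

```python
import re
import subprocess
import sys


def codepage_from_chcp(chcp_output):
    """Return a codec name such as 'cp850' from chcp's output, or None.

    The text around the number is locale-dependent, so only the
    last run of digits is used.
    """
    numbers = re.findall(r'\d+', chcp_output)
    if not numbers:
        return None
    return 'cp' + numbers[-1]


def guess_console_encoding():
    """Guess the encoding console commands write in (hypothetical helper)."""
    if sys.platform.startswith('win'):
        try:
            out = subprocess.check_output('chcp', shell=True)
            # Only the digits matter, so non-ASCII bytes can be dropped.
            codepage = codepage_from_chcp(out.decode('ascii', 'ignore'))
            if codepage:
                return codepage
        except (OSError, subprocess.CalledProcessError):
            pass
    # On Unix, or if chcp could not be run, fall back to stdout's encoding.
    return getattr(sys.stdout, 'encoding', None) or 'ascii'
```

This only reports the console's active code page. An individual program can still write its output in a different encoding, so the result is a best guess rather than a guarantee.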

edited Feb 18, 2010 at 22:01

answered Feb 15, 2010 at 0:28

4 Comments

colriot Over a year ago

Or not. How can I find out the default encoding of os.popen(command).read()? Or does it depend on the command?

John Machin Over a year ago

os.popen("command").read() default encoding?? No such concept. The encoding of data being transmitted is chosen by (or forced upon) the WRITER; it has nothing to do with the READER, who needs to know or guess the encoding or obtain the encoding from a reliable source. Why are you asking? Why is sys.stdout.encoding not exactly what you wanted?

colriot Over a year ago

Because it does not matter whether you run Python from the Command Prompt or IDLE. 'cp866' is the right choice in both cases.

colriot Over a year ago

Thanks. This method seemed to be ideal. But when I tried to test the bot with the 'ipconfig' command... So 'cp1251' is the real encoding of the output in this case. Does this mean there is no universal method to solve my problem?

Mark Tolonen · Answer · 2010-02-14 22:48:51Z

0

In Python, 'cp855', 'cp866', 'cp1251', 'iso8859_5' and 'koi8_r' are all different Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command lists the code page used by console commands. That won't necessarily be the same code page that Windows applications use. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.

answered Feb 14, 2010 at 22:48
