Python output encoding

Question

The code below runs under Python 2 on Debian 9.

# -*- coding: utf-8 -*- 
import requests
import bs4

# repairing invalid HTML
s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ
print [typ]

It outputs this:

Ð¢ÐµÑ
Ð½ÑÐºÑÐ¼ (ÑÑÐ¸Ð»Ð¸ÑÐµ)
[u'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)']

Why does printing the variable give different output from printing a list that contains it?
How can I get the correct value from the web page:

Технікум (училище)

In the interactive interpreter, I can recover it from the backslash-escaped bytes like this:

>>> print '\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)'.decode('utf8')
Технікум (училище)

BeautifulSoup has used an incorrect codec.

– Martijn Pieters, Commented Jun 23, 2018 at 19:31

Martijn Pieters · Accepted Answer · 2018-06-23 23:58:49Z

3

You made the mistake of trusting the character set that the server declared for the HTTP response, by using response.text. This gives you Unicode text decoded from the binary response data using the header information, which here is wrong. You then give that Unicode string to BeautifulSoup, which assumes it was decoded correctly.
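
The effect can be reproduced without any network access. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in the question; the names raw, wrong and right are made up for illustration, and the bytes are copied from the escaped output in the question.

```python
# The UTF-8 bytes of the expected text, copied from the question's escaped output.
raw = (b'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc '
       b'(\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)')

# Decoding with ISO-8859-1 never raises an error: every byte simply
# becomes one (wrong) character. This mirrors what response.text did here.
wrong = raw.decode('iso-8859-1')

# Decoding with the codec the page really uses yields the Cyrillic text.
right = raw.decode('utf-8')

# Same data, different lengths: one character per byte versus one per letter.
print(len(raw), len(wrong), len(right))

# ascii() escapes non-ASCII characters, much like the list output in the question.
print(ascii(wrong[:4]))
```

This also hints at why the two print statements in the question differ: printing the string writes the mis-decoded characters themselves, while printing the list shows the escaped representation of the same string.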

Instead, use the response.content attribute, which gives you the raw binary string content body:

tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

Now the data remains a binary string and BeautifulSoup will do the decoding for you, based on information in the HTML document itself (there’s a <meta> tag with the correct codec information in there):

>>> import requests, bs4
>>> s = requests.get('http://vstup.info/2017/i2017i483.html')
>>> tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
>>> bs = bs4.BeautifulSoup(tmp, "html.parser")
>>> content = bs.select("div#okrArea table#about tr")
>>> typ = content[1].findAll("td")[1].get_text()
>>> print typ
Технікум (училище)
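
The session above is Python 2, where a plain string literal is already a byte string. As a hedged sketch of the same repair step in a form that also runs on Python 3, the arguments to replace() need to be bytes literals. The body value below is a hypothetical stand-in for response.content, not the real page.

```python
# Hypothetical stand-in for response.content: raw bytes, not yet decoded.
body = (b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
        b'<table><tr><td>a</td></tr></td></tr><tr><td>'
        b'\xd0\xa2\xd0\xb5\xd1\x85</td></tr></table>')

# On a bytes object, both arguments to replace() must be bytes as well.
repaired = body.replace(b"</td></tr></td></tr><tr><td>", b"</td></tr><tr><td>")

# The markup is repaired while the data is still undecoded, so the choice
# of codec can be left to whatever parses the document next.
print(type(repaired).__name__)

# For illustration only: decoding by hand with the codec the <meta> tag declares.
text = repaired.decode("utf-8")
```

Passing repaired (the bytes, not text) to BeautifulSoup keeps the decoding decision with the parser, as described above.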

edited Jun 23, 2018 at 23:58

answered Jun 23, 2018 at 19:34


2 Comments

PM 2Ring Over a year ago

FWIW, the page is a little bit confusing. It has a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> but the response header has Encoding: ISO-8859-1. Yes, I realise that's sadly all too common.

Martijn Pieters Over a year ago

@PM2Ring ah, yes, I missed the .text error there. Corrected.

Rakesh · Accepted Answer · 2018-06-23 19:33:48Z

2

Encode the extracted text as latin1 before printing it.

Example:

import requests
import bs4

s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ.encode("latin1")

Output:

Технікум (училище)
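
Why this works can be sketched as a round trip. The snippet below is an illustration under assumptions, written for Python 3, with made-up variable names and only the first word of the text; it is not part of the original answer.

```python
# Mis-decoded text as in the question's list output: each character
# is really one byte of the UTF-8 encoding.
garbled = '\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc'

# latin1 maps code points 0-255 back to the identical byte values,
# so encoding undoes the wrong decoding step exactly.
raw = garbled.encode('latin1')

# The recovered bytes can now be decoded with the codec the page really uses.
fixed = raw.decode('utf-8')

print(len(garbled), len(raw), len(fixed))
```

In the Python 2 code above, print typ.encode("latin1") writes those recovered bytes directly, and the result looks right only when the terminal interprets them as UTF-8. The round trip also depends on the wrong decoding having been ISO-8859-1 in the first place.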

answered Jun 23, 2018 at 19:33

