
The code is presented below. It runs with Python 2 on Debian 9.

# -*- coding: utf-8 -*- 
import requests
import bs4

# repairing invalid HTML
s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ
print [typ]

It outputs this:

ТеÑ
нÑкÑм (ÑÑилиÑе)
[u'\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)']
  1. Why does the printed value of the variable differ from the same variable shown inside a list?
  2. How do I get the correct value from the web page:

Технікум (училище)

In interactive Python, it can be obtained from the backslash-escaped codes this way:

>>> print '\xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd1\x96\xd0\xba\xd1\x83\xd0\xbc (\xd1\x83\xd1\x87\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x89\xd0\xb5)'.decode('utf8')
Технікум (училище)

BeautifulSoup has used an incorrect codec. Commented Jun 23, 2018 at 19:31

2 Answers

3

You made the mistake of trusting the character set declared in the server's HTTP headers by using response.text. That attribute gives you Unicode text decoded from the binary response data according to the header information, which is wrong here. You then pass that Unicode string to BeautifulSoup, which assumes it was decoded correctly.
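To make the effect concrete, here is a minimal offline sketch of that mis-decoding. It assumes the page body is UTF-8 and the header codec is ISO-8859-1, as the escaped output in the question suggests; the variable names are only illustrative, and it is written for Python 3 so it can be run without the network.

```python
# The text the page is supposed to contain (taken from the question).
original = u"Технікум (училище)"

# The bytes actually sent by the server, assuming the page is UTF-8.
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 turns every single byte into
# one character, which is what produces the garbled text.
misdecoded = raw_bytes.decode("iso-8859-1")

print(len(original))          # 18 characters
print(len(raw_bytes))         # 33 bytes (each Cyrillic letter takes 2)
print(len(misdecoded))        # 33 characters, one per byte
print(ascii(misdecoded[:2]))  # '\xd0\xa2', the two bytes of the first letter
```

This also explains the difference between the two print statements in the question: printing a list shows the repr of each element, so the wrongly decoded characters appear as \x escapes, while printing the string itself sends those characters to the terminal.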

Instead, use the response.content attribute, which gives you the raw binary string content body:

tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

Now the data remains a binary string and BeautifulSoup will do the decoding for you, based on information in the HTML document itself (there’s a <meta> tag with the correct codec information in there):

>>> import requests, bs4
>>> s = requests.get('http://vstup.info/2017/i2017i483.html')
>>> tmp = s.content.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")
>>> bs = bs4.BeautifulSoup(tmp, "html.parser")
>>> content = bs.select("div#okrArea table#about tr")
>>> typ = content[1].findAll("td")[1].get_text()
>>> print typ
Технікум (училище)
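The idea of taking the codec from the document rather than from the HTTP header can be sketched with the standard library alone. This is a simplified illustration, not BeautifulSoup's actual detection logic, which is more involved; the helper name sniff_meta_charset and the sample page are hypothetical, and the code is written for Python 3.

```python
import re


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found.

    Hypothetical helper for illustration only.
    """
    match = re.search(
        br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)',
        html_bytes,
        re.IGNORECASE,
    )
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# A made-up page with the same kind of <meta> tag as the real one.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8" /></head><body><td>'
    + u"Технікум (училище)".encode("utf-8")
    + b"</td></body></html>"
)

charset = sniff_meta_charset(page)  # read the codec from the document itself
text = page.decode(charset)         # decode the raw bytes with that codec

print(charset)
```

Because the bytes are never decoded with the header's codec, the Cyrillic text in the page survives intact.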

2 Comments

FWIW, the page is a little bit confusing. It has a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> but the response header has Encoding: ISO-8859-1. Yes, I realise that's sadly all too common.
@PM2Ring ah, yes, I missed the .text error there. Corrected.
2

Encode the text as latin1 to turn the wrongly decoded characters back into the original UTF-8 bytes.

Example:

import requests
import bs4

s = requests.get('http://vstup.info/2017/i2017i483.html')
tmp = s.text.replace("</td></tr></td></tr><tr><td>", "</td></tr><tr><td>")

bs = bs4.BeautifulSoup(tmp, "html.parser")

content = bs.find("div", {"id": "okrArea"}).find("table", {"id": "about"}).findAll("tr")

typ = content[1].findAll("td")[1].get_text() #ZVO type

print typ.encode("latin1")

Output:

Технікум (училище)
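This works because latin1 maps each of the 256 byte values to exactly one character, so encoding the garbled text back to latin1 restores the original bytes without loss. In Python 2 the print statement then writes those bytes to the terminal, which displays them correctly if the terminal uses UTF-8. A Python 3 sketch of the same round trip, with a hypothetical helper name, might look like this:

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a decode done with `wrong` when the bytes were really `right`.

    Hypothetical helper; returns the text unchanged if the round trip fails.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded in the assumed way.
        return text


# Reproduce the broken value from the question without the network.
broken = u"Технікум (училище)".encode("utf-8").decode("latin-1")
fixed = repair_mojibake(broken)

print(fixed == u"Технікум (училище)")  # True
```

The repair only undoes this particular mistake. Passing the raw bytes to the parser, as in the other answer, avoids the wrong decoding in the first place.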
