Decoding HTML entities with Python

Question

I'm trying to decode HTML entities from NYTimes.com, and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.

This question seems to come up a lot with no good solution. Makes me want to write something of my own...
– Kenan Banks, Commented Jul 30, 2009 at 19:49
Ha, I think that's the best solution I've found thus far. I might actually try to do that myself. If I do, I'll post my solution.
– KeyboardInterrupt, Commented Jul 30, 2009 at 20:01

jfs · Accepted Answer · 2014-10-07 19:13:52Z

22

>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser&#8217;s Blunt Memo on Iraq: '
...                             'Time &#8216;to Go Home&#8217;')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.
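
For readers on Python 3.4 or later, the documented html.unescape() mentioned above can be sketched as follows. This assumes Python 3, unlike the Python 2 session in the answer; the variable names are only illustrative.

```python
import html

headline = ("U.S. Adviser&#8217;s Blunt Memo on Iraq: "
            "Time &#8216;to Go Home&#8217;")

# html.unescape() decodes named entities as well as decimal and
# hexadecimal numeric character references.
decoded = html.unescape(headline)
print(decoded)
```

The result is an ordinary str, so no separate decoding step is needed before further text processing.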

edited Oct 7, 2014 at 19:13

answered Dec 21, 2013 at 3:37

jfs

1 Comment

Daniel Koverman Over a year ago

For future users, this answer appears to have so few upvotes simply because it came 4 years later than the existing answers. It seems to be at least as good an answer. This answer has the advantage that it is simple (unlike writing your own function to interpret HTML standards using a regex) and uses a standard library (unlike BeautifulSoup). It has the disadvantage that it is using an undocumented function.

Florian · Accepted Answer · 2022-02-03 21:34:48Z

20

Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example &#160; &#xa0; &nbsp; all mean U+00A0 NO-BREAK SPACE.

&#160; (the type you have) is a "numeric character reference" (decimal).
&#xa0; is a "numeric character reference" (hexadecimal).
&nbsp; is an entity.

Further reading: http://htmlhelp.com/reference/html40/entities/
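
As a quick check of the three varieties described above, the following sketch (assuming Python 3 and its standard html module, which the answer itself does not use) decodes each form and confirms that all of them yield U+00A0. The variable names are hypothetical.

```python
import html
from html.entities import name2codepoint

# Decimal reference, hexadecimal reference, and named entity.
forms = ["&#160;", "&#xa0;", "&nbsp;"]

decoded = [html.unescape(form) for form in forms]

# Every form should decode to the single character U+00A0.
all_nbsp = all(ch == "\u00a0" for ch in decoded)
print(all_nbsp)

# The named form is resolved through a lookup table of entity names.
print(name2codepoint["nbsp"])
```

Only the named form needs a lookup table; the two numeric forms carry the code point directly.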

Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html

edited Feb 3, 2022 at 21:34

Florian

answered Jul 30, 2009 at 23:18

John Machin

Glenn Maynard · Accepted Answer · 2009-07-30 20:37:46Z

18

This does work:

from BeautifulSoup import BeautifulStoneSoup
s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

If you want a byte string instead of a Unicode object, you'll need to encode it with an encoding that supports the characters being used; ISO-8859-1 doesn't:

result = decoded.encode("UTF-8")
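
The point about encodings can be illustrated with a small sketch. It assumes Python 3, where the decoded text is a str and encoding produces bytes; the sample string is a shortened, hypothetical stand-in for the decoded headline.

```python
decoded = "U.S. Adviser\u2019s Blunt Memo"

# ISO-8859-1 only covers U+0000 to U+00FF, so U+2019 cannot be encoded.
try:
    decoded.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode code point.
utf8_bytes = decoded.encode("utf-8")
print(latin1_ok, utf8_bytes)
```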

It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

edited Jul 30, 2009 at 20:37

answered Jul 30, 2009 at 20:05

Glenn Maynard

5 Comments

Ned Deily Over a year ago

lxml, alas also not in the standard library, also provides a Beautiful Soup parser (and lots more) with somewhat less "creative" names.

John Machin Over a year ago

Support for entity decoding is in the standard library (module htmlentitydefs). What the OP has are (decimal) numeric character references, not entities.

Beni Cherniavsky-Paskin Over a year ago

Works as well with BeautifulSoup instead of BeautifulStoneSoup - one step less "creative" :)

TankorSmash Over a year ago

' names should not be "creative" ' is that a stone cold rule, or just personal choice?

Glenn Maynard Over a year ago

@TankorSmash: There's no authority--beyond the compiler--forcing you to follow any coding standards at all, but this seems like common sense to me.

Evan Fosmark · Accepted Answer · 2009-07-31 06:01:00Z

6

Try this:

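import re

def _callback(match):
    # Python 2 code: unichr() turns a code point into a unicode character.
    code_point = match.group(1)
    try:
        return unichr(int(code_point))
    except (ValueError, OverflowError):
        # Leave out-of-range references untouched instead of dropping "&#".
        return match.group(0)

def decode_unicode_references(data):
    # Accept ";" or a following whitespace character as the terminator.
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
print decode_unicode_references(data)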
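
A Python 3 variant of the same regex idea, extended to hexadecimal references and named entities, might look like the sketch below. It is an illustration rather than the answer's code: the names _REFERENCE, _replace, and decode_references are hypothetical, and unlike the regex above it requires the terminating semicolon. Where html.unescape() is available, that is probably the simpler choice.

```python
import re
from html.entities import name2codepoint

# One pattern for decimal (&#8217;), hexadecimal (&#x2019;),
# and named (&rsquo;) forms; the semicolon is required here.
_REFERENCE = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));")


def _replace(match):
    decimal, hexadecimal, name = match.groups()
    try:
        if decimal is not None:
            return chr(int(decimal))
        if hexadecimal is not None:
            return chr(int(hexadecimal, 16))
        return chr(name2codepoint[name])
    except (ValueError, OverflowError, KeyError):
        # Unknown names and out-of-range numbers are left as they were.
        return match.group(0)


def decode_references(text):
    return _REFERENCE.sub(_replace, text)


print(decode_references("U.S. Adviser&#8217;s Blunt Memo"))
```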

edited Jul 31, 2009 at 6:01

answered Jul 30, 2009 at 19:50

Evan Fosmark

5 Comments

KeyboardInterrupt Over a year ago

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 12: character maps to <undefined> This seems to be the error I keep getting regardless of what I try.

Evan Fosmark Over a year ago

Could you provide more code, then? I just tried it with the function I wrote and the character 2019 works fine. It shows up as: ߣ

John Machin Over a year ago

A few questions on your regexp: (1) Shouldn't it be \d instead of \w? The regexp will match &#xa0; and &nbsp; but then it will crash in int() (2) Allowing the character reference (it's NOT an entity) to end in a whitespace instead of ';' seems very tolerant -- shouldn't you mention this? (3) Wouldn't the last part be better written as [;\s]?

Evan Fosmark Over a year ago

John, you were correct on point one partially. It won't match &nbsp; since that doesn't start with &#, but yes it should have been \d. Regarding point two on allowing it to end with whitespace, it should be noted that even though it isn't pretty, it's still supported. I've updated the code in the following way: (1) Changed it to \d, (2) made the callback a bit stronger, and (3) used a lookahead assertion for ending whitespace instead of absorbing it like it was.

John Machin Over a year ago

Evan, thanks for the enlightenment, especially about the tolerance of whitespace, which I didn't know about. I got some more clues by looking in the HTML 4.01 and 2.0 specs. They referred to the SGML standard (ISO 8879). Cost = CHF 238(!) so I didn't read it, but HTML 2.0 commented that ';' is only needed when the character following the reference would otherwise be part of the name. Experiments with FF, IE and Opera using space - / X A and & instead of ; all gave the same result: they terminate the reference and are not swallowed. I'm looking forward to your updated solution ;-)
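
The UnicodeEncodeError quoted in the comments above is an encoding error, which suggests the reference was decoded correctly and the failure likely happened when the result was written to a console that cannot represent U+2019. The sketch below reproduces that kind of failure in Python 3. The cp437 code page is an assumption chosen for illustration, since the comment does not say which encoding was in use.

```python
text = "Adviser\u2019s"

# cp437 is a charmap-based code page with no slot for U+2019.
try:
    text.encode("cp437")
    failed = False
except UnicodeEncodeError:
    failed = True

# Substituting unencodable characters avoids the exception,
# at the cost of losing the original character.
safe = text.encode("cp437", errors="replace")
print(failed, safe)
```

Writing the output as UTF-8 instead keeps the character intact where the terminal supports it.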
