Removing HTML tags from a unicode string in Python

Question

I have a string that I scraped from an XML file, and it contains some HTML formatting tags (<b>, <i>, etc.).

Is there a quick and easy way to remove all of these tags from the text?

I tried

str = str.replace("<b>","")

and applied it several times for other tags, but that doesn't work.

Please don't use str as a variable name.

– Mark Byers, Commented Jul 11, 2010 at 19:38
Mark, I'm not; I just typed that for the example.

– Alex B, Commented Jul 11, 2010 at 19:47

user355252 · Accepted Answer · 2010-07-11 19:49:16Z

6

Using lxml.html:

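import lxml.html  # "import lxml" alone does not load the html submodule

# s is the string containing the markup
lxml.html.fromstring(s).text_content()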

This strips all tags and converts all entities to their corresponding characters.
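If lxml is not installed, a roughly similar effect can be sketched with the standard library's html.parser module (Python 3). This is a different technique from the lxml call above, not a drop-in replacement, and the names TextExtractor and strip_all_tags are made up for illustration:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and discards every tag."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_all_tags(markup):
    """Return the text of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, strip_all_tags("<b>bold</b> &amp; <i>italic</i>") should give back the plain text with the entity decoded.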

answered Jul 11, 2010 at 19:49

user355252


1 Comment

Alex B Over a year ago

Thanks! I get AttributeError: 'module' object has no attribute 'html' when I try this though

Achim · Accepted Answer · 2010-07-11 19:35:36Z

1

The answer depends on your exact needs. You could look at regular expressions, but I would advise using BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) if you want to clean up malformed XML or HTML.
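For the regular-expression route, a minimal sketch might look like the following (Python 3). It assumes simple, well-behaved markup: a literal "<" or ">" in the text, or inside an attribute value, will confuse it, which is the kind of case where a real parser is the safer choice. The names TAG_RE and strip_tags_regex are made up for illustration:

```python
import html
import re

# Matches "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text):
    """Remove anything that looks like a tag, then decode entities."""
    without_tags = TAG_RE.sub("", text)
    # html.unescape turns entities such as &amp; into their characters.
    return html.unescape(without_tags)
```

Decoding entities after removing the tags, rather than before, avoids turning an escaped "&lt;b&gt;" into something the pattern would then delete.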

answered Jul 11, 2010 at 19:35

Achim


4 Comments

Stephen Swensen Over a year ago

Doesn't sound like he wants to parse any html, just strip it all away so he is left with plain text (kind of like the innerHTML function).

Alex B Over a year ago

Stephen, you're correct. I'm not trying to parse the string, I just want to remove the HTML formatting (anything inside a <> I want removed completely)

Stephen Swensen Over a year ago

Oops, I meant the innerText property, not the "innerHTML function"

Achim Over a year ago

You will not be able to "just" remove the HTML formatting without more sophisticated parsing. Might be possible for some simple samples, but not for complex ones.

Jesse Dhillon (community wiki, 2 revs) · Accepted Answer · 2010-07-12 03:21:55Z

1

Here's how to use the BeautifulSoup module to strip only some tags while keeping their contents, leaving the rest of the HTML alone (the code below is written for Python 2 and BeautifulSoup 3):

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
  soup = BeautifulSoup(html)
  for tag in soup.findAll(True):
    if tag.name in invalid_tags:
      s = ""
      for c in tag.contents:
        if type(c) != NavigableString:
          c = strip_tags(unicode(c), invalid_tags)
        s += unicode(c)
      tag.replaceWith(s)
  return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

Result:

<p>Good, bad, and ugly</p>
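On Python 3, where the BeautifulSoup 3 import and the unicode built-in used above are not available, the same idea can be sketched with the standard library's html.parser instead. This swaps in a different technique (re-emitting the markup while skipping the unwanted tags) rather than porting the code above, and the names SelectiveTagStripper and strip_selected_tags are made up for illustration:

```python
from html.parser import HTMLParser


class SelectiveTagStripper(HTMLParser):
    """Hypothetical helper: re-emits markup, skipping the listed tags."""

    def __init__(self, invalid_tags):
        # convert_charrefs=False keeps entities separate from the text so
        # they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.invalid_tags = {t.lower() for t in invalid_tags}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.invalid_tags:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag not in self.invalid_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.invalid_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    # Comments, doctypes and processing instructions are not handled,
    # so this sketch silently drops them.


def strip_selected_tags(markup, invalid_tags):
    parser = SelectiveTagStripper(invalid_tags)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


html_in = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(strip_selected_tags(html_in, ["b", "i", "u"]))
```

Because the parser walks the input as a flat stream of events, nested unwanted tags need no recursion: each start or end tag is simply skipped or written out as it arrives.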

edited Jul 12, 2010 at 3:21

community wiki

2 revs
Jesse Dhillon
