3

I have a string that I scraped from an XML file, and it contains some HTML formatting tags (<b>, <i>, etc.).

Is there a quick and easy way to remove all of these tags from the text?

I tried

str = str.replace("<b>","")

and applied it several times for the other tags, but that doesn't work.

2
  • Please don't use str as a variable name. Commented Jul 11, 2010 at 19:38
  • Mark, I'm not, I just typed that for the example Commented Jul 11, 2010 at 19:47

3 Answers

6

Using lxml.html:

import lxml.html  # "import lxml" alone does not load the html submodule

lxml.html.fromstring(s).text_content()

This strips all tags and converts all entities to their corresponding characters.
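If lxml is not installed, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, not how lxml does it; the names TextExtractor and text_content below are made up for illustration, and the sketch keeps all text, including anything inside <script> or <style>.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)


def text_content(markup):
    """Return the text of `markup` with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(text_content("<p>Good, <b>bad</b> &amp; <i>ugly</i></p>"))
```

Because the tags are never written to the output, this removes every tag, not just the ones you list.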


1 Comment

Thanks! I get AttributeError: 'module' object has no attribute 'html' when I try this though
1

The answer depends on your exact needs. You might have a look at regular expressions, but I would advise you to use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) if you want to clean up bad XML or HTML.
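For the regular-expression route, a minimal sketch might look like the following. It assumes the input only contains simple formatting tags such as <b> or <i>; the pattern and the name strip_tags_regex are illustrative choices, and the approach is likely to misbehave on comments, <script> blocks, or attribute values that contain ">".

```python
import html
import re

# Anything from "<" up to the next ">" is treated as a tag. This is an
# assumption that suits simple formatting tags, not arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text):
    """Remove tag-like spans, then decode entities such as &amp;."""
    # Entities are decoded after the tags are removed, so an escaped
    # "&lt;b&gt;" in the text is not mistaken for a real tag.
    return html.unescape(TAG_RE.sub("", text))


print(strip_tags_regex("<p>Good, <b>bad</b> &amp; <i>ugly</i></p>"))
```

This handles opening and closing tags in one pass, which a chain of str.replace calls only does if every closing tag is listed as well.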

4 Comments

Doesn't sound like he wants to parse any html, just strip it all away so he is left with plain text (kind of like the innerHTML function).
Stephen, you're correct. I'm not trying to parse the string, I just want to remove the HTML formatting (anything inside a <> I want removed completely)
Oops, I meant the innerText property, not the "innerHTML function"
You will not be able to "just" remove the HTML formatting without more sophisticated parsing. Might be possible for some simple samples, but not for complex ones.
1

Here's how to use the BeautifulSoup module to replace only some tags, leaving the rest of the HTML alone:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

def strip_tags(html, invalid_tags):
    """Remove the listed tags but keep their contents and all other markup."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns every tag as a list, so unwrapping inside the loop is safe
    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            # unwrap() replaces the tag with its children, nested tags included
            tag.unwrap()
    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print(strip_tags(html, invalid_tags))

Result:

<p>Good, bad, and ugly</p>
