Remove all HTML in Python?

Question

Is there a way to remove/escape HTML tags using lxml.html rather than BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.

How does beautifulsoup have cross-site scripting problems?

– jball, Commented Oct 19, 2010 at 22:40
Maybe they meant CSS.

– jrc, Commented Nov 3, 2018 at 21:36

Steven · Accepted Answer · 2010-10-20 08:23:56Z

12

Try the .text_content() method on an element, ideally after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:

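from lxml import html
from lxml.html.clean import clean_html

# parse() returns an ElementTree, not an element
tree = html.parse('http://www.example.com')
# Remove scripts, styles and other unwanted content first
tree = clean_html(tree)

# text_content() returns the text with all markup removed
text = tree.getroot().text_content()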
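For HTML that is already in a string, the same idea can be sketched as below. This is only an illustration under a few assumptions: html_to_text and sample are hypothetical names, and lxml.etree.strip_elements is swapped in for clean_html so the sketch depends only on lxml itself (as far as I know, recent lxml releases ship lxml.html.clean separately as the lxml_html_clean package). Unlike clean_html, it removes only script and style elements before extracting the text.

```python
from lxml import etree, html


def html_to_text(markup: str) -> str:
    """Return the text of an HTML string with all markup removed."""
    # fromstring() returns an element, so no getroot() call is needed
    root = html.fromstring(markup)
    # Drop <script> and <style> elements together with their contents,
    # but keep any text that follows them (with_tail=False)
    etree.strip_elements(root, "script", "style", with_tail=False)
    # text_content() concatenates the remaining text nodes
    return root.text_content()


# Hypothetical sample input
sample = "<div><p>Hello <b>world</b>.</p><script>alert(1)</script></div>"
text = html_to_text(sample)
```

Without the strip_elements call, the body of the script element would end up in the returned text, which is the reason for cleaning before calling text_content().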


5 Comments

Timmy Over a year ago

I want to get rid of everything, not just unsafe tags

Steven Over a year ago

If you want to get rid of everything, why not just text=''? ;-) Seriously, text_content() WILL get rid of all markup, but cleaning will also get rid of e.g. CSS stylesheet rules and JavaScript, which are also encoded as text inside the element (I assumed you were only interested in the "real" text, hence the cleanup first).

Timmy Over a year ago

I was using clean_html(string), which does different things.

kommradHomer Over a year ago

When I use html.fromstring instead of html.parse, I get an error: "AttributeError: 'HtmlElement' object has no attribute 'getroot'"

Steven Over a year ago

@kommradHomer: That is because parse() returns an ElementTree, but fromstring() returns an element (so you don't need getroot() in your case).
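The difference between the two entry points can be sketched as follows. The markup value is a hypothetical sample, and io.BytesIO stands in for a file or URL so the sketch runs without network access.

```python
import io

from lxml import html

# Hypothetical sample document
markup = b"<html><body><p>Hello <b>world</b></p></body></html>"

# parse() reads from a file, URL or file-like object and returns an ElementTree
tree = html.parse(io.BytesIO(markup))
# fromstring() parses a string and returns an element directly
root = html.fromstring(markup)

# An ElementTree needs getroot() before text_content() can be called
text_from_tree = tree.getroot().text_content()
# An element exposes text_content() itself and has no getroot()
text_from_element = root.text_content()
```

Both variables end up holding the same text; only the way of reaching the root element differs.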

dni · Accepted Answer · 2013-03-22 13:52:14Z

12

I believe this code can help you:

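from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title><body>Text</body></html>"
# The allow-list contains no real tag name, so every tag is dropped and only text is kept
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
# clean_html() returns a string here because it was given a string
cleaned_text = cleaner.clean_html(html_text)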


3 Comments

cjauvin Over a year ago

After a quick experiment, this solution seems to do a much better job than, for instance, stackoverflow.com/a/5332984/787842. What I'd like to know more about is how to properly parametrize the Cleaner object (there are many, many options). In this case, for instance, an empty allow_tags list with remove_unknown_tags set to False looks a bit weird to me, logically.

dni Over a year ago

@cjauvin: Of course, you are right! It's a kind of hack. But I'm sure no one wants to list every tag in the remove_tags argument when they want to remove all of them. Unfortunately, in this case the implementation of Cleaner encourages users to combine allow_tags with remove_unknown_tags for this purpose: github.com/lxml/lxml/blob/…

cmc Over a year ago

This wraps the result in a div

cmc · Accepted Answer · 2019-01-16 12:06:48Z

1

This uses lxml's cleaning functions, but avoids the result being wrapped in an HTML element.

import lxml.html
import lxml.html.clean

# html_string holds the HTML markup to strip
doc = lxml.html.document_fromstring(html_string)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()

or as a one-liner:

text = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_string)).text_content()

It works by parsing the HTML manually into a document object and passing that to the Cleaner class. That way clean_html also returns an object rather than a string. The text can then be recovered without a wrapper element using the text_content() method.

