
Is there a way to remove/escape HTML tags using lxml.html and not BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.

2 Comments

  • How does BeautifulSoup have cross-site scripting problems?
  • Maybe they meant CSS.

3 Answers


Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:

from lxml import html
from lxml.html.clean import clean_html

# parse() fetches and parses the page, returning an ElementTree
tree = html.parse('http://www.example.com')
# clean_html() strips scripts, styles and other unwanted content
tree = clean_html(tree)

# getroot() is needed because parse() returns a tree, not an element
text = tree.getroot().text_content()
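
Note that text_content() concatenates all text nodes, so the result often contains stray whitespace where markup was removed; a common follow-up (a sketch, not part of the original answer) is to normalize it:

text = ' '.join(text.split())  # collapse runs of whitespace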

5 Comments

I want to get rid of everything, not just unsafe tags.
If you want to get rid of everything, why not just text=''? ;-) Seriously, text_content() WILL get rid of all markup, but cleaning will also get rid of e.g. CSS stylesheet rules and JavaScript, which are also encoded as text inside the element (but I assumed you were only interested in the "real" text, hence the cleanup first).
I was using clean_html(string), which does different things.
When I use html.fromstring instead of html.parse, I get the error "AttributeError: 'HtmlElement' object has no attribute 'getroot'".
@kommradHomer: that is because parse() returns an ElementTree, but fromstring() returns an element, so you don't need the getroot() in your case; see the sketch below.
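
To illustrate the parse()/fromstring() difference, a minimal sketch (the sample HTML string here is made up for illustration):

from lxml import html
from lxml.html.clean import clean_html

raw = "<html><body><script>alert(1)</script><p>Hello</p></body></html>"

# fromstring() returns an HtmlElement directly, so no getroot() is needed
element = clean_html(html.fromstring(raw))
print(element.text_content())  # the script content is gone; prints "Hello"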

I believe this code can help you:

from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title><body>Text</body></html>"
# allow_tags=[''] allows no tags at all; remove_unknown_tags must be
# False whenever allow_tags is given
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
cleaned_text = cleaner.clean_html(html_text)

3 Comments

After a quick experiment, this solution seems to be doing a much better job than, for instance, stackoverflow.com/a/5332984/787842; but what I'd like to know more about is the way to properly parametrize the Cleaner object (as there are many, many options); for instance in this case, having an empty allow_tags list with remove_unknown_tags set to False looks a bit weird to me, logically.
@cjauvin: Of course, you are right! It's a kind of hack. But I'm sure no one wants to specify all the tags to remove in the remove_tags argument if they want to remove all of them. Unfortunately, in this case the implementation of Cleaner encourages users to use allow_tags with remove_unknown_tags for this purpose: github.com/lxml/lxml/blob/…
This wraps the result in a div; see the sketch below.
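
On parametrizing the Cleaner, a short sketch showing a few of its options (the sample input is made up; the exact output may vary by lxml version):

from lxml.html.clean import Cleaner

cleaner = Cleaner(
    scripts=True,               # drop <script> tags
    javascript=True,            # strip JavaScript, e.g. onclick attributes
    style=True,                 # drop <style> tags and style attributes
    comments=True,              # drop HTML comments
    allow_tags=[''],            # the "allow no tags" hack discussed above
    remove_unknown_tags=False,  # must be False when allow_tags is given
)
print(cleaner.clean_html("<p onclick='x()'>Hi <b>there</b></p>"))
# prints something like: <div>Hi there</div> -- note the wrapping div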

This uses lxml's cleaning functions, but avoids the result being wrapped in an HTML element.

import lxml.html
import lxml.html.clean

# html_str holds the raw HTML input
doc = lxml.html.document_fromstring(html_str)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()

Or as a one-liner:

lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_str)).text_content()

It works by parsing the HTML manually into a document object and handing that to the Cleaner class. That way clean_html also returns an object rather than a string, and the text can then be recovered, without a wrapper element, using the text_content() method.
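
For reference, a runnable version of the above (the sample HTML string is made up for illustration):

import lxml.html
import lxml.html.clean

html_str = "<html><body><p>Some <b>bold</b> text</p></body></html>"
doc = lxml.html.document_fromstring(html_str)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
print(cleaner.clean_html(doc).text_content())  # should print: Some bold text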
