Is there a way to remove or escape HTML tags using lxml.html instead of BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.
3 Answers
Try the .text_content() method on an element, ideally after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:
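from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')   # parse() returns an ElementTree
tree = clean_html(tree)                        # strip scripts and other unwanted content
text = tree.getroot().text_content()           # concatenated text of the whole document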
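For markup that is already held in a string, the same idea can be sketched with fromstring(), which returns an element directly. The helper name html_to_text is hypothetical, and passing style=True is an assumption on my part: as far as I know the default cleaner leaves <style> elements in place, so their CSS would otherwise end up in the text. Depending on the lxml version, lxml.html.clean may also require the separate lxml_html_clean package to be installed.

```python
from lxml import html
from lxml.html.clean import Cleaner

# Assumption: drop <script> and <style> elements together with their contents
_cleaner = Cleaner(scripts=True, style=True)

def html_to_text(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    element = html.fromstring(markup)         # an element, so no getroot() is needed
    cleaned = _cleaner.clean_html(element)    # element in, cleaned copy of the element out
    return cleaned.text_content()

print(html_to_text('<div><script>alert(1)</script><p>Hello <b>world</b></p></div>'))
```

Building the Cleaner once and reusing it avoids recreating it on every call.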
5 Comments
text=''? ;-) Seriously, text_content() WILL get rid of all markup, but cleaning will also get rid of e.g. CSS stylesheet rules and JavaScript, which are also encoded as text inside the element (I assumed you were only interested in the "real" text, hence the cleanup first).
parse() returns an ElementTree, but fromstring() returns an element (so you don't need the getroot() in your case).
I believe this code can help you:
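from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title></head><body>Text</body></html>"
# allow_tags=[''] matches no real tag name, so the original tags are all stripped
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
# For string input, clean_html() returns a string
cleaned_text = cleaner.clean_html(html_text)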
3 Comments
Cleaner object (as there are many, many options); for instance in this case, having an empty allow_tags list and remove_unknown_tags set to False looks a bit weird to me, logically.
remove_tags, if they want to remove all of them. Unfortunately, in this case the implementation of Cleaner encourages users to use allow_tags with remove_unknown_tags for this purpose: github.com/lxml/lxml/blob/…
This uses lxml's cleaning functions, but avoids the result being wrapped in an HTML element.
import lxml.html
import lxml.html.clean

# html_text holds the HTML source as a string
doc = lxml.html.document_fromstring(html_text)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()
or as a one-liner:
text = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_text)).text_content()
It works by parsing the HTML manually into a document object and giving that to the Cleaner class. That way clean_html() also returns an object rather than a string. The text can then be recovered without a wrapper element using the text_content() method.
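The steps described here could be wrapped in a small helper. This is only a sketch: the name strip_tags is hypothetical, and the early return for blank input is my own addition, since document_fromstring() raises a ParserError when the document is empty.

```python
import lxml.html
from lxml.html.clean import Cleaner

# allow_tags=[''] matches no real tag name, so all markup is stripped
_TEXT_ONLY = Cleaner(allow_tags=[''], remove_unknown_tags=False)

def strip_tags(markup):
    """Return only the text of an HTML string (hypothetical helper)."""
    if not markup.strip():
        return ''                                      # nothing to parse
    doc = lxml.html.document_fromstring(markup)        # parse to an element first
    return _TEXT_ONLY.clean_html(doc).text_content()   # element in, element out

print(strip_tags('<p>Hello <b>world</b></p>'))
```

Keep in mind that the returned text is unescaped: an input containing &lt;b&gt; comes back as a literal <b>. If the result is written back into a page, it should still be passed through something like the standard library's html.escape().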