I'm trying to save the content of an HTML page to a .html file, but I only want to save the content inside the "table" tag. In addition, I'd like to remove all empty tags such as <b></b>.
I have already done all of this with BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)
txt = ""
# Collect the markup of every <table class="main">
for table in soup.find_all("table", {'class': 'main'}):
    txt += str(table)
# Re-parse the collected markup and remove the empty <b> tags
text = BeautifulSoup(txt)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()
My question is: is this also possible with lxml? If so, roughly what would it look like? Thanks a lot for any help.
This will get you the <table> elements with class "main":
tables = lxml.html.parse('http://test.xyz').getroot().cssselect('table.main')
And this will get you their HTML content:
[lxml.html.tostring(t, method="html", encoding=unicode) for t in tables]
(method="text" will give you the text content without tags.) What are the empty tags you want to exclude?