Parsing HTML with lxml (Python)

Question

I'm trying to save the content of an HTML page to an .html file, but I only want to keep the content inside the "table" tags. In addition, I'd like to remove all empty tags such as <b></b>. I have already done all of this with BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)

txt = ""

# collect the HTML of every <table class="main">
for text in soup.find_all("table", {'class': 'main'}):
    txt += str(text)

# re-parse the collected tables and drop every empty <b> tag
text = BeautifulSoup(txt)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()

My question is: is this also possible with lxml? If so, roughly what would it look like? Thanks a lot for any help.

tables = lxml.html.parse('http://test.xyz').getroot().cssselect('table.main') will get you the <table> elements with class "main". [lxml.html.tostring(t, method="html", encoding=unicode) for t in tables] will get you HTML content (method="text" will give you the text content without tags). What are the empty tags you want to exclude?
– paul trmbrth, Commented Aug 25, 2013 at 22:23
Thanks for your reply. Empty tags are just tags with no content, for example: <i></i><i></i><b></b>
– MarkF6, Commented Aug 25, 2013 at 22:31

paul trmbrth · Accepted Answer · 2013-08-30 10:20:24Z

import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

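# extract HTML content from all tables
# use lxml.html.tostring(t, method="text", encoding="unicode")
# to get text content without tags
# (encoding="unicode" returns text instead of bytes, so the join works)
tables_html = "\n".join([lxml.html.tostring(t, encoding="unicode") for t in tables])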

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have children nodes);
# note that this also matches void elements such as <br> and <img>
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)
# root does not contain those empty tags anymore

edited Aug 30, 2013 at 10:20

answered Aug 25, 2013 at 22:52

paul trmbrth


7 Comments

MarkF6 Over a year ago

Thanks a lot for this reply! :) Is it possible to remove specific empty tags (for example b: <b></b>)? And is it possible to replace errors like "&amp" with ""?

paul trmbrth Over a year ago

Edited the answer with expression to remove only specific empty tags. To remove "&amp" you'll be better off using a regular expression like re.sub("&[^\s;]+\s", "", mystring) (probably needs some further testing)
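
A small sketch of how that regular expression behaves, using only the standard library. The sample strings and the `strip_unterminated_entities` name are made up for illustration. The last call shows why further testing is advisable: any "&" followed by non-space characters and a space is removed, not just broken entities.

```python
import re

# Pattern from the comment above: "&", then one or more characters that are
# neither whitespace nor ";", then a single whitespace character.
UNTERMINATED_ENTITY = re.compile(r"&[^\s;]+\s")


def strip_unterminated_entities(text):
    """Remove '&name ' sequences that lack the closing semicolon."""
    return UNTERMINATED_ENTITY.sub("", text)


print(strip_unterminated_entities("fish &amp chips"))   # fish chips
print(strip_unterminated_entities("fish &amp; chips"))  # fish &amp; chips
print(strip_unterminated_entities("R&D dept"))          # Rdept
```

A properly terminated entity such as "&amp;" is left alone because the pattern refuses to cross a semicolon.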

MarkF6 Over a year ago

I've got another question: the above code has a problem handling more than one 'table.main' element. If the original contains something like <table class="main"> hi there <table class="main"> what's up? </table> byebye </table>, the output file will contain the following: <table class="main"> hi there <table class="main"> what's up? </table> byebye </table> <table class="main"> what's up? </table>. So the content of the inner table appears a second time at the end. Why is this, and how can it be fixed? Thanks a lot for any help. :)

paul trmbrth Over a year ago

I feel it's really another question as the BeautifulSoup code in the question would also have the same problem with soup.find_all("table", {'class': 'main'}):. But I think you could change root.cssselect('table.main') to XPath with something like root.xpath('//table[@class="main" and not(.//table[@class="main"])]'), meaning select all <table> from the root node that have class "main" but that do not have a descendant <table> with the class "main" (Note: I haven't tested it (yet))
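
The effect of that XPath can be checked on a small nested example. This is a sketch assuming Python 3 and lxml; the `NESTED` markup is made up and uses proper cells so the nesting is unambiguous. Note that `@class="main"` matches only when the attribute is exactly "main", whereas the CSS selector `table.main` also matches elements carrying additional classes.

```python
import lxml.html

# Made-up input: one table.main nested inside another.
NESTED = (
    '<div>'
    '<table class="main"><tr><td>outer'
    '<table class="main"><tr><td>inner</td></tr></table>'
    '</td></tr></table>'
    '</div>'
)

root = lxml.html.fromstring(NESTED)

# Every table.main, nested or not: the inner one is matched on its own
# and is also contained in the outer one, hence the duplicated output.
all_main = root.xpath('//table[@class="main"]')

# Only tables that do not contain another table.main (the innermost ones).
innermost = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)

print(len(all_main))                          # 2
print([t.text_content() for t in innermost])  # ['inner']
```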

MarkF6 Over a year ago

If I tested this correctly, the additional line extracts exactly the part I'd like to remove. :) In our example, my output file now looks as follows: <table class="main"> what's up? </table>
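
If the goal is to keep the outer table and avoid writing the inner one twice, a possible variant (not from the thread, and only a sketch) is to test the ancestors instead of the descendants. It assumes Python 3 and lxml, and the `NESTED` markup is made up for illustration.

```python
import lxml.html

# Made-up input: one table.main nested inside another.
NESTED = (
    '<div>'
    '<table class="main"><tr><td>outer'
    '<table class="main"><tr><td>inner</td></tr></table>'
    '</td></tr></table>'
    '</div>'
)

root = lxml.html.fromstring(NESTED)

# Only tables that are not themselves inside another table.main.
outermost = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# The nested table is serialized once, as part of its parent.
html = "\n".join(
    lxml.html.tostring(t, encoding="unicode", with_tail=False)
    for t in outermost
)
print(html)
```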
