I'm trying to save the content of an HTML page to a .html file, but I only want to keep the content inside the <table> tags. In addition, I'd like to remove all empty tags like <b></b>. I have already done all of this with BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)

# collect the HTML of every <table class="main">
txt = ""
for table in soup.find_all("table", {'class': 'main'}):
    txt += str(table)

# re-parse the collected tables and drop the empty <b> tags
tables = BeautifulSoup(txt)
empty_tags = tables.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()

My question is: is this also possible with lxml? If so, roughly what would it look like? Thanks a lot for any help.

  • tables = lxml.html.parse('http://test.xyz').getroot().cssselect('table.main') will get you the <table> elements with class "main". [lxml.html.tostring(t, method="html", encoding=unicode) for t in tables] will get you HTML content (method="text" will give you the text content without tags). What are the empty tags you want to exclude? Commented Aug 25, 2013 at 22:23
  • Thanks for your reply. Empty tags are just tags with no content, for example: <i></i><i></i><b></b> Commented Aug 25, 2013 at 22:31
  • Thanks a lot! I commented it :) Commented Aug 26, 2013 at 0:45

1 Answer

import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

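# extract HTML content from all tables
# use lxml.html.tostring(t, method="text", encoding=unicode)
# to get text content without tags
# (run this after the removal loops below if the output
# should no longer contain the empty tags)
content = "\n".join([lxml.html.tostring(t) for t in tables])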

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have child nodes)
# note: this also matches void elements such as <br> and <img>
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)
# root does not contain those empty tags anymore
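
One caveat with getparent().remove(): in lxml, an element's tail text (the text that follows its closing tag) is removed together with the element. Below is a minimal sketch comparing it with lxml.html's drop_tree(), which merges the tail text into the surrounding text instead. It assumes Python 3 and an inline sample document rather than http://test.xyz; the sample markup and the names SAMPLE, EMPTY_B_OR_I, tables_html, strip_with_remove and strip_with_drop_tree are made up for illustration.

```python
import lxml.html

# Hypothetical sample; a real page would come from lxml.html.parse(url)
SAMPLE = (
    '<html><body><table class="main"><tr><td>'
    'keep <b></b>this tail<i></i> and <b>bold</b>'
    '</td></tr></table></body></html>'
)

# Same XPath as in the answer: <b> or <i> elements without child nodes
EMPTY_B_OR_I = '//*[self::b or self::i][not(node())]'


def tables_html(root):
    """Serialize every <table class="main"> under root to a str."""
    tables = root.xpath('//table[@class="main"]')
    return "\n".join(
        lxml.html.tostring(t, encoding="unicode", with_tail=False)
        for t in tables
    )


def strip_with_remove(html):
    root = lxml.html.fromstring(html)
    for empty in root.xpath(EMPTY_B_OR_I):
        # remove() takes the element's tail text away with it
        empty.getparent().remove(empty)
    return tables_html(root)


def strip_with_drop_tree(html):
    root = lxml.html.fromstring(html)
    for empty in root.xpath(EMPTY_B_OR_I):
        # drop_tree() keeps the tail text by joining it to the
        # previous sibling or to the parent's text
        empty.drop_tree()
    return tables_html(root)


removed = strip_with_remove(SAMPLE)
dropped = strip_with_drop_tree(SAMPLE)
```

Here removed loses the words "this tail" and "and", while dropped keeps them. Either string can then be written out with something like open(path, "w", encoding="utf-8"), where path is whatever output file you choose.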

7 Comments

Thanks a lot for this reply! :) Is it possible to remove specific empty tags (for example b: <b></b>) ? And is it possible to substitute errors like "&amp" by ""?
Edited the answer with expression to remove only specific empty tags. To remove "&amp" you'll be better off using a regular expression like re.sub("&[^\s;]+\s", "", mystring) (probably needs some further testing)
I've got another question: the code above has a problem when there is more than one table.main element and they are nested. If the original contains something like: <table class="main"> hi there <table class="main"> what's up? </table> byebye </table>, the output file will contain the following: <table class="main"> hi there <table class="main"> what's up? </table> byebye </table> <table class="main"> what's up? </table>. So the inner table appears a second time at the end. Why is this, and how can it be fixed? Thanks a lot for any help. :)
I feel it's really another question as the BeautifulSoup code in the question would also have the same problem with soup.find_all("table", {'class': 'main'}):. But I think you could change root.cssselect('table.main') to XPath with something like root.xpath('//table[@class="main" and not(.//table[@class="main"])]'), meaning select all <table> from the root node that have class "main" but that do not have a descendant <table> with the class "main" (Note: I haven't tested it (yet))
If I tested this correctly, the additional line extracts the part I'd like to remove. :) In our example, my output file now looks as follows: <table class="main"> what's up? </table>
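
To make the nested-table discussion above concrete, here is a small sketch, assuming Python 3 and an inline sample with the nesting placed inside a <td> (the SAMPLE markup and variable names are made up for illustration). The innermost expression is the XPath suggested in the comment; the outermost variant using the ancestor axis is an additional assumption of mine, not something proposed in the thread, for the case where each outer table should be written once with its nested content intact.

```python
import lxml.html

# Hypothetical sample reproducing the nested class="main" tables
SAMPLE = (
    '<html><body><table class="main"><tr><td>hi there '
    '<table class="main"><tr><td>what\'s up?</td></tr></table>'
    ' byebye</td></tr></table></body></html>'
)

root = lxml.html.fromstring(SAMPLE)

# Matches both the outer and the inner table, so serializing every
# match repeats the inner table's content
all_tables = root.xpath('//table[@class="main"]')

# From the comment: only tables with no "main" table inside them
innermost = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)

# Assumed alternative: only tables that are not inside another "main" table
outermost = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

outer_html = "\n".join(
    lxml.html.tostring(t, encoding="unicode", with_tail=False)
    for t in outermost
)
```

With this sample, innermost selects only the "what's up?" table, which matches the output reported above, while outer_html contains the outer table with the inner one nested in it exactly once.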