Removing HTML tags and entities from a string in Python

Question

I am getting XML data from api.careerbuilder.com. In particular, the string contains some HTML entities that I want to remove, but nothing I have tried has had any effect.

I have tried doing this:

import re
re.sub('\&amp;lt;.*?\&amp;gt;', '', job_title_text)

and this

from html.parser import HTMLParser
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

strip_tags(job_title_text)

and finally this

import lxml.html
(lxml.html.fromstring(job_title_text)).text_content()

None of these worked. The second approach removed HTML entities like "&amp", but the text inside the tags was left behind, for example "pbrspan". The third one ruined everything: no data was shown at all, only this:

<bound method HtmlElement.text_content of <Element html at 0x33717d8>>

Finally, I suspect that the regex I have written is entirely wrong. Any ideas on how this can be handled?

text_content is a method, not an attribute, meaning you need to call it (text_content()) for it to yield anything useful.
– Max Noel, Commented Dec 24, 2013 at 21:03

arm.localhost · Accepted Answer · 2013-12-24 20:19:49Z

1

Try this regular expression:

(\&lt\;).*?(\&gt\;)
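
A minimal sketch of how this pattern could be applied, assuming the tags in the string arrive entity-encoded (for example "&lt;p&gt;"). The function name strip_encoded_tags and the sample string are hypothetical, and the html.unescape call is an added step that is not part of the answer; it decodes leftover entities such as "&amp;". Note that the substitution returns a new string, so the result has to be kept.

```python
import html
import re

# Pattern from the answer above: matches one entity-encoded tag, e.g. "&lt;p&gt;".
ENCODED_TAG = re.compile(r"(\&lt\;).*?(\&gt\;)")


def strip_encoded_tags(text):
    """Remove entity-encoded tags, then decode the remaining entities."""
    # Pattern.sub returns a new string; the input is not modified in place.
    without_tags = ENCODED_TAG.sub("", text)
    # Added step (assumption): turn leftovers such as "&amp;" into "&".
    return html.unescape(without_tags)


# Hypothetical sample standing in for job_title_text.
sample = "&lt;p&gt;Sales &amp; Marketing &lt;b&gt;Manager&lt;/b&gt;&lt;/p&gt;"
print(strip_encoded_tags(sample))  # prints: Sales & Marketing Manager
```

The tags are stripped before unescaping on purpose: decoding first would turn "&lt;p&gt;" into a real "<p>", which this pattern no longer matches.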

Ali SAID OMAR · Accepted Answer · 2013-12-24 20:55:25Z

0

Consider using BeautifulSoup to remove the tags. It is well documented: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Removing%20elements
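
As a rough illustration of this suggestion, here is a sketch that assumes the newer bs4 package (Beautiful Soup 4) rather than the BS3 release the link points to, and that the tags are entity-encoded in the input. The function name strip_tags and the sample string are hypothetical.

```python
import html

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def strip_tags(text):
    """Return only the visible text of an entity-encoded HTML fragment."""
    # Assumption: the tags are entity-encoded ("&lt;p&gt;"), so decode once
    # to get real markup before handing it to the parser.
    markup = html.unescape(text)
    # get_text() joins the text nodes and drops every tag.
    return BeautifulSoup(markup, "html.parser").get_text()


# Hypothetical sample standing in for job_title_text.
sample = "&lt;p&gt;Senior &lt;b&gt;Python&lt;/b&gt; Developer&lt;/p&gt;"
print(strip_tags(sample))  # prints: Senior Python Developer
```

If the input already contains real tags rather than encoded ones, the html.unescape call is harmless for the tags themselves, though it would also decode any entities in the text.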
