Removing HTML tags using Python?

Question

I know there have probably been a million questions on this, but I'm wondering how to remove these tags without having to import or use HTMLParser or regex. I've tried a bunch of different replace statements to try and remove parts of the strings enclosed by < >'s, to no avail.

Basically what I'm working with is:

response = urlopen(url)
html = response.read()
html = html.decode()

From here I am just trying to manipulate the string variable html to do the above. Is there any way to do it as I specified, or must I use the methods I have already seen?

I also tried a for loop that goes through every character to check whether it is enclosed in a tag, but for some reason it wouldn't give me the proper printout. The loop was:

for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')

Would appreciate any input.

Please don't use regex for parsing HTML. It won't work; see stackoverflow.com/questions/1732348/… for an amusing explanation.
– Joan Smith, Commented Feb 26, 2014 at 14:06
"without having to import or use HTMLParser or regex": why do you give yourself such silly restrictions?
– Burhan Khalid, Commented Feb 26, 2014 at 14:13

ndpu · Accepted Answer · 2014-02-26 14:30:48Z

1

str.replace returns a copy of the string with all occurrences of the substring replaced by the new one. It does not change the original string, so you can't use it the way you do, and you shouldn't modify the string your loop is iterating over anyway. Using an extra list is one way to go:

txt = []
delete = False  # True while we are inside a tag
for i in html:
    if i == '<':
        delete = True
        continue
    if i == '>':
        delete = False
        continue
    if delete:
        continue

    txt.append(i)

Now the txt list contains the resulting text, and you can join it:

print(''.join(txt))
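
As a side note on why the loop in the question seemed to do nothing, here is a minimal sketch. Strings are immutable, so str.replace hands back a new string and leaves the original alone. The variable names are only for illustration and are not from the question.

```python
# Illustrative names only; not taken from the question's code.
original = '<b>bold</b>'

# The return value is discarded here, so `original` is left untouched.
original.replace('<', '')
unchanged = original

# Binding a name to the returned copy keeps the result.
# Note this removes only the '<' characters, not whole tags.
stripped = original.replace('<', '')
```

This is why a character-by-character approach needs somewhere to collect the kept characters, such as the extra list above.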

Demo:

html = '<body><div>some</div><div>text</div></body>'
#...
>>> txt
['s', 'o', 'm', 'e', 't', 'e', 'x', 't']
>>> ''.join(txt)
'sometext'
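
The same loop can also be wrapped in a function, which keeps the flag initialised in one place. This is only a sketch: the name strip_tags is hypothetical, and like the loop above it treats every < as the start of a tag, so a literal < or > in the text, as well as comments or script contents, will not be handled the way a real parser would handle them.

```python
def strip_tags(html):
    """Return html with everything between '<' and '>' dropped.

    Hypothetical helper: a character-level sketch, not a real HTML parser.
    """
    txt = []
    inside_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            txt.append(ch)
    return ''.join(txt)


result = strip_tags('<body><div>some</div><div>text</div></body>')
```

With the demo string, result is 'sometext', the same as joining the list by hand.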

edited Feb 26, 2014 at 14:30

answered Feb 26, 2014 at 14:11


1 Comment

user2909869 Over a year ago

Thanks, I've been looking for a way to do this without having to use some pre-implemented method as I don't really learn anything from that.
