I know there have probably been a million questions on this, but I'm wondering how to remove HTML tags from a string without importing HTMLParser or using regex. I've tried a bunch of different replace statements to remove the parts of the string enclosed in < and >, to no avail.

Basically what I'm working with is:

response = urlopen(url)
html = response.read()
html = html.decode()

From here I am just trying to manipulate the string variable html to do the above. Is there any way to do it as I specified, or must I use the methods I have seen before?

I also tried a for loop that goes through every character and checks whether it is inside a tag, but for some reason it doesn't print the right result:

for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')

Would appreciate any input.

2 Comments
  • Please don't use regex for parsing HTML. It won't work; see stackoverflow.com/questions/1732348/… for an amusing explanation. Commented Feb 26, 2014 at 14:06
  • "without having to import or use HTMLParser or regex": why do you give yourself such silly restrictions? Commented Feb 26, 2014 at 14:13

1 Answer

str.replace returns a copy of the string with all occurrences of the substring replaced by the new one. Strings are immutable, so calling html.replace(...) without assigning the result changes nothing. You also shouldn't try to modify the string your loop is iterating over. Collecting the characters you want to keep in a separate list is one way to do it:

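txt = []
delete = False  # must be set before the loop, otherwise the first check raises NameError
for i in html:
    if i == '<':
        # a tag starts here: skip everything up to the closing '>'
        delete = True
        continue
    if i == '>':
        # the tag ends here: resume collecting characters
        delete = False
        continue
    if delete:
        continue

    txt.append(i)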

Now the list txt contains the characters of the resulting text, and you can join them:

print(''.join(txt))

Demo:

html = '<body><div>some</div><div>text</div></body>'
#...
>>> txt
['s', 'o', 'm', 'e', 't', 'e', 'x', 't']
>>> ''.join(txt)
'sometext'
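
For reuse, the same loop can be wrapped in a function. This is only a minimal sketch: the name strip_tags is made up for illustration, and it assumes every '<' opens a tag and every '>' closes one. It will give wrong results when '>' appears inside an attribute value, and it does not treat comments or the contents of script and style elements specially.

```python
def strip_tags(html):
    """Return html with everything between '<' and '>' removed.

    Hypothetical helper: assumes '<' always opens a tag and '>' always
    closes one, which is not true for all real-world HTML.
    """
    txt = []
    inside_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            # ordinary text outside any tag: keep it
            txt.append(ch)
    return ''.join(txt)


print(strip_tags('<body><div>some</div><div>text</div></body>'))  # sometext
```

Building a list and joining it once at the end avoids creating a new string for every character kept.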

1 Comment

Thanks, I've been looking for a way to do this without having to use some pre-implemented method as I don't really learn anything from that.
