Remove an HTML tag and the string in between in Python

Question

I'm pretty new to regular expressions. I would like to use a regular expression to remove every <sup> ... </sup> element (the tags and the text between them) from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

Is there a short way to do this, with a description of how it works?

Note: this question might be a duplicate. I searched but couldn't find a solution.

Regex is not the way to deal with HTML; use an HTML parser. HTML isn't a simple string, it's structured data. The easiest to use is BeautifulSoup, but it's only a wrapper around more efficient libraries that you can also use directly.
– Casimir et Hippolyte, Commented Aug 19, 2016 at 19:40
I have a list of short strings like the one above. I guess a regular expression will work here without an HTML parser.
– titipata, Commented Aug 19, 2016 at 19:43
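
For anyone who wants to follow the parser suggestion without installing anything, here is a minimal sketch using the standard-library html.parser module (not BeautifulSoup). The class name SupStripper and the helper remove_sup are made up for illustration. It re-emits the markup it sees and skips everything inside <sup> elements; it assumes comments and doctype declarations can be dropped and that end tags may be written back in normalized lowercase form.

```python
from html.parser import HTMLParser


class SupStripper(HTMLParser):
    """Re-emit the input markup, skipping everything inside <sup> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.depth = 0  # number of <sup> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'sup':
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        if self.depth == 0 and tag != 'sup':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'sup':
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append('&#%s;' % name)


def remove_sup(html):
    """Hypothetical helper: return html with all <sup> elements removed."""
    parser = SupStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
print(remove_sup(s))
```

Unlike a plain regex, the depth counter copes with nested <sup> elements and with attributes on the tag, at the cost of more code for short strings like these.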

Terry Jan Reedy · Accepted Answer · 2016-08-19 19:52:47Z

The hard part is making the match between the tags minimal (non-greedy) rather than maximal (greedy); adding ? after .* does that. This works:

import re

s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
# .*? is non-greedy: it stops at the first </sup> it can reach
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
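
If the real strings are less uniform than the sample, the pattern may need loosening. The following is a sketch under assumptions that are not in the question: that the tag may carry attributes, may be upper case, or that its content may span several lines. The names SUP_RE and strip_sup are made up for illustration.

```python
import re

# Assumptions (not from the question): attributes, mixed case, multi-line content.
# \b keeps the pattern from matching longer tag names that merely start with "sup".
# re.DOTALL lets . match newlines; re.IGNORECASE accepts <SUP> as well as <sup>.
SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup>', re.DOTALL | re.IGNORECASE)


def strip_sup(text):
    """Remove every <sup>...</sup> element, including its contents."""
    return SUP_RE.sub('', text)


sample = '<b>x</b><SUP class="ref">1,\n2</SUP>, y<sup>3</sup>'
print(strip_sup(sample))
# <b>x</b>, y
```

This still does not handle a <sup> nested inside another <sup>; for input like that a parser is the safer choice.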

1 Comment

Terry Jan Reedy Over a year ago

Beaten by Ryan with same answer.

Ryan · Accepted Answer · 2016-08-19 19:52:02Z

You could do something like this:

import re
s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

s2 = re.sub(r'<sup>(.*?)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>, another here

Remember to use the non-greedy (.*?). A plain (.*) is a greedy quantifier: it matches from the first <sup> all the way to the last </sup>, so you would obtain a different result:

s2 = re.sub(r'<sup>(.*)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>
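
To see why the two results differ, it can help to look at what each pattern actually matches. A small sketch using re.findall on the same string (the variable names lazy and greedy are only for illustration):

```python
import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Non-greedy: each match ends at the nearest </sup>, giving three short matches
lazy = re.findall(r'<sup>.*?</sup>', s)
print(lazy)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']

# Greedy: one match that runs from the first <sup> to the last </sup>
greedy = re.findall(r'<sup>.*</sup>', s)
print(greedy)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

The greedy match swallows ", another here" along with the tags, which is why that text disappears from the output.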

edited Aug 19, 2016 at 19:52

answered Aug 19, 2016 at 19:48

1 Comment

titipata Over a year ago

Thanks @Ryan, this is exactly what I'm looking for.
