I'm fairly new to regular expressions. I would like to use one to remove every <sup> ... </sup> element, including its contents, from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

Is there a short way to do this, ideally with an explanation of how it works?

Note: this question might be a duplicate. I searched but couldn't find a solution.

  • Regex is not the right tool for HTML; use an HTML parser. HTML isn't a simple string, it's structured data. The easiest parser to use is BeautifulSoup, but it's only a wrapper around more efficient libraries that you can also use directly. Commented Aug 19, 2016 at 19:40
  • I have a list of short strings like the one above. I think a regular expression will work here without an HTML parser. Commented Aug 19, 2016 at 19:43
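
For comparison, the parser-based approach suggested in the first comment could look roughly like the sketch below. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs. The names SupStripper and strip_sup are made up for this illustration, and the sketch assumes short fragments like the one in the question: comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser


class SupStripper(HTMLParser):
    """Rebuilds the input while skipping everything inside <sup>...</sup>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.depth = 0  # how many <sup> elements are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as written.
        if self.depth == 0 and tag != "sup":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "sup":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_sup(html):
    parser = SupStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
print(strip_sup(s))
# <b>something here</b>, another here
```

Unlike a simple regex, the depth counter also copes with a <sup> nested inside another <sup>, at the cost of considerably more code.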

2 Answers

The hard part is knowing how to make a minimal (non-greedy) rather than maximal (greedy) match of the text between the tags. The ? after .* is what makes the match minimal. This works:

import re
s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
# .*? matches as little as possible, so each match ends at the nearest </sup>
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
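
To see what "minimal" versus "maximal" means in practice, it can help to list the matches instead of removing them. The following is a small sketch using re.findall on the same input; the variable names lazy_matches and greedy_matches are only for illustration.

```python
import re

s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Lazy: each match stops at the nearest closing tag.
lazy_matches = re.findall(r"<sup>.*?</sup>", s0)

# Greedy: a single match runs from the first <sup> to the last </sup>.
greedy_matches = re.findall(r"<sup>.*</sup>", s0)

print(lazy_matches)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']
print(greedy_matches)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

The greedy version swallows ", another here" because that text sits between the first opening tag and the last closing tag.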

1 Comment

Beaten by Ryan with same answer.

You could do something like this:

import re
s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

s2 = re.sub(r'<sup>.*?</sup>', '', s)

print(s2)
# Prints: <b>something here</b>, another here

Remember to use .*? rather than .* here. A plain .* is a greedy quantifier: it matches as much as it can, so the pattern would run from the first <sup> all the way to the last </sup>, and you would get a different result:

s2 = re.sub(r'<sup>.*</sup>', '', s)

print(s2)
# Prints: <b>something here</b>
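
If the strings are less uniform than the sample, the pattern may need to be loosened. The sketch below assumes variations that do not appear in the question (attributes on the tag, uppercase tag names, line breaks inside the element), and the names SUP_RE and remove_sup are made up for the example. It still does not handle a <sup> nested inside another <sup>.

```python
import re

# Assumed variations, none of which appear in the question:
# attributes on the tag, mixed case, and line breaks inside the element.
# \b keeps the pattern from matching longer tag names that start with "sup".
# re.DOTALL lets . match newlines; re.IGNORECASE accepts <SUP> as well.
SUP_RE = re.compile(r"<sup\b[^>]*>.*?</sup\s*>", re.IGNORECASE | re.DOTALL)


def remove_sup(text):
    """Return text with every <sup ...>...</sup> element removed."""
    return SUP_RE.sub("", text)


print(remove_sup('x<SUP class="ref">\n1\n</sup>y'))
# xy
```

Without re.DOTALL, the dot does not match a newline, so an element spanning several lines would be left in place.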

1 Comment

Thanks @Ryan, this is exactly what I'm looking for.
