Remove encoded HTML tags from large string in Python

Question

I have a JSON file with a "description" key whose value contains lots of HTML tags. I would like to remove them. They're encoded, e.g. &lt;ul&gt; instead of <ul>.

I've tried doing text.replace('<.*?>',''), but it doesn't work.
I've also tried with BeautifulSoup doing:

text = soup.get_text()

But that doesn't work either (it only decodes the HTML tags). Finally, I've tried:

soup = BeautifulSoup(text)
text = soup.get_text()
text = text.replace('<.*?>','')

This combines the two approaches, but the tags still don't get deleted.

What I have now in the "text" variable (after using BeautifulSoup, which decodes the HTML tags):
"description":"</li></ul> TESTING AND QUALITY<ul><li>....."

What I want to have in the "text" variable:
"description":"TESTING AND QUALITY"

Your code doesn't work because text.replace() doesn't recognize regular expressions. It's looking for the literal text <.*?>, which of course isn't there.
– John Gordon, Commented Jul 25, 2019 at 21:55
This might be what you're looking for: stackoverflow.com/questions/9662346/…
– dreamzboy, Commented Jul 25, 2019 at 22:23
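
To illustrate John Gordon's point, here is a minimal sketch of the difference between str.replace() and re.sub(). The sample string is the decoded value from the question; the names literal_result and regex_result are made up for this illustration.

```python
import re

# Decoded value from the question (the tags are real "<" and ">" characters here).
text = "</li></ul> TESTING AND QUALITY<ul><li>"

# str.replace() searches for the literal characters "<.*?>".
# That exact sequence never occurs, so the string comes back unchanged.
literal_result = text.replace('<.*?>', '')

# re.sub() treats its first argument as a regular expression,
# so every "<...>" span is matched and removed.
regex_result = re.sub(r'<.*?>', '', text).strip()

print(repr(literal_result))
print(repr(regex_result))
```

The non-greedy `.*?` matters: a greedy `.*` would match from the first `<` to the last `>` and delete the text in between as well.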

lambdawaff · Accepted Answer · 2019-07-25 22:14:13Z

1

You could try using regular expressions instead of replace to discard the HTML tags:

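import re
from bs4 import BeautifulSoup

# get_text() decodes the entities, so &lt;ul&gt; becomes <ul>
soup = BeautifulSoup(text, 'html.parser')
text = soup.get_text()
# re.sub() accepts a regular expression, so the decoded tags are removed
text = re.sub(r'<.*?>', '', text)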
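
If BeautifulSoup is not available, roughly the same two steps (decode the entities, then drop the tags) can be sketched with the standard library alone. This is a variant and not the code from the answer: it swaps in html.unescape() for the decoding and html.parser.HTMLParser for the tag removal. The names _TextCollector, strip_encoded_tags and sample are hypothetical, and the whitespace normalisation at the end is an assumption about the desired output.

```python
import html
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)


def strip_encoded_tags(encoded):
    # Step 1: turn encoded tags such as &lt;ul&gt; back into <ul>.
    decoded = html.unescape(encoded)
    # Step 2: parse the decoded markup and keep only the text.
    collector = _TextCollector()
    collector.feed(decoded)
    collector.close()
    # Assumption: collapse runs of whitespace and trim the ends.
    return ' '.join(''.join(collector.parts).split())


sample = "&lt;/li&gt;&lt;/ul&gt; TESTING AND QUALITY&lt;ul&gt;&lt;li&gt;"
cleaned = strip_encoded_tags(sample)
print(cleaned)
```

Using a parser instead of a regular expression avoids surprises when an attribute value itself contains a ">" character, at the cost of a few more lines.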

Adam V · Answer · 2019-07-25 22:09 (edited by Zulu, 2019-07-26 01:37:55Z)

0

Try using decode_contents() instead of get_text().
