0

I have a JSON file with a "description" key whose value contains lots of HTML tags, and I would like to remove them. The tags are entity-encoded, for example &lt;ul&gt; instead of <ul>.

I've tried doing text.replace('<.*?>',''), but it doesn't work.
I've also tried with BeautifulSoup doing:

text = soup.get_text()

But that doesn't work either (it only decodes the HTML tags). Finally, I've tried:

soup = BeautifulSoup(text)
text = soup.get_text()
text = text.replace('<.*?>','')

That combines the two approaches, but the tags still don't get deleted.

What I have now in the text variable (after BeautifulSoup decodes the HTML tags):
"description":"</li></ul><p> </p><p><strong>TESTING AND QUALITY</strong></p><ul><li>....."

What I want to have in the text variable:
"description":"TESTING AND QUALITY"

  • Your code doesn't work because text.replace() doesn't recognize regular expressions. It's looking for the literal text <.*?>, which of course isn't there. Commented Jul 25, 2019 at 21:55
  • This might be what you're looking for - stackoverflow.com/questions/9662346/… Commented Jul 25, 2019 at 22:23
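
To illustrate the first comment, here is a minimal sketch of the difference between str.replace() and re.sub(). The sample string is an assumption based on the already-decoded text shown in the question; the variable names are only illustrative.

```python
import re

# Assumed sample: already-decoded HTML, similar to the question's output
text = "<p><strong>TESTING AND QUALITY</strong></p>"

# str.replace() searches for the literal characters "<.*?>",
# which never occur in the text, so nothing is replaced.
unchanged = text.replace('<.*?>', '')

# re.sub() interprets the first argument as a regular expression,
# so every "<...>" run is removed.
stripped = re.sub(r'<.*?>', '', text)

print(unchanged == text)
print(stripped)
```

The first call leaves the string untouched, while the second strips the tags and leaves only the text between them.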

2 Answers

1

You could use re.sub() instead of str.replace(), which does not understand regular expressions, to discard the HTML tags:

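import re
from bs4 import BeautifulSoup

# Parsing the entity-encoded text turns &lt;ul&gt; into the literal string <ul>
soup = BeautifulSoup(text, 'html.parser')
text = soup.get_text()
# Remove the now-literal tags with a real regular expression
text = re.sub(r'<.*?>', '', text)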
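
If installing BeautifulSoup is not an option, the same two steps (decode the entities, then strip the tags) can be sketched with the standard library alone. This is a swapped-in variant using html.unescape(), not the code from the answer above; the helper name strip_encoded_tags and the sample string are hypothetical, and a regex-based strip is only a rough approach that may misbehave on unusual markup (for instance, a ">" inside an attribute value).

```python
import html
import re

def strip_encoded_tags(value):
    """Decode HTML entities, then drop anything that looks like a tag.

    Hypothetical helper for illustration; not part of any library.
    """
    decoded = html.unescape(value)               # "&lt;ul&gt;" -> "<ul>"
    no_tags = re.sub(r'<[^>]*>', '', decoded)    # remove "<...>" runs
    return ' '.join(no_tags.split())             # collapse leftover whitespace

# Assumed sample, modeled on the "description" value in the question
sample = ("&lt;/li&gt;&lt;/ul&gt;&lt;p&gt; &lt;/p&gt;"
          "&lt;p&gt;&lt;strong&gt;TESTING AND QUALITY&lt;/strong&gt;&lt;/p&gt;")

result = strip_encoded_tags(sample)
print(result)
```

Collapsing whitespace at the end is a design choice: empty paragraphs such as <p> </p> otherwise leave stray spaces around the remaining text.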

0

Try using decode_contents() instead
