This is usually not a hard task, but today I can't seem to remove a simple JavaScript tag.

The example I'm working with (formatted):

<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
    window.addEventListener('DOMContentLoaded', function(){
        window.postscribe && postscribe(document.querySelector(".realestate"),
        '<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
    });
</script>

The example I'm working with (raw):

<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>

I would like to remove everything from <script (the beginning of the second line) to </script> (the last line), so that only the first line, <section ...>, remains.

Here's my line of code:

re.sub(r'<script[^</script>]+</script>', '', text)
#or
re.sub(r'<script.+?</script>', '', text)

I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text so no parsing with lxml or similar is needed.

  • You should know this [^</script>] doesn't mean anything except a closing script tag. Commented Feb 13, 2017 at 14:13
  • @glibdud I agree, I was only trying to flag it. meta.stackoverflow.com/q/343643/1561176 Commented Feb 13, 2017 at 14:21
  • I think that you should take a look at this answer to using regex to parse "html" stackoverflow.com/a/1732454/1561176 . Instead you should be using the correct parser, such as BeautifulSoup. crummy.com/software/BeautifulSoup Commented Feb 13, 2017 at 14:26
  • @revo Well, if I knew, I wouldn't be asking. Either way, I read somewhere that it meant "anything except this", and I use it a lot like this: <[^>]+> . Commented Feb 13, 2017 at 14:26
  • @InbarRose That made an impression I won't forget. I don't think my document can be parsed; I think it's a better fit to index the tags manually, group them, and then delete everything in between. Commented Feb 13, 2017 at 14:36

1 Answer

Your first regex didn't work because a character class ([...]) is a set of individual characters, not a string. [^</script>] therefore matches any single character other than <, /, s, c, r, i, p, t, or >, so the pattern only matches if <script and </script> are separated by a run of characters that contains none of those.
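To make the set behaviour concrete, here is a minimal sketch using only the standard re module. The sample strings and variable names are made up for illustration and are not from the question.

```python
import re

# [^</script>] is a negated set of single characters: < / s c r i p t >
negated_set = re.compile(r'[^</script>]')

# None of the characters that spell "</script>" is accepted by the class.
matched = [ch for ch in '</script>' if negated_set.fullmatch(ch)]

# A realistic tag fails: after "<script " the next character is the "t" of
# "type", which is in the excluded set, and "</script>" does not follow yet.
realistic = '<script type="text/javascript">var x = 1;</script>'
first_attempt = re.search(r'<script[^</script>]+</script>', realistic)

# An artificial body made only of characters outside the set does match.
artificial = '<script 123</script>'
second_attempt = re.search(r'<script[^</script>]+</script>', artificial)

print(matched)         # characters from "</script>" the class accepted
print(first_attempt)   # match object or None for the realistic tag
print(second_attempt)  # match object or None for the artificial body
```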

Your second regex is better. It fails only because, by default, the . wildcard does not match newlines. To make it match them, add the DOTALL flag:

re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)
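As a quick check of the difference the flag makes, the sketch below runs the same pattern with and without re.DOTALL on a shortened copy of the snippet from the question. The helper name strip_script_tags and the placeholder loader.js are assumptions made for this example.

```python
import re

def strip_script_tags(text):
    """Remove each <script ...>...</script> span, even across newlines."""
    return re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)

# Shortened version of the raw text from the question (loader.js is a
# placeholder). The inner closing tag is escaped as <\/script>, so the
# pattern's literal </script> does not stop there.
sample = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    "\t\t'<script src=\"loader.js\"><\\/script>');\n"
    '\t});\n'
    '</script>'
)

# Without DOTALL, "." cannot cross a newline, so nothing is removed.
without_flag = re.sub(r'<script.+?</script>', '', sample)

# With DOTALL, the lazy ".+?" runs up to the first literal "</script>".
with_flag = strip_script_tags(sample)

print(repr(without_flag))
print(repr(with_flag))
```

Because the quantifier is lazy, each match ends at the first literal </script> after an opening <script, so several script blocks in one document are removed separately rather than as one span.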
1 Comment

Amazing. Thanks for giving an explanation to why it didn't work!
