1

I am building a scraper using Node.js and Puppeteer. Puppeteer gets the main content of a page and saves it as a string, Rss Parser converts it to an RSS feed, and the resulting XML is written to a file containing the scraped content. The problem is that if the scraped content contains script elements, such as AdSense code, they are scraped as well. I need a simple regex that removes every script element along with all of its attributes and everything between its opening and closing tags.

I have been looking for a simple example that will allow me to do something like:

var content = scrapedcontent;
content = content.replace(myregex, '');

I cannot find an example that works for me. So far the closest things I've found suggest using jQuery. I cannot use jQuery because this is a Node.js project that does not include the jQuery library and I do not want to add jQuery just to strip scripts out of strings.

Also, please do not respond with lectures about what regexes and their characters mean. That is all lorem ipsum to me. I just need to find something that says "this is the regex, this is what it does, copy and paste it and you will be done."

10
  • 4
    Re: your last paragraph, God forbid we ask you to learn anything or do any of the work.... Commented Mar 20, 2021 at 1:29
  • 1
    I don't see why you wouldn't want to use jQuery. It's perfect for things like this, since you're manipulating the DOM, and your reasoning of "I don't want to add jQuery" is amusing when it's a sensible solution to your problem. Using regex to do things with HTML is generally considered a bad idea anyway. I found something that might be of use to you, though. Commented Mar 20, 2021 at 2:04
  • 2
    You should probably know that <script> isn't the only way to add JavaScript to HTML. Commented Mar 20, 2021 at 2:12
  • 2
    Anything that could be written using jQuery can also be written without jQuery. What's the link to the jQuery example? Commented Mar 20, 2021 at 2:13
  • This seems to work for the opening tag and the attributes that follow it, but not for the closing tag: /<script[^>]*>/g Commented Mar 20, 2021 at 2:45

1 Answer

2

Use https://www.npmjs.com/package/cherio

Implementation of core jQuery designed specifically for the server.

Select the elements jQuery-style and remove them:

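const cheerio = require('cherio');
const $ = cheerio.load(scrapedcontent); // parse the scraped HTML string
$('.abc').remove(); // your selector; 'script' would match every <script> element
const newHtml = $.html(); // serialize the cleaned document back to a string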
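
If adding a dependency is not an option, the regex approach the question asks for might look roughly like the sketch below. It assumes every script element has both an opening and a closing tag and that no attribute value contains a `>` character. The names `SCRIPT_ELEMENT` and `stripScripts` and the sample string are hypothetical and used only for illustration. It does not touch inline event handlers or other ways of embedding JavaScript, so an HTML parser is likely the more robust choice.

```javascript
// Hypothetical helper for illustration, not part of any library.
// The pattern matches an opening <script ...> tag with any attributes,
// everything up to the nearest closing tag, and the closing </script> itself.
// Flags: g = replace every occurrence, i = ignore case (<SCRIPT> also matches).
const SCRIPT_ELEMENT = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

function stripScripts(html) {
  return html.replace(SCRIPT_ELEMENT, '');
}

// Sample input standing in for the string returned by the scraper.
const scrapedcontent =
  '<p>Hello</p><script async src="ads.js"></script>' +
  '<SCRIPT>\n  var x = 1;\n</SCRIPT><p>World</p>';

console.log(stripScripts(scrapedcontent)); // <p>Hello</p><p>World</p>
```

In the question's own terms, `myregex` would be the `SCRIPT_ELEMENT` pattern and the second argument to `replace` stays an empty string.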

2 Comments

My question specifically said "this is a Node.js project that does not include the jQuery library and I do not want to add jQuery just to strip scripts out of strings." Yet the first answer I get involves exactly what I asked anyone answering not to do. I am not looking for a way to import jQuery just to strip script tags. I am looking for a regex that works with regular JavaScript.
@PostAlmostAnything - There is no jQuery in this answer. This is the cheerio library, which has its own jQuery-like functionality designed for server use on parsed HTML content.
