HTML-parser on Node.js [closed]

Question

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 10 years ago.

Is there something like Ruby's Nokogiri for Node.js? I mean a user-friendly HTML parser.

I've seen some parsers on the Node.js modules page, but I can't find anything that is both pleasant to use and recently maintained.

What do you mean by "friendly"? Convenient to work and select nodes with, like Nokogiri's XPath and CSS selector support? Amenable to parsing invalid "tag soup" HTML?
– Phrogz, Commented Nov 2, 2011 at 15:37

Farid Nouri Neshat · Accepted Answer · 2022-08-15 09:34:04Z

466

If you want to build a DOM, you can use jsdom.

There's also cheerio, which has a jQuery-like interface and is a lot faster than older versions of jsdom, although these days the two are similar in performance.

You might also want to look at htmlparser2, a streaming parser that, according to its own benchmark, is faster than the others and builds no DOM by default. It can still produce a DOM, because it is bundled with a handler that creates one. This is the parser used by cheerio.
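To make the "streaming parser plus optional DOM handler" idea concrete, here is a minimal, dependency-free sketch. It is a toy, not htmlparser2's actual API: the names `Handler`, `parseEvents`, `TagCounter`, and `TreeBuilder` are all hypothetical, and the regex tokenizer only copes with simple, well-formed markup (no comments, void or self-closing tags, or attributes containing `>`). It only shows why keeping the event source separate from the tree builder lets you skip the cost of a DOM when you don't need one.

```typescript
// Toy illustration only: NOT htmlparser2's API and not a real HTML parser.

// Callbacks the tokenizer invokes as it walks the input.
interface Handler {
  onOpenTag(name: string): void;
  onText(text: string): void;
  onCloseTag(name: string): void;
}

// Tokenizer: scans the input once and reports events; it never stores a tree.
function parseEvents(html: string, handler: Handler): void {
  // Alternatives, in order: closing tag, opening tag, run of text.
  const tokenPattern = /<\/([a-zA-Z][a-zA-Z0-9]*)\s*>|<([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(html)) !== null) {
    if (match[1] !== undefined) handler.onCloseTag(match[1].toLowerCase());
    else if (match[2] !== undefined) handler.onOpenTag(match[2].toLowerCase());
    else if (match[3] !== undefined) handler.onText(match[3]);
  }
}

// Handler 1: counts opening tags and allocates no tree at all.
class TagCounter implements Handler {
  counts = new Map<string, number>();
  onOpenTag(name: string): void {
    this.counts.set(name, (this.counts.get(name) ?? 0) + 1);
  }
  onText(): void {}
  onCloseTag(): void {}
}

// Handler 2: builds a simple tree from the same event stream.
interface TreeNode {
  name: string;
  text: string;
  children: TreeNode[];
}

class TreeBuilder implements Handler {
  root: TreeNode = { name: "#root", text: "", children: [] };
  private stack: TreeNode[] = [this.root];

  onOpenTag(name: string): void {
    const node: TreeNode = { name, text: "", children: [] };
    this.stack[this.stack.length - 1].children.push(node);
    this.stack.push(node);
  }
  onText(text: string): void {
    this.stack[this.stack.length - 1].text += text;
  }
  onCloseTag(name: string): void {
    // Pop only when the closing tag matches the open element; ignore strays.
    if (this.stack.length > 1 && this.stack[this.stack.length - 1].name === name) {
      this.stack.pop();
    }
  }
}

const sample = "<ul><li>jsdom</li><li>cheerio</li></ul>";

// Same input, two consumers: one without a tree, one with.
const counter = new TagCounter();
parseEvents(sample, counter);

const builder = new TreeBuilder();
parseEvents(sample, builder);
const items = builder.root.children[0].children.map((li) => li.text);

console.log(counter.counts.get("li"), items);
```

With a real library you would swap the toy tokenizer for the library's parser and its bundled DOM-building handler; the split between "emit events" and "build a tree" is the part this sketch is meant to show.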

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.

If the website you're trying to scrape is dynamic, you should use a headless browser like phantomjs. If you're considering phantomjs, also have a look at casperjs, which you can control from node with SpookyJS.

Besides phantomjs, there's zombiejs. Unlike phantomjs, which cannot be embedded in nodejs, zombiejs is just a node module.

There's a nettuts+ tutorial for the latter solutions.

edited Aug 15, 2022 at 9:34

answered Nov 2, 2011 at 9:27


6 Comments

esp Over a year ago

You can get a DOM from htmlparser2 using the DomHandler module (bundled with htmlparser2). They are separated on purpose to allow other kinds of HTML processing without the overhead of creating a DOM.

Farid Nouri Neshat Over a year ago

@esp Thanks. I previously thought it was a non-standard DOM; I changed that section accordingly.

dardenfall Over a year ago

I'm not sure how you would use YQL for crawling - it's more for joining web service results, not processing markup.

Farid Nouri Neshat Over a year ago

@dardenfall You are right, crawling is not the right term. I changed it to scraping :)

dardenfall Over a year ago

@Farid - (would've just messaged you if I could) at the risk of debating in comments (sorry!), I still don't see how you use it for scraping. It works with web services, not sites, and with web services you're rarely parsing HTML. Maybe XML, but not HTML.


thejh · Answer · 2011-11-02 09:24:23Z

17

Try https://github.com/tmpvar/jsdom - you give it some HTML and it gives you a DOM.

answered Nov 2, 2011 at 9:24


png · Answer · 2015-02-06 16:40:13Z

6

You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray

answered Feb 6, 2015 at 16:40
