29

I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of current tracks and build my own HTML5 app for listening on mobile devices.

4 Answers

59

I have done this a lot. You'll want to use PhantomJS if the website that you're scraping relies heavily on JavaScript. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches, and this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: "Quick and Dirty Screen Scraping with Node.js". But, again, if the site is JavaScript-intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function(err, resp, body){
  // Bail out on network errors before trying to parse the body
  if (err) return console.error(err);
  // Declare with var so $ and links don't leak as implicit globals
  var $ = cheerio.load(body);
  var links = $('.sb_tlst h3 a'); //use your CSS selector here
  links.each(function(i, link){
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});

4 Comments

I have a question for you, Richardson. I really hope PhantomJS can do what I have in mind: is it possible to interact with a site on a different domain, for example to log in and post a thread on a forum? I'd like to see something like this C# sample: stackoverflow.com/questions/14000185/…
@jp-richardson is this answer still valid?
@UladzimirHavenchyk yes, these are still my preferred methods.
What about using Cheerio with PhantomJS + CasperJS? Then you get a faster jQuery (because all you need is to scrape, not mutate the DOM) and browser-side JavaScript! Or would it be better to just embed jQuery all the time?
4

You could try PhantomJS. Here's the documentation for using it for screen scraping.

3 Comments

Is it fast? I think loading a full WebKit engine is too heavy.
I'm afraid I haven't used it myself, sorry.
PhantomJS is slow, relatively speaking that is.
3

I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.

0

If you want to go with PhantomJS, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go with PhantomJS, because it does more than you usually need and Cheerio is faster.
