I'm using Node.js to do web scraping.

I have a complex HTML string. It contains a number of HTML tags and a few JavaScript blocks. Each JavaScript block contains function calls with a few parameters, and each parameter is a JSON string. I'm only interested in those JSON strings. What's the best way to extract them?

Sample code:

<html>
    <header>...</header>
    <script>function1(param1:[{a:"V1"},{b:"v2"}],param2:[{c:"v3"},{d:"v4"}])</script> 
    <script>...</script>
    <body>...</body>
</html>

Appreciate your advice.

  • If you learn regular expressions in JavaScript, you should be able to find these strings with only a few lines of code. Commented May 27, 2014 at 8:23
  • Thanks, Trilarion. I'm a bit hesitant to go down the regex path. The script content is completely dynamic: it may contain any number of functions, each function may contain any number of parameters, and each parameter may be an array of any length. 1. I'm worried about the complexity of the regex. 2. Even if such a regex can be written, won't it be too CPU-intensive? If so, it won't be a great choice for Node.js. Commented May 27, 2014 at 10:10
  • Especially if the structure is complex, regex seems like the right tool, since any other solution will have to be somewhat complex as well. The advantage is that you rely on a well-tested and powerful tool. Commented May 27, 2014 at 10:44
  • If you use an XML library to extract the contents of the script tags, you could use something like esprima.org to parse the code and extract the information you need. Commented May 27, 2014 at 11:58

1 Answer

First, parse the HTML with cheerio. This lets you correctly extract the JavaScript text from within the <script> tags using jQuery-style syntax such as $('script').text() (presumably you'll want to loop through all of the script tags). Once you have the JavaScript itself, use esprima to parse it, find all the function calls, and find all the arguments that are literals. These two libraries will work more reliably than hacking something together with regular expressions. Start small, post a code snippet, and come back for help if you get stuck.
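
A rough sketch of those two steps might look like the following. It assumes the cheerio and esprima packages are installed and that each script body is syntactically valid JavaScript (the abbreviated sample in the question, with `param1:` labels inside the call, is not, and esprima would throw on it). The helper names `scriptSources`, `findCalls`, and `callsInHtml` are made up for illustration.

```javascript
// Sketch only. Assumes `npm install cheerio esprima`; helper names are hypothetical.

// Collect the text of every non-empty inline <script> element.
function scriptSources(html) {
  const cheerio = require('cheerio'); // loaded on first use
  const $ = cheerio.load(html);
  return $('script')
    .map((i, el) => $(el).html())
    .get()
    .filter((code) => code && code.trim() !== '');
}

// Walk an ESTree-style syntax tree and collect every CallExpression node,
// outer calls before the calls nested inside them.
function findCalls(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => findCalls(child, found));
  } else if (node && typeof node === 'object') {
    if (node.type === 'CallExpression') found.push(node);
    Object.values(node).forEach((child) => findCalls(child, found));
  }
  return found;
}

// Tie the steps together: one summary entry per function call found in the page.
function callsInHtml(html) {
  const esprima = require('esprima'); // loaded on first use
  return scriptSources(html).flatMap((code) =>
    findCalls(esprima.parseScript(code)).map((call) => ({
      callee: call.callee.name, // undefined for member calls such as obj.fn()
      argumentTypes: call.arguments.map((arg) => arg.type),
      literalValues: call.arguments
        .filter((arg) => arg.type === 'Literal')
        .map((arg) => arg.value),
    }))
  );
}
```

Note that in the syntax tree an argument like `[{a: "V1"}]` is an `ArrayExpression` rather than a `Literal`, so filtering on `Literal` alone only catches plain strings, numbers, and similar values.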

3 Comments

Peter, this looks cool! I was already working with cheerio and jQuery before I saw your post, and was pulling my hair out trying to figure out a smart way to parse the JS calls! Will explore esprima and update. : )
Peter, here is what I did. 1. Use esprima to parse the function and identify the parameter names. 2. Use the String.js package to take the substring between two parameters to get the actual JSON string. I'm doing this because esprima parses the JSON into its own format as well, and it's pretty tedious to convert it back to the original JSON format. Let me know if you think there is a better way. Thanks!
Nice. Would be great if you would post a snippet so others can see your solution.
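
No snippet was posted, but one possible alternative to taking substrings between parameter names is to ask esprima for source offsets. With the `range` option, each node carries `range: [start, end]`, so the original text of every argument can be sliced straight out of the source. The sketch below is an assumption about how that could look, not the commenter's actual code; `argumentSources` and `argumentSourcesFromCode` are hypothetical names, and it again assumes the script text is valid JavaScript.

```javascript
// Sketch only. Works on any ESTree-style tree whose nodes carry `range: [start, end]`,
// such as the one esprima produces when called with { range: true }.

// For every call in the tree, return the callee text and the untouched
// source text of each argument.
function argumentSources(source, ast) {
  const results = [];
  (function visit(node) {
    if (Array.isArray(node)) {
      node.forEach((child) => visit(child));
      return;
    }
    if (!node || typeof node !== 'object') return;
    if (node.type === 'CallExpression') {
      results.push({
        callee: source.slice(node.callee.range[0], node.callee.range[1]),
        args: node.arguments.map((arg) => source.slice(arg.range[0], arg.range[1])),
      });
    }
    Object.values(node).forEach((child) => visit(child));
  })(ast);
  return results;
}

// Convenience wrapper; assumes the `esprima` package is installed.
function argumentSourcesFromCode(source) {
  const esprima = require('esprima'); // loaded on first use
  return argumentSources(source, esprima.parseScript(source, { range: true }));
}
```

Each sliced argument can be passed to `JSON.parse` only if it is strict JSON with quoted keys. Text like `[{a:"V1"}]` from the question is a JavaScript literal rather than strict JSON, so it would need a more lenient parser.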
