
I wrote a little Node.js script that scrapes data from a website, iterating through pages to extract structured data.

The data I extract from each page takes the form of an array of objects.

I thought I could use the fs.createWriteStream() method to create a writable stream to which I could write the data incrementally after each page extraction.

Apparently, you can only write a String or a Buffer to the stream, so I'm doing something like this:

output.write(JSON.stringify(operations, null, 2));
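In context, the loop looks roughly like this (simplified here with an in-memory sink and a made-up `scrapePage()` in place of my real stream and extraction code):

```javascript
// in-memory stand-in for the write stream, so the concatenated output is easy to inspect
const chunks = [];
const output = { write: (s) => chunks.push(s) };

// made-up extraction function: returns an array of objects for a given page
function scrapePage(page) {
  return [{ page, id: 1 }, { page, id: 2 }];
}

for (let page = 1; page <= 2; page++) {
  const operations = scrapePage(page);
  // each call appends a complete JSON array -- this is where it goes wrong
  output.write(JSON.stringify(operations, null, 2));
}

// the chunks join into "...][..." which is not valid JSON
const result = chunks.join('');
```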

But in the end, once I close the stream, the JSON is malformed because, obviously, I've just appended each page's array one after the other, resulting in something like this:

[
    { ... },  /* data for page 1 */
    { ... }
][ /* => here is the problem */
    { ... },  /* data for page 2 */
    { ... }
]

How could I proceed to actually append the arrays into the output instead of chaining them? Is it even doable?

  • Is there a reason you're not writing it all in one go, after processing all pages? Anything else seems a little hacky. Commented Jan 25, 2018 at 12:41
  • It's because the entire array is pretty big in the end, so keeping it in memory could end up impacting performance I guess (not sure, it's just a pet project for me to learn how streams work) Commented Jan 25, 2018 at 13:03
  • The issue is that there is no way to have partial (valid) json data. It's not really something that can be streamed in any meaningful way. If the array is so large that it might impact performance, then the final resulting json file will be too large to be effectively consumed anyway. Commented Jan 25, 2018 at 13:12
  • If you really need to output progressively, then just output each "operation" object individually and write out the square brackets and commas as required. Commented Jan 25, 2018 at 13:16
  • Alright, so far I have a few possibilities to try: 1) write the file entirely at the end of the process, 2) build up the JSON manually as a string when writing to the stream (as you suggested), and 3) as a last resort, write one file per page. Thanks for your help, I will dig into these solutions and find out which one works best ;) Commented Jan 25, 2018 at 13:22

1 Answer


Your options would be...

  1. Keep the full array in memory and only write to the JSON file at the end, after processing all pages.
  2. Write each object individually, and handle the square brackets and commas manually.

Something like this...

// more_data_to_read(), get_operation_object() and is_last_page()
// are placeholders for your own paging logic

// start processing: open the json array
output.write('[');

// loop through your pages, however you're doing that
while (more_data_to_read()) {
    // create "operation" object for the current page
    const operation = get_operation_object();
    output.write(JSON.stringify(operation, null, 2));
    if (!is_last_page()) {
        // write out comma to separate operation objects within the array
        output.write(',');
    }
}

// all done, close the json array
output.write(']');

This will create well-formed JSON.
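For instance, with the placeholders replaced by in-memory test data (the pages and the string accumulation are invented for illustration), the same pattern yields output that parses cleanly:

```javascript
// invented test data: one "operation" object per page
const pages = [{ id: 1 }, { id: 2 }, { id: 3 }];

// accumulate into a string here; in the real script these would be output.write() calls
let json = '';
json += '[';
pages.forEach((operation, i) => {
  json += JSON.stringify(operation, null, 2);
  if (i < pages.length - 1) {
    // comma separates operation objects within the array
    json += ',';
  }
});
json += ']';
```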

Personally, I would opt for #1 though, as it seems the more 'correct' way to do it. If you're concerned about the array using too much memory, then JSON may not be the best choice for the data file. It's not particularly well suited to extremely large datasets.

In the code sample above, if the process gets interrupted partway through, you'll be left with an invalid JSON file, so writing progressively won't actually make the application more fault-tolerant.


2 Comments

Thanks, I ended up writing the file at the end of the processing (option 1), which is indeed the safest thing to do. It seems to be performing well for now, so the lesson here is not to over-optimize too early :)
The 1st approach doesn't work if the array is huge. In my case it had more than 1.5 million objects, and JSON.stringify threw a JavaScript heap out of memory error.
