I have a decently sized .tsv file containing documents in the following format
ID DocType NormalizedName DisplayName Year Description
12648 Book a fancy title A FaNcY-Title 2005 This is a short description of the book
1867453 Essay on the history of humans On the history of humans 2016 This is another short description, this time of the essay
...
Uncompressed, the file is around 67 GB; compressed, it is around 22 GB.
I would like to sort the rows of the file (around 300 million lines) by ID in increasing order. Each row's ID is unique and ranges from 1 to 2147483647 (the positive range of a signed 32-bit integer); there may be gaps.
Unfortunately, I have at most 8 GB of memory available, so I cannot load the entire file at once.
What is the most time efficient manner to sort this list and write it back to disk?
- Sort chunks that fit in memory, write each sorted chunk to its own file, then stream-merge them: `f_out.writelines(heapq.merge(chunk1, chunk2, ...))` (pass the open chunk files themselves, not `chunk.readlines()`, so the merge stays streaming). It should be pretty efficient.
- Or just use the GNU `sort` program instead (gnu.org/software/coreutils/manual/html_node/…).
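The chunk-and-merge suggestion above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `external_sort`, the chunk size, and the temporary-file handling are assumptions, and it assumes the header row has been set aside before sorting.

```python
import heapq
import itertools
import os
import tempfile

def sort_key(line):
    # The first tab-separated field is the numeric ID.
    return int(line.split("\t", 1)[0])

def external_sort(in_path, out_path, chunk_lines=1_000_000):
    """External merge sort: sort fixed-size chunks in memory, spill each
    sorted run to a temporary file, then k-way merge the runs with
    heapq.merge, which streams line by line and never holds a full run
    in memory. Assumes in_path contains only data rows (no header)."""
    run_paths = []
    with open(in_path) as f_in:
        while True:
            chunk = list(itertools.islice(f_in, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            tmp.writelines(chunk)
            tmp.close()
            run_paths.append(tmp.name)
    run_files = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as f_out:
            # heapq.merge takes the sorted runs as separate arguments.
            f_out.writelines(heapq.merge(*run_files, key=sort_key))
    finally:
        for f in run_files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

With 8 GB of RAM you would pick `chunk_lines` so that one chunk (plus Python's per-line overhead) fits comfortably in memory; the merge phase then only needs one buffered line per run.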