
I have a decently sized .tsv file containing documents in the following format

ID  DocType NormalizedName  DisplayName Year    Description
12648   Book    a fancy title   A FaNcY-Title   2005    This is a short description of the book
1867453 Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay
...

The uncompressed file is around 67 GB in size; compressed, it is around 22 GB.

I would like to sort the rows of the file by ID (around 300 million lines) in increasing order. Each row's ID is unique and ranges from 1 to 2147483647 (the positive range of a 32-bit signed integer); there may be gaps.

Unfortunately, I only have at most 8GB of memory available, so I will not be able to load the entire file at once.

What is the most time-efficient way to sort this file and write it back to disk?

  • I would recommend sorting the file in parts and then doing a final merge of the sorted parts. Look at docs.python.org/3.6/library/heapq.html#heapq.merge Commented Jul 9, 2019 at 8:24
  • @AndrejKesely Yes, that was my initial thought, too. But then you have the sorted parts that you need to subdivide, reorganize, and merge/sort again, which will be something like O(n*log(n)*num_parts*num_reorganizations) or so, which seems rather inefficient Commented Jul 9, 2019 at 8:29
  • My thought is to split the file into e.g. 1 GB chunks, sort each chunk and write it to disk, and then f_out.writelines( heapq.merge([chunk1.readlines(), chunk2.readlines(), ...]) ). It should be pretty efficient. Commented Jul 9, 2019 at 8:32
  • A not-so-Pythonic implementation would be moving the data into SQLite and having it automagically sorted via some indexes and "select into" queries. The other option would be reading the file line by line, building up a line-offset index of IDs, and then running a "find by index and copy into some other file" operation using your favorite programming language. I'm aiming at solving the problem, so no efficiency factor considered :) Commented Jul 9, 2019 at 8:35
  • If you're on Linux you may want to use the coreutils sort program instead (gnu.org/software/coreutils/manual/html_node/…); a sketch follows these comments. Commented Jul 9, 2019 at 8:42
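Following up on the coreutils suggestion in the last comment, here is a minimal sketch of driving GNU sort from Python via subprocess. The -t, -k, -S, --parallel, -T, and -o flags are standard GNU sort options, but the concrete buffer size, thread count, and temp directory below are assumptions you would tune for your own machine. GNU sort performs its own external merge sort on disk, so it never needs to hold the whole 67 GB file in memory.

import subprocess

# Sketch: let GNU coreutils sort do the external sorting (Linux only).
# Buffer size, thread count and temp directory are assumptions, not recommendations.
subprocess.run(
    [
        "sort",
        "-t", "\t",            # fields are tab-separated
        "-k", "1,1n",          # numeric sort on the first field (the ID)
        "-S", "4G",            # assumption: cap the in-memory buffer at 4 GB
        "--parallel=4",        # assumption: use 4 sort threads
        "-T", "/mnt/scratch",  # assumption: temp dir on a disk with enough free space
        "-o", "sorted.tsv",    # output file
        "large_file.tsv",
    ],
    check=True,
)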

1 Answer


I made a proof of concept using heapq.merge:

Step 1: generate testing file

Generate a test file containing 300 million rows:

from random import randint

# template row; the IDs are drawn at random (duplicates are possible,
# which does not matter for benchmarking the sort)
row = '{} Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay\n'

with open('large_file.tsv', 'w') as f_out:
    for i in range(300_000_000):
        f_out.write(row.format(randint(1, 2147483647)))

Step 2: split into chunks and sort each chunk

Every chunk has 1 million rows:

import glob

path = "chunk_*.tsv"

chunksize = 1_000_000
fid = 1
lines = []

with open('large_file.tsv', 'r') as f_in:
    f_out = open('chunk_{}.tsv'.format(fid), 'w')
    for line_num, line in enumerate(f_in, 1):
        lines.append(line)
        if line_num % chunksize == 0:
            # sort the chunk numerically by its first column (the ID)
            lines.sort(key=lambda k: int(k.split()[0]))
            f_out.writelines(lines)

            print('splitting', fid)
            f_out.close()
            lines = []
            fid += 1
            f_out = open('chunk_{}.tsv'.format(fid), 'w')

    # last, partially filled chunk
    if lines:
        print('splitting', fid)
        lines.sort(key=lambda k: int(k.split()[0]))
        f_out.writelines(lines)
        lines = []
    # close the final handle (it is an empty file if the row count
    # was an exact multiple of chunksize)
    f_out.close()

Step 3: merge each chunk

from heapq import merge

# open every sorted chunk produced in step 2
chunks = [open(filename, 'r') for filename in glob.glob(path)]

with open('sorted.tsv', 'w') as f_out:
    # heapq.merge consumes the already-sorted chunks lazily, so only one
    # line per chunk is held in memory at any time
    f_out.writelines(merge(*chunks, key=lambda k: int(k.split()[0])))

for chunk in chunks:
    chunk.close()

Timings:

My machine: Ubuntu Linux 18.04, AMD 2400G, cheap WD Green SSD.

Step 2 - splitting and sorting chunks - took ~12 minutes

Step 3 - merging the chunks - took ~10 minutes

I expect these times would be considerably lower on a machine with a faster disk (NVMe?) and CPU.


5 Comments

Nice! Two thoughts: 1) wouldn't it be faster to directly read/write gzipped files? Since the code is probably I/O bound, that should give it more rows/s, right? 2) wouldn't it be cleaner to do print(line, file=f_out) and let Python handle the buffering and writing of the file? (See the gzip sketch after these comments.)
@FirefoxMetzger Yes, the problem is I/O bound. You could try reading from gzip (gzip.open()), doing print() instead of write(), etc. Also play with the size of the chunks...
Great solution, I was looking for something like this. However, I get a problem with your code: the output 'sorted.tsv' file is sorted only up to about half of its content. What might be the cause?
@tikej Check whether the sorted()/merge() key= function is set up correctly, and also check that you get the correct number of chunks and that each chunk is sorted. Otherwise, I don't see a reason why only half of the file would be sorted and the other half not.
@tikej I have now tested the code with 3_500_000 lines and it produced the correct result. Also check that you closed all file objects before doing the merge.
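Regarding the gzip idea in the first comment above, here is a minimal sketch of how step 2 could read the compressed input and write compressed chunks with the standard-library gzip module. The file name large_file.tsv.gz is an assumption, and whether this is actually faster than plain I/O depends on how much CPU the (de)compression costs on your machine.

import gzip

# Assumption: the compressed input is a gzip archive named large_file.tsv.gz.
chunksize = 1_000_000
fid = 1
lines = []

with gzip.open('large_file.tsv.gz', 'rt') as f_in:
    f_out = gzip.open('chunk_{}.tsv.gz'.format(fid), 'wt')
    for line_num, line in enumerate(f_in, 1):
        lines.append(line)
        if line_num % chunksize == 0:
            lines.sort(key=lambda k: int(k.split()[0]))
            f_out.writelines(lines)
            f_out.close()
            lines = []
            fid += 1
            f_out = gzip.open('chunk_{}.tsv.gz'.format(fid), 'wt')
    # last, partially filled chunk
    if lines:
        lines.sort(key=lambda k: int(k.split()[0]))
        f_out.writelines(lines)
    f_out.close()

The merge step would then open each chunk with gzip.open(..., 'rt') instead of open() and write sorted.tsv.gz the same way.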
