
I have a decently sized .tsv file containing documents in the following format

ID  DocType NormalizedName  DisplayName Year    Description
12648   Book    a fancy title   A FaNcY-Title   2005    This is a short description of the book
1867453 Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay
...

The uncompressed file is around 67 GB in size; compressed, it is around 22 GB.

I would like to sort the rows of the file by ID (around 300 million lines) in increasing order. Each row's ID is unique and ranges from 1 to 2147483647 (the positive range of a 32-bit signed integer); there may be gaps.

Unfortunately, I only have at most 8GB of memory available, so I will not be able to load the entire file at once.

What is the most time-efficient way to sort this file and write it back to disk?

  • I would recommend sorting the file in parts and then doing a final merge of the sorted parts. Look at docs.python.org/3.6/library/heapq.html#heapq.merge Commented Jul 9, 2019 at 8:24
  • @AndrejKesely Yes, that was my initial thought, too. But then you have the sorted parts that you need to subdivide, reorganize, and merge/sort again, which will be something like O(n*log(n)*num_parts*num_reorganizations) or so, which seems rather inefficient Commented Jul 9, 2019 at 8:29
  • My thought is to split the file into e.g. 1 GB chunks, sort each chunk and write it to disk, and then f_out.writelines( heapq.merge([chunk1.readlines(), chunk2.readlines(), ...]) ). It should be pretty efficient. Commented Jul 9, 2019 at 8:32
  • A not-so-Pythonic implementation would be moving the data into SQLite and having it automagically sorted via some indexes and "select into" queries. The other option would be reading the file line by line, building up a line-offset index of IDs, and then running a "find by index and copy into some other file" operation using your favorite programming language. I'm aiming at solving the problem, so no efficiency factor considered :) Commented Jul 9, 2019 at 8:35
  • If you're on Linux you may want to use the coreutils sort program instead (gnu.org/software/coreutils/manual/html_node/…); a sketch follows these comments. Commented Jul 9, 2019 at 8:42
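Following up on the coreutils suggestion in the last comment, here is a minimal sketch of driving GNU sort from Python via subprocess. The -t, -k, -S, --parallel, -T, and -o flags are standard GNU sort options, but the concrete buffer size, thread count, and temp directory below are assumptions you would tune for your own machine. GNU sort performs its own external merge sort on disk, so it never needs to hold the whole 67 GB file in memory.

import subprocess

# Sketch: let GNU coreutils sort do the external sorting (Linux only).
# Buffer size, thread count and temp directory are assumptions, not recommendations.
subprocess.run(
    [
        "sort",
        "-t", "\t",            # fields are tab-separated
        "-k", "1,1n",          # numeric sort on the first field (the ID)
        "-S", "4G",            # assumption: cap the in-memory buffer at 4 GB
        "--parallel=4",        # assumption: use 4 sort threads
        "-T", "/mnt/scratch",  # assumption: temp dir on a disk with enough free space
        "-o", "sorted.tsv",    # output file
        "large_file.tsv",
    ],
    check=True,
)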

1 Answer


I made a proof of concept using heapq.merge:

Step 1: generate testing file

Generate a test file containing 300 million rows:

from random import randint

# template row; the IDs are drawn at random (duplicates are possible,
# which does not matter for benchmarking the sort)
row = '{} Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay\n'

with open('large_file.tsv', 'w') as f_out:
    for i in range(300_000_000):
        f_out.write(row.format(randint(1, 2147483647)))

Step 2: split into chunks and sort each chunk

Every chunk has 1 million rows:

import glob

path = "chunk_*.tsv"

chunksize = 1_000_000
fid = 1
lines = []

with open('large_file.tsv', 'r') as f_in:
    f_out = open('chunk_{}.tsv'.format(fid), 'w')
    for line_num, line in enumerate(f_in, 1):
        lines.append(line)
        if line_num % chunksize == 0:
            # sort the chunk numerically by its first column (the ID)
            lines.sort(key=lambda k: int(k.split()[0]))
            f_out.writelines(lines)

            print('splitting', fid)
            f_out.close()
            lines = []
            fid += 1
            f_out = open('chunk_{}.tsv'.format(fid), 'w')

    # last, partially filled chunk
    if lines:
        print('splitting', fid)
        lines.sort(key=lambda k: int(k.split()[0]))
        f_out.writelines(lines)
        lines = []
    # close the final handle (it is an empty file if the row count
    # was an exact multiple of chunksize)
    f_out.close()

Step 3: merge each chunk

from heapq import merge

# open every sorted chunk produced in step 2
chunks = [open(filename, 'r') for filename in glob.glob(path)]

with open('sorted.tsv', 'w') as f_out:
    # heapq.merge consumes the already-sorted chunks lazily, so only one
    # line per chunk is held in memory at any time
    f_out.writelines(merge(*chunks, key=lambda k: int(k.split()[0])))

for chunk in chunks:
    chunk.close()

Timings:

My machine: Ubuntu Linux 18.04, AMD 2400G, cheap WD Green SSD.

Step 2 - splitting and sorting chunks - took ~12 minutes

Step 3 - merging the chunks - took ~10 minutes

I expect these times would be considerably lower on a machine with a faster disk (NVMe?) and CPU.


5 Comments

Nice! Two thoughts: 1) wouldn't it be faster to directly read/write gzipped files? Since the code is probably I/O bound, that should give it more rows/s, right? 2) wouldn't it be cleaner to do print(line, file=f_out) and let Python handle the buffering and writing of the file? (See the gzip sketch after these comments.)
@FirefoxMetzger Yes, the problem is I/O bound. You could try reading from gzip (gzip.open()), doing print() instead of write(), etc. Also play with the size of the chunks...
Great solution, I was looking for something like this. However, I get a problem with your code: the output 'sorted.tsv' file is sorted only up to about half of its content. What might be the cause?
@tikej Check whether the sorted()/merge() key= function is set up correctly, and also check that you get the correct number of chunks and that each chunk is sorted. Otherwise, I don't see a reason why only half of the file would be sorted and the other half not.
@tikej I have now tested the code with 3_500_000 lines and it produced the correct result. Also check that you closed all file objects before doing the merge.
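Regarding the gzip idea in the first comment above, here is a minimal sketch of how step 2 could read the compressed input and write compressed chunks with the standard-library gzip module. The file name large_file.tsv.gz is an assumption, and whether this is actually faster than plain I/O depends on how much CPU the (de)compression costs on your machine.

import gzip

# Assumption: the compressed input is a gzip archive named large_file.tsv.gz.
chunksize = 1_000_000
fid = 1
lines = []

with gzip.open('large_file.tsv.gz', 'rt') as f_in:
    f_out = gzip.open('chunk_{}.tsv.gz'.format(fid), 'wt')
    for line_num, line in enumerate(f_in, 1):
        lines.append(line)
        if line_num % chunksize == 0:
            lines.sort(key=lambda k: int(k.split()[0]))
            f_out.writelines(lines)
            f_out.close()
            lines = []
            fid += 1
            f_out = gzip.open('chunk_{}.tsv.gz'.format(fid), 'wt')
    # last, partially filled chunk
    if lines:
        lines.sort(key=lambda k: int(k.split()[0]))
        f_out.writelines(lines)
    f_out.close()

The merge step would then open each chunk with gzip.open(..., 'rt') instead of open() and write sorted.tsv.gz the same way.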
