
I found this promising code on activestate.com for sorting huge files. I'm trying to run it with the default Python 2.6.5 interpreter on Ubuntu 10.04. When I run it on a small test file, I get the error trace below. I asked for help on activestate.com, but that thread has been silent for over 18 months. Does anyone here see an obvious solution?

Thanks.

# http://code.activestate.com/recipes/576755/ (r3)
# based on Recipe 466302: Sorting big files the Python 2.4 way
# by Nicolas Lehuen

import os
from tempfile import gettempdir
from itertools import islice, cycle
from collections import namedtuple
import heapq

Keyed = namedtuple("Keyed", ["key", "obj"])

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d

    if key is None:
        keyed_iterables = iterables
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                            for iterable in iterables]

    for element in heapq.merge(*keyed_iterables):
        yield element.obj


def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    if tempdirs is None:
        tempdirs = []
    if not tempdirs:
        tempdirs.append(gettempdir())

    chunks = []
    try:
        with open(input,'rb',64*1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator,buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        with open(output,'wb',64*1024) as output_file:
            output_file.writelines(merge(key, *chunks))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass
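
For reference, a minimal call that reproduces the trace below (the file names are made up; note that no key function is passed, which turns out to matter):

# Hypothetical reproduction: calling batch_sort without a key.
batch_sort('test_input.txt', 'test_output.txt')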

Error trace:

Traceback (most recent call last):
  File "./batch_sort.py", line 108, in <module>
    batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs)
  File "./batch_sort.py", line 54, in batch_sort
    output_file.writelines(merge(key, *chunks))
  File "./batch_sort.py", line 30, in merge
    yield element.obj
AttributeError: 'str' object has no attribute 'obj'
  • You're unclear about what "huge" means, so I'll take it to mean really huge. If you are genuinely sorting huge files, you probably don't want to use Python for it: its interpreted nature, coupled with dynamic storage allocation, is likely to make this slow. Go find a standalone sort utility; those are designed to sort large amounts of data as fast as possible. Commented May 19, 2012 at 14:19
  • Good question. I define "huge" as a UTF-8 file with 14 million or more lines, each averaging 175 characters, totaling between 2.5 and 7.5 GB (many of the files consist entirely of 3-byte UTF-8 characters). The alternative is to call Linux sort from a bash script/terminal. The performance of the older version of this code is okay, but this one is supposed to be faster. Commented May 19, 2012 at 14:29
  • One may want to check this lib, which implements external sorting in pure Python. Commented May 27, 2019 at 14:27

1 Answer


The code for merge is incorrect: if you don't provide a key, heapq.merge yields the raw elements, so each element is a plain string rather than a Keyed tuple, and element.obj raises the AttributeError you're seeing.

Try this instead:

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d

    if key is None:
        for element in heapq.merge(*iterables):
            yield element
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
        for element in heapq.merge(*keyed_iterables):
            yield element.obj
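
A quick way to sanity-check the fix (the sample data and key function here are invented for illustration; merge and Keyed are the definitions from the recipe above):

# Hypothetical smoke test for the fixed merge(); the data is made up.
a = iter(['apple\n', 'cherry\n'])
b = iter(['banana\n', 'date\n'])
print list(merge(None, a, b))       # no key: plain strings merge directly

c = iter(['APPLE\n', 'cherry\n'])
d = iter(['Banana\n', 'date\n'])
print list(merge(str.lower, c, d))  # keyed: wrapped in Keyed, unwrapped on yield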

3 Comments

@tahoar I am using the same script to sort a huge file. When I run it, I get an error on line 51, with open(output,'wb',64*1024) as output_file: output_file.writelines(merge(key, *chunks)): ValueError: I/O operation on closed file. Have you seen this error? The sort worked fine for a small file, though!
@Think, the truth is, I abandoned this effort. I encountered new problems each time I moved up to larger file sizes. Since I only need this functionality on Linux, my final solution uses Python's subprocess.Popen() to call the Linux sort utility, and all my problems disappeared (see the sketch after these comments). Sorry I can't help further.
@tahoar That's exactly the right approach. GNU sort is very sophisticated at making the most of the available hardware (memory and CPU). Once your data no longer fits in memory, GNU sort is hard to beat from pure Python.
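
For completeness, here is a minimal sketch of that subprocess route (the helper name, the -T temp-directory flag, and the LC_ALL=C setting are illustrative choices, not something from this thread):

import os
import subprocess

def external_sort(input_path, output_path, tmp_dir='/tmp'):
    # Hypothetical helper: delegate the work to GNU sort, which spills
    # to temporary files in tmp_dir on its own once memory runs out.
    env = os.environ.copy()
    env['LC_ALL'] = 'C'  # bytewise collation: fast, and well-defined for UTF-8
    with open(output_path, 'wb') as out:
        subprocess.check_call(['sort', '-T', tmp_dir, input_path],
                              stdout=out, env=env)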
