
I found this promising code on activestate.com for sorting huge files. I'm trying to run it with the default Python 2.6.5 interpreter on Ubuntu 10.04. When I run it on a small test file, I get the error trace below. I asked for help on activestate.com, but that thread has been silent for over 18 months. Does anyone here see an obvious solution?

Thanks.

# http://code.activestate.com/recipes/576755/ (r3)
# based on Recipe 466302: Sorting big files the Python 2.4 way
# by Nicolas Lehuen

import os
from tempfile import gettempdir
from itertools import islice, cycle
from collections import namedtuple
import heapq

Keyed = namedtuple("Keyed", ["key", "obj"])

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d

    if key is None:
        keyed_iterables = iterables
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                            for iterable in iterables]

    for element in heapq.merge(*keyed_iterables):
        yield element.obj


def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    if tempdirs is None:
        tempdirs = []
    if not tempdirs:
        tempdirs.append(gettempdir())

    chunks = []
    try:
        with open(input,'rb',64*1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator,buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        with open(output,'wb',64*1024) as output_file:
            output_file.writelines(merge(key, *chunks))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass
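
For reference, a minimal call that reproduces the trace below (the file names are made up; note that no key function is passed, which turns out to matter):

# Hypothetical reproduction: calling batch_sort without a key.
batch_sort('test_input.txt', 'test_output.txt')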

Error trace:

Traceback (most recent call last):
  File "./batch_sort.py", line 108, in <module>
    batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs)
  File "./batch_sort.py", line 54, in batch_sort
    output_file.writelines(merge(key, *chunks))
  File "./batch_sort.py", line 30, in merge
    yield element.obj
AttributeError: 'str' object has no attribute 'obj'
  • You're unclear about what "huge" means, so I'll take it to mean really huge. If you are genuinely sorting huge files, you probably don't want to use Python for it: its interpreted nature, coupled with dynamic storage allocation, is likely to make this slow. Go find a standalone sort utility; those are designed to sort large amounts of data as fast as possible. Commented May 19, 2012 at 14:19
  • Good question. I define "huge" as a UTF-8 file with 14 million or more lines, each averaging 175 characters, totaling between 2.5 and 7.5 GB (many of the files consist entirely of 3-byte UTF-8 characters). The alternative is to call Linux sort from a bash script/terminal. The performance of the older version of this code is okay, but this one is supposed to be faster. Commented May 19, 2012 at 14:29
  • One may want to check this lib, which implements external sorting in pure Python. Commented May 27, 2019 at 14:27

1 Answer


The code for merge is incorrect: if you don't provide a key, heapq.merge yields the raw elements, so each element is a plain string rather than a Keyed tuple, and element.obj raises the AttributeError you're seeing.

Try this instead:

def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d

    if key is None:
        for element in heapq.merge(*iterables):
            yield element
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
        for element in heapq.merge(*keyed_iterables):
            yield element.obj
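
A quick way to sanity-check the fix (the sample data and key function here are invented for illustration; merge and Keyed are the definitions from the recipe above):

# Hypothetical smoke test for the fixed merge(); the data is made up.
a = iter(['apple\n', 'cherry\n'])
b = iter(['banana\n', 'date\n'])
print list(merge(None, a, b))       # no key: plain strings merge directly

c = iter(['APPLE\n', 'cherry\n'])
d = iter(['Banana\n', 'date\n'])
print list(merge(str.lower, c, d))  # keyed: wrapped in Keyed, unwrapped on yield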

3 Comments

@tahoar I am using the same script to sort a huge file. When I run it, I get an error on line 51, with open(output,'wb',64*1024) as output_file: output_file.writelines(merge(key, *chunks)): ValueError: I/O operation on closed file. Have you seen this error? The sort worked fine for a small file, though!
@Think, the truth is, I abandoned this effort. I encountered new problems each time I moved up to larger file sizes. Since I only need this functionality on Linux, my final solution uses Python's subprocess.Popen() to call the Linux sort utility, and all my problems disappeared (see the sketch after these comments). Sorry I can't help further.
@tahoar That's exactly the right approach. GNU sort is very sophisticated at making the most of the available hardware (memory and CPU). Once your data no longer fits in memory, GNU sort is hard to beat from pure Python.
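
For completeness, here is a minimal sketch of that subprocess route (the helper name, the -T temp-directory flag, and the LC_ALL=C setting are illustrative choices, not something from this thread):

import os
import subprocess

def external_sort(input_path, output_path, tmp_dir='/tmp'):
    # Hypothetical helper: delegate the work to GNU sort, which spills
    # to temporary files in tmp_dir on its own once memory runs out.
    env = os.environ.copy()
    env['LC_ALL'] = 'C'  # bytewise collation: fast, and well-defined for UTF-8
    with open(output_path, 'wb') as out:
        subprocess.check_call(['sort', '-T', tmp_dir, input_path],
                              stdout=out, env=env)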
