23

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:

def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result

I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.

Is this possible or ridiculous?


Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.

3
  • How do you modify the method? Do you want to keep hash across program runs or is it within one run but across some module reloads? Commented Apr 26, 2010 at 20:49
  • I would have the method in a script file; I'd probably modify it by hand from time to time. The application is that this function would generate "problem data" for running in some simulation code. I'd change the problem from time to time. Commented Apr 26, 2010 at 21:23
  • Why don't you just use doc to check for changes in the function? Just change it every time you modify the function. It's both simple and encourages (or rather, forces) you to keep track track of your changes. As for the actual memoization, Mike Graham's got a perfect solution for you. Commented Apr 26, 2010 at 22:52

5 Answers 5

6

If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.

import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)

            return cache[source]
        return wrapper
    return decorator

@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result
Sign up to request clarification or add additional context in comments.

3 Comments

So the only hashing that goes on is Python's internal dictionary key hashing, where the key is the string value of a function's entire uncompiled code. Is there a way to get the compiled code of the function, so changing the line spacing or comments won't result in a different value?
@Seth, Yes, it makes sense to me to use Python's internal hashing here since what you really want is to compare the value (lest you have hash collisions and not know it. This is unlikely but very possible.) If you only wanted to cache the most recent function, I wouldn't use a dict or hashing at all, but just compare the value. Only since you said (out-of-band) that you wanted to store multiple versions of the function so you could return to old code did I use the dict.
@Seth, It should be possible to store less information than the whole source code—whitespace and comments and such intact like this—but it will take some care to be positive you have a necessary and sufficient condition for a match. f.func_code.co_code is the bytecode that the function actually stores, but I'm not sure I can promise you it will be the same between compiles or not. I am also not completely sure it can't give you false positives.
1

Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:

def time_consuming_method():
   # your existing method here

# Is the cached data older than this file?
if (not os.path.exists(data_file_name) 
    or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data

While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.

Comments

0

The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process that if the revision number of your file (put this function in it's own file) has changed, your release script deletes the serialzed lookup table.

Comments

0

So, here is a really neat trick using decorators:

def memoize(f):
    cache={};
    def result(*args):
        if args not in cache:
           cache[args]=f(*args);
        return cache[args];
    return result;

With the above, you can then use:

@memoize
def myfunc(x,y,z):
   # Some really long running computation

When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:

myfunc = memoize(new_definition_for_myfunc);

Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:

import os;
import os.path;
import cPickle;

class MemoizedFunction(object):
     def __init__(self,f):
         self.function=f;
         self.filename=str(hash(f))+".cache";
         self.cache={};
         if os.path.exists(self.filename):
             with open(filename,'rb') as file:
                 self.cache=cPickle.load(file);

     def __call__(self,*args):
         if args not in self.cache:
             self.cache[args]=self.function(*args);
         return self.cache[args];

     def __del__(self):
         with open(self.filename,'wb') as file:
              cPickle.dump(self.cache,file,cPickle.HIGHEST_PROTOCOL);

def memoize(f):
    return MemoizedFunction(f);

4 Comments

Pretty neat, yes, but doesn't answer the question. The point was that the code in the method changed, not the input parameters to it.
It appears that OP wants to memoize a function's one return value between runtimes. This caches a function's various return values during one run based on various arguments.
@Mike, ok. I didn't realize this was between runs of the program.
Your current revision relies on hash(f) to be the same between runs, which is neither guaranteed nor expected.
-1

What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.

A (overly simplified) example:

def memoized(f):
    cache={}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret=f(*args)
            cache[args]=ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result

Edit:

From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some of of persistent storage for the cache (e.g. something as simple as using Pickle or a simple text file, or maybe using a full blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.

1 Comment

It appears that OP wants to memoize a function's one return value between runtimes. This caches a function's various return values during one run based on various arguments.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.