Hashing a python function to regenerate output when the function is modified

Question

I have a python function that has a deterministic result. It takes a long time to run and generates a large output:

def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result

I modify time_consuming_function from time to time, but I would like to avoid having it run again while it's unchanged. [time_consuming_function only depends on functions that are immutable for the purposes considered here; i.e. it might have functions from Python libraries but not from other pieces of my code that I'd change.] The solution that suggests itself to me is to cache the output and also cache some "hash" of the function. If the hash changes, the function will have been modified, and we have to re-generate the output.

Is this possible or ridiculous?

Updated: based on the answers, it looks like what I want to do is to "memoize" time_consuming_function, except instead of (or in addition to) arguments passed into an invariant function, I want to account for a function that itself will change.

How do you modify the method? Do you want to keep hash across program runs or is it within one run but across some module reloads? — user319799
– user319799, Commented Apr 26, 2010 at 20:49
I would have the method in a script file; I'd probably modify it by hand from time to time. The application is that this function would generate "problem data" for running in some simulation code. I'd change the problem from time to time. — Seth Johnson
– Seth Johnson, Commented Apr 26, 2010 at 21:23
Why don't you just use doc to check for changes in the function? Just change it every time you modify the function. It's both simple and encourages (or rather, forces) you to keep track track of your changes. As for the actual memoization, Mike Graham's got a perfect solution for you. — ktdrv
– ktdrv, Commented Apr 26, 2010 at 22:52

Mike Graham · Accepted Answer · 2010-04-26 22:37:27Z

6

If I understand your problem, I think I'd tackle it like this. It's a touch evil, but I think it's more reliable and on-point than the other solutions I see here.

import inspect
import functools
import json

def memoize_zeroadic_function_to_disk(memo_filename):
    def decorator(f):
        try:
            with open(memo_filename, 'r') as fp:
                cache = json.load(fp)
        except IOError:
            # file doesn't exist yet
            cache = {}

        source = inspect.getsource(f)

        @functools.wraps(f)
        def wrapper():
            if source not in cache:
                cache[source] = f()
                with open(memo_filename, 'w') as fp:
                    json.dump(cache, fp)

            return cache[source]
        return wrapper
    return decorator

@memoize_zeroadic_function_to_disk(...SOME PATH HERE...)
def time_consuming_function():
    # lots_of_computing_time to come up with the_result
    return the_result

edited Apr 26, 2010 at 22:37

answered Apr 26, 2010 at 22:17

Mike Graham

77.2k16 gold badges105 silver badges131 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Seth Johnson Over a year ago

So the only hashing that goes on is Python's internal dictionary key hashing, where the key is the string value of a function's entire uncompiled code. Is there a way to get the compiled code of the function, so changing the line spacing or comments won't result in a different value?

Mike Graham Over a year ago

@Seth, Yes, it makes sense to me to use Python's internal hashing here since what you really want is to compare the value (lest you have hash collisions and not know it. This is unlikely but very possible.) If you only wanted to cache the most recent function, I wouldn't use a dict or hashing at all, but just compare the value. Only since you said (out-of-band) that you wanted to store multiple versions of the function so you could return to old code did I use the dict.

Mike Graham Over a year ago

@Seth, It should be possible to store less information than the whole source code—whitespace and comments and such intact like this—but it will take some care to be positive you have a necessary and sufficient condition for a match. f.func_code.co_code is the bytecode that the function actually stores, but I'm not sure I can promise you it will be the same between compiles or not. I am also not completely sure it can't give you false positives.

Mike Graham · Accepted Answer · 2010-04-27 03:05:56Z

1

Rather than putting the function in a string, I would put the function in its own file. Call it time_consuming.py, for example. It would look something like this:

def time_consuming_method():
   # your existing method here

# Is the cached data older than this file?
if (not os.path.exists(data_file_name) 
    or os.stat(data_file_name).st_mtime < os.stat(__file__).st_mtime):
    data = time_consuming_method()
    save_data(data_file_name, data)
else:
    data = load_data(data_file_name)

# redefine method
def time_consuming_method():
    return data

While testing the infrastructure for this to work, I'd comment out the slow parts. Make a simple function that just returns 0, get all of the save/load stuff working to your satisfaction, then put the slow bits back in.

edited Apr 27, 2010 at 3:05

Mike Graham

77.2k16 gold badges105 silver badges131 bronze badges

answered Apr 26, 2010 at 20:58

Daniel Stutzbach

77.1k17 gold badges90 silver badges79 bronze badges

Comments

frankc · Accepted Answer · 2010-04-26 22:56:34Z

0

The first part is memoization and serialization of your lookup table. That should be straightforward enough based on some python serialization library. The second part is that you want to delete your serialized lookup table when the source code changes. Perhaps this is being overthought into some fancy solution. Presumably when you change the code you check it in somewhere? Why not add a hook to your checkin routine that deletes your serialized table? Or if this is not research data and is in production, make it part of your release process that if the revision number of your file (put this function in it's own file) has changed, your release script deletes the serialzed lookup table.

answered Apr 26, 2010 at 22:56

frankc

11.5k4 gold badges34 silver badges49 bronze badges

Comments

Michael Aaron Safyan · Accepted Answer · 2010-04-26 23:18:44Z

0

So, here is a really neat trick using decorators:

def memoize(f):
    cache={};
    def result(*args):
        if args not in cache:
           cache[args]=f(*args);
        return cache[args];
    return result;

With the above, you can then use:

@memoize
def myfunc(x,y,z):
   # Some really long running computation

When you invoke myfunc, you will actually be invoking the memoized version of it. Pretty neat, huh? Whenever you want to redefine your function, simply use "@memoize" again, or explicitly write:

myfunc = memoize(new_definition_for_myfunc);

Edit
I didn't realize that you wanted to cache between multiple runs. In that case, you can do the following:

import os;
import os.path;
import cPickle;

class MemoizedFunction(object):
     def __init__(self,f):
         self.function=f;
         self.filename=str(hash(f))+".cache";
         self.cache={};
         if os.path.exists(self.filename):
             with open(filename,'rb') as file:
                 self.cache=cPickle.load(file);

     def __call__(self,*args):
         if args not in self.cache:
             self.cache[args]=self.function(*args);
         return self.cache[args];

     def __del__(self):
         with open(self.filename,'wb') as file:
              cPickle.dump(self.cache,file,cPickle.HIGHEST_PROTOCOL);

def memoize(f):
    return MemoizedFunction(f);

edited Apr 26, 2010 at 23:18

answered Apr 26, 2010 at 21:01

Michael Aaron Safyan

96k16 gold badges140 silver badges200 bronze badges

4 Comments

Lasse V. Karlsen Over a year ago

Pretty neat, yes, but doesn't answer the question. The point was that the code in the method changed, not the input parameters to it.

Mike Graham Over a year ago

It appears that OP wants to memoize a function's one return value between runtimes. This caches a function's various return values during one run based on various arguments.

Michael Aaron Safyan Over a year ago

@Mike, ok. I didn't realize this was between runs of the program.

Mike Graham Over a year ago

Your current revision relies on hash(f) to be the same between runs, which is neither guaranteed nor expected.

MAK · Accepted Answer · 2010-04-27 09:24:05Z

-1

What you describe is effectively memoization. Most common functions can be memoized by defining a decorator.

A (overly simplified) example:

def memoized(f):
    cache={}
    def memo(*args):
        if args in cache:
            return cache[args]
        else:
            ret=f(*args)
            cache[args]=ret
            return ret
    return memo

@memoized
def time_consuming_method():
    # lots_of_computing_time to come up with the_result
    return the_result

Edit:

From Mike Graham's comment and the OP's update, it is now clear that values need to be cached over different runs of the program. This can be done by using some of of persistent storage for the cache (e.g. something as simple as using Pickle or a simple text file, or maybe using a full blown database, or anything in between). The choice of which method to use depends on what the OP needs. Several other answers already give some solutions to this, so I'm not going to repeat that here.

edited Apr 27, 2010 at 9:24

answered Apr 26, 2010 at 21:01

MAK

26.6k11 gold badges57 silver badges86 bronze badges

1 Comment

Mike Graham Over a year ago

It appears that OP wants to memoize a function's one return value between runtimes. This caches a function's various return values during one run based on various arguments.

Collectives™ on Stack Overflow

Hashing a python function to regenerate output when the function is modified

5 Answers 5

3 Comments

Comments

Comments

4 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

3 Comments

Comments

Comments

4 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related