Saturday, November 29, 2008

Building a simple identity map in Django

In Django's ticket tracker lies ticket 17, the second oldest open ticket, this proposes an optimisation to have instances of the same database object be represented by the same object in Python, essentially that means for this code:

a = Model.objects.get(pk=3)
b = Model.objects.get(pk=3)

a and b would be the same object at the memory level. This can represent a large optimisation in memory usage if you're application has the potential to have duplicate objects(for example, related objects). It is possible to implement a very simple identity map without touching the Django source at all.

The first step is to set up some very basic infastructure, this is going to be almost identical to what Eric Florenzano does in his post, "Drop-dead simple Django caching".

We start with a few helper functions:

_CACHE = {}

def key_for_instance(obj, pk=None):
if pk is None:
pk = obj.pk
return "%s-%s-%s" % (obj._meta.app_label, obj._meta.module_name, pk)

def get_from_cache(klass, pk):
return _CACHE[key_for_instance(klass, pk)]

def cache_instance(instance):
_CACHE[key_for_instance(instance)] = instance


We create our cache, which is a Python dictionary, a function to generate the cache key for an object, a function to get an item from the cache, and a function to cache an item. How these work should be pretty simple. Next we need to create some functions to make sure objects get update in the cache.

from django.db.models.signals import post_save, pre_delete

def post_save_cache(sender, instance, **kwargs):
cache_instance(instance)
post_save.connect(post_save_cache)

def pre_delete_uncache(sender, instance, **kwargs):
try:
del _CACHE[key_for_instance(instance)]
except KeyError:
pass
pre_delete.connect(pre_delete_uncache)


Here we set up two signal receivers, when an object is saved we cache it, and when one is deleted we remove it from the cache.

Now we want a way to use our cache the way we already use our connection to the database, this means implementing some sort of hook in a QuerySet, this looks like:

from django.db.models.query import QuerySet

class CachingQueryset(QuerySet):
def __iter__(self):
obj = self.values_list('pk', flat=True)
for pk in obj:
try:
yield get_from_cache(self.model, pk)
except KeyError:
instance = QuerySet(self.model).get(pk=pk)
cache_instance(instance)
yield instance

def get(self, *args, **kwargs):
clone = self.filter(*args, **kwargs)
objs = list(clone[:2])
if len(objs) == 1:
return objs[0]
if not objs:
raise self.model.DoesNotExist("%s matching query does not exist."
% self.model._meta.object_name)
raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
% (self.model._meta.object_name, len(objs), kwargs))



We create a subclass of QuerySet and override it's __iter__() and get() methods. By default __iter__ does a fair bit of heavy lifting to internally cache the results and allow the usage of multiple iterators properly. We override this to do something simpler. We get the primary keys of each item in the queryset and iterate over them, if the object is in the cache we return it, otherwise we execute a database query to get it, and then cache it. We also override get() to make sure it makes use of the caching we just set up.

To use this on a model we need to create a simple manager:

class CachingManager(Manager):
def get_query_set(self):
return CachingQuerySet(self.model)


And then we can use this with our models:

class Post(models.Model):
title = models.CharField(max_length=100)

objects = CachingManager()

Post.objects.all()


Now all Posts accessed within the same thread will be cached using the strategy we've implemented.

This strategy will not save us database queries, indeed in some cases it can result in many more queries, it is designed to save memory usage(and be implemented as simply as possible). It can also be made far more useful by having related objects use this strategy as well(if Post had a foreign key to author it would be nice to have all post authors share the same instances, since even you have a large queryset of Posts were all the Posts are unique, they are likely to have duplicate authors).

2 comments:

  1. Crap.. Just wrote a long reply and blogger ate it.

    in short: don't use _CACHE = {} trick, its very bad. Create a new cache object from django.core (you dont need to use django.core.cache.cache for everything... the pycon site has 8 cache objects).

    Use a real thread safe cache (dict will leak memory and greatly slow down your site over time).

    use django.utils.cache (has stuff for key name generation from PK, etc) and/or django.core.cache.get_cache

    They are there for a reason, and is actually _less_ code.

    ReplyDelete
  2. Caveat emptor: Process-local cache like this obviously has coherence problems if some other process updates a model instance.

    ReplyDelete

Note: Only a member of this blog may post a comment.