Tuesday, February 10, 2009

A Second Look at Inheritance and Polymorphism with Django

Previously I wrote about ways to handle polymorphism with inheritance in Django's ORM in a way that didn't require any changes to your models at all (besides adding in a mixin). Today we're going to look at a way to do this that is a little more invasive and involved, but can also provide much better performance. As we saw previously, with no other information we could get the correct subclass for a given object in O(k) queries, where k is the number of subclasses. That means a queryset with n items needs O(nk) queries: for a queryset with 10 items and 3 subclasses, that's up to 30 queries, which isn't really acceptable for most websites. The major problem is that for each object we simply guess which subclass it is. However, that's a piece of information we could know concretely if we cached it for later use, so let's start there. We're going to be building a mixin class just like we did last time:

from django.db import models

class InheritanceMixIn(models.Model):
    _class = models.CharField(max_length=100)

    class Meta:
        abstract = True


So now we have a simple abstract model, with a field for caching which subclass we are, that the base of our inheritance trees can subclass. Now let's add the methods to actually cache the subclass and retrieve the right object:

from django.db import models
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMixIn(models.Model):
    ...

    def save(self, *args, **kwargs):
        if not self.id:
            # Only cache on the first save. _meta.parents only contains
            # concrete parents, so the base class of the tree (whose only
            # parent is the abstract mixin) has nothing to do here.
            if self._meta.parents:
                parent = self._meta.parents.keys()[0]
                subclasses = parent._meta.get_all_related_objects()
                for klass in subclasses:
                    # Look for the reverse one-to-one that points back at us;
                    # its accessor name is what we cache.
                    if isinstance(klass, RelatedObject) and klass.field.primary_key \
                            and klass.opts == self._meta:
                        self._class = klass.get_accessor_name()
                        break
        return super(InheritanceMixIn, self).save(*args, **kwargs)

    def get_object(self):
        try:
            # If the cached accessor names a class other than our own, follow
            # it; otherwise we're already the most derived object.
            if self._class and self._meta.get_field_by_name(self._class)[0].opts != self._meta:
                return getattr(self, self._class)
        except FieldDoesNotExist:
            pass
        return self


Our save method is where all the magic really happens. First, we make sure we only do this caching the first time a model is saved. Then we take our first concrete parent class, if we have one (this means it probably won't play nicely with multiple inheritance, which is unfortunate, but not as common a use case; the guard also keeps the base class of the tree, whose only parent is the abstract mixin, from blowing up). Next we get all the related objects that parent has (this includes the reverse relationships the subclasses create). Then, for each of those, if it is a RelatedObject, and it is the primary key on its model, and the class it points to is the same as us, we cache its accessor name on the model, break out, and do the normal save procedure.
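To make that concrete, here's what a small hierarchy using the mixin might look like (Place, Restaurant, and Bar are just hypothetical example models, not anything from the code above):

from django.db import models

class Place(InheritanceMixIn):
    name = models.CharField(max_length=100)

class Restaurant(Place):
    serves_pizza = models.BooleanField(default=False)

class Bar(Place):
    happy_hour = models.BooleanField(default=False)

# Saving a subclass instance caches its reverse accessor name on the row:
#   r = Restaurant.objects.create(name="Gino", serves_pizza=True)
#   r._class                                  ->  "restaurant"
#   Place.objects.get(pk=r.pk).get_object()   ->  the Restaurant, not the Place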

Our get_object method is pretty simple: if we have our class cached, and the cached class isn't the same type as ourselves, we get the attribute named by the cached accessor and return it; otherwise we are the last descendant and just return ourselves. There is one (possibly quite large) caveat here: if our inheritance chain is more than one level deep (that is to say, our subclasses have subclasses), this won't return those objects correctly. The class is actually cached correctly, but since the top level object doesn't have an attribute named after the 2nd level subclass, it doesn't return anything useful. I believe this can be worked around, but I haven't found a way yet. One idea would be to store the full ancestor chain in the CharField, comma separated, and then just traverse it.
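For what it's worth, here is a rough, untested sketch of how that comma separated chain might work. The only real changes from the mixin above are a wider field, a save that walks every concrete ancestor instead of just the immediate parent, and a get_object that follows each accessor in turn:

from django.db import models
from django.db.models.related import RelatedObject

class InheritanceMixIn(models.Model):
    # Wider field, since it now holds a comma separated chain of accessors.
    _class = models.CharField(max_length=255)

    class Meta:
        abstract = True

    def save(self, *args, **kwargs):
        if not self.id:
            accessors = []
            current = self._meta
            # Walk up through every concrete ancestor, recording the reverse
            # accessor at each level, so a GrandChild stores "child,grandchild".
            while current.parents:
                parent = current.parents.keys()[0]
                for klass in parent._meta.get_all_related_objects():
                    if isinstance(klass, RelatedObject) and klass.field.primary_key \
                            and klass.opts == current:
                        accessors.append(klass.get_accessor_name())
                        break
                current = parent._meta
            accessors.reverse()
            self._class = ",".join(accessors)
        return super(InheritanceMixIn, self).save(*args, **kwargs)

    def get_object(self):
        obj = self
        chain = [a for a in self._class.split(",") if a]
        while chain:
            accessor = chain.pop(0)
            # Skip any link we're already at or past (module_name is the
            # lower-cased class name, which is what the accessors default to);
            # combined with the metaclass below, a single getattr here can
            # jump several levels at once.
            if accessor == obj._meta.module_name or obj._meta.module_name in chain:
                continue
            obj = getattr(obj, accessor)
        return obj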

There is one more thing we can do to make this even easier, which is to have instances automatically become the correct subclass when they are pulled in from the DB. This does have an overhead: pulling in a queryset with n items guarantees O(n) queries. This can be improved (just as it was for the previous solution) by ticket #7270, which allows select_related to traverse reverse relationships. In any event, we can write a metaclass to handle this for us automatically:

from django.db import models
from django.db.models.base import ModelBase
from django.db.models.fields import FieldDoesNotExist
from django.db.models.related import RelatedObject

class InheritanceMetaclass(ModelBase):
    def __call__(cls, *args, **kwargs):
        # Instantiate as normal, then swap in the cached subclass (if any).
        obj = super(InheritanceMetaclass, cls).__call__(*args, **kwargs)
        return obj.get_object()

class InheritanceMixIn(models.Model):
    __metaclass__ = InheritanceMetaclass
    ...


Here we've created a fairly trivial metaclass that subclasses the default one Django uses for its models. The only method we've written is __call__; on a metaclass, __call__ handles the instantiation of an object, so it is what ends up invoking __init__. We do whatever the default __call__ does, so that we get an instance as normal, then call the get_object() method we wrote earlier and return the result, and that's all.
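Continuing the hypothetical Place/Restaurant example from earlier, with the metaclass on the mixin the upcasting happens transparently, at the cost of one extra query per object pulled in; something like this is what you'd expect at the shell:

>>> r = Restaurant.objects.create(name="Gino", serves_pizza=True)
>>> [type(obj).__name__ for obj in Place.objects.all()]
['Restaurant']

There's no need to call get_object() by hand; the metaclass already did it as each row was instantiated.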

We've now looked at two ways to handle polymorphism, with this one being more efficient in all cases (ignoring the overhead of the extra CharField). However, it still isn't totally efficient, and it fails in several edge cases. Whether automating the handling of something like this is a good idea needs to be considered on a project by project basis, as the extra queries can be a large overhead; if they can't be avoided, though, automating them is probably advantageous.

3 comments:

  1. Nice post. We're doing something similar on a project I'm working on. A few thoughts:

    Why not use a ForeignKey to ContentType instead of storing the class name in a char field? That should resolve all of your problems with traversing the MRO and whatnot.

    In save() you'd do:
    self.content_type = ContentType.objects.get_for_model(self)

    And in get_object() you'd do:
    leaf_type = self.content_type
    if not leaf_type:
        return self
    leaf_class = leaf_type.model_class()
    if isinstance(self, leaf_class):
        return self
    return leaf_class.objects.get(pk=self.pk)

    The metaclass you're using is interesting. I have mixed feelings about it. What we're doing is replacing the default manager with our own subclassed manager (we dynamically subclass the existing Manager), then that manager returns a custom QuerySet that dynamically subclasses the existing Manager's QuerySet. Converting to the 'leaf' objects is handled by our QuerySet subclass in get(), latest(), and __iter__().

    Your metaclass is actually cleaner than the one we're using, which I like. But with ours you have the ability to create a second manager that won't do the additional work to get the subclass, which is a nice feature.

  2. Mike,

    You raise a couple of really good points, especially with regards to using a ContentType ForeignKey. The one disadvantage of that approach is that you lose the advantages from #7270, since you'll be doing a whole new query and bypassing the objects cache; on the other hand, that way clearly benefits from simple PK based caching (as I've also written about).

    And yes, one of the unfortunate effects of using the metaclass is that it's all-or-nothing behavior.

  3. I do this using an FK to ContentType as well. The metaclass idea is interesting; I just have a cast() method on the InheritanceMixin that I call manually if I want to cast to the subclass type (because I don't always need to). Seems like with a little work it would be possible to write a manager that gets all the objects as their subclass type with only k+1 queries (one for the superclass table, one for each relevant subclass table) rather than n+1.

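For what it's worth, here's a rough sketch of the kind of manager the last couple of comments are describing, reusing the _class cache from the post rather than a ContentType ForeignKey. The names (SubclassingQuerySet, SubclassingManager) are made up, it only handles one level of inheritance (the same limitation as the post), it would replace the metaclass rather than sit alongside it, and it hasn't been battle tested:

from django.db import models
from django.db.models.fields import FieldDoesNotExist
from django.db.models.query import QuerySet

class SubclassingQuerySet(QuerySet):
    def __iter__(self):
        # One query for the base table, then one bulk query per distinct
        # cached subclass, instead of one extra query per row.
        parents = list(super(SubclassingQuerySet, self).__iter__())
        by_accessor = {}
        for obj in parents:
            by_accessor.setdefault(obj._class, []).append(obj)

        children = {}
        for accessor, objs in by_accessor.items():
            if not accessor:
                continue  # plain base class rows stay as they are
            try:
                related = self.model._meta.get_field_by_name(accessor)[0]
            except FieldDoesNotExist:
                continue
            # related.model is the subclass the accessor points at; with multi
            # table inheritance the child shares its primary key with the base.
            children.update(related.model.objects.in_bulk([o.pk for o in objs]))

        for obj in parents:
            yield children.get(obj.pk, obj)

class SubclassingManager(models.Manager):
    def get_query_set(self):
        return SubclassingQuerySet(self.model)

You'd attach it as objects = SubclassingManager() on the base class and, as the first commenter suggests, you could keep a second plain Manager around for the times you don't want the extra queries.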
