Lazy Pythonista: January 2009

Saturday, January 31, 2009

Building a Magic Manager

A very common pattern in Django is to create methods on a manager to abstract some usage of one's data. Some people take a second step and create a custom QuerySet subclass with these methods, then have their manager proxy the methods to the QuerySet; this pattern is seen in Eric Florenzano's Django From the Ground Up screencast. However, this requires a lot of repetition. It would be far less verbose if we could define our methods once and have them available on both our managers and QuerySets.

Django's manager class has one hook for providing the QuerySet, so we'll start with this:


from django.db import models

class MagicManager(models.Manager):
   def get_query_set(self):
       qs = super(MagicManager, self).get_query_set()
       return qs

Here we have a very simple get_query_set method; it doesn't do anything but return its parent's queryset. Now we need to actually get the methods defined on our class onto the queryset:


class MagicManager(models.Manager):
   def get_query_set(self):
       qs = super(MagicManager, self).get_query_set()
       class _QuerySet(qs.__class__):
           pass
       for method in dir(self):
           # Skip dunder names and anything the queryset already provides.
           if method.startswith('__') or hasattr(_QuerySet, method):
               continue
           attr = getattr(self, method)
           if callable(attr):
               setattr(_QuerySet, method, attr)
       qs.__class__ = _QuerySet
       return qs

The trick here is that we dynamically create a subclass of whatever class the call to our parent's get_query_set method returns. Then we take each attribute on ourselves, and if the queryset doesn't have an attribute by that name and the attribute is callable, we assign it to our QuerySet subclass. Finally we set the __class__ attribute of the queryset to our QuerySet subclass. This works because when Django chains queryset methods it gives the copy of the queryset the same class as the current one, so anything we add to our manager will be available not only on the immediately following queryset but on any that follow due to chaining.
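To see the class-swapping trick in isolation, here is a minimal sketch that runs without Django. FakeQuerySet, FakeManager, MagicFakeManager and evens are all hypothetical stand-ins invented for illustration; only the body of get_query_set mirrors the manager above. It also makes one detail visible: getattr(self, name) returns a method bound to the manager, so a method copied this way still receives the manager as self, even when it is called on a chained queryset.

```python
class FakeQuerySet(object):
    """Hypothetical stand-in for Django's QuerySet; wraps a list of items."""
    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        # Like Django's chaining, the copy is built from self.__class__,
        # so a swapped-in subclass survives the call.
        return self.__class__(x for x in self.items if predicate(x))


class FakeManager(object):
    """Hypothetical stand-in for models.Manager."""
    def __init__(self, items):
        self._items = list(items)

    def get_query_set(self):
        return FakeQuerySet(self._items)


class MagicFakeManager(FakeManager):
    def get_query_set(self):
        qs = super(MagicFakeManager, self).get_query_set()

        class _QuerySet(qs.__class__):
            pass

        for name in dir(self):
            if name.startswith('__') or hasattr(_QuerySet, name):
                continue
            attr = getattr(self, name)
            if callable(attr):
                # attr is bound to the manager, not to the queryset.
                setattr(_QuerySet, name, attr)
        qs.__class__ = _QuerySet
        return qs

    def evens(self):
        # A custom "manager method" defined only once.
        return self.get_query_set().filter(lambda x: x % 2 == 0)


manager = MagicFakeManager(range(1, 7))
print(manager.evens().items)            # [2, 4, 6]

chained = manager.get_query_set().filter(lambda x: x > 2)
print(chained.items)                    # [3, 4, 5, 6]
print(hasattr(chained, 'evens'))        # True: the subclass survived chaining
print(chained.evens().items)            # [2, 4, 6]: evens ran against the manager
```

The last line is the part to watch: in this sketch the copied method is reachable from the chained queryset, but it starts over from the manager's base queryset rather than from the filtered one. Because the manager above copies attributes with the same getattr call, it is worth checking that behaviour before relying on chaining.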

Now that we have this we can simply subclass it to add methods, and then add it to our models like a regular manager. Whether this is a good idea is debatable: on the one hand, having to write methods twice is a gross violation of Don't Repeat Yourself; on the other, this is exceptionally implicit, which is a major violation of The Zen of Python.

Saturday, January 24, 2009

Django Ajax Validation 0.1.0 Released

I've just uploaded the first official release of Django Ajax Validation to PyPI. You can get it there. For those who don't know, it is a reusable Django application that allows you to do JS-based validation using your existing form definitions. Currently it only works with jQuery. If there are any problems please let me know.

Monday, January 19, 2009

Optimizing a View

Lately I've been playing with a bit of a fun side project. I have about a year and a half's worth of my own chat logs with friends (65,000 messages in total) and I've been playing around with them to find interesting statistics. One facet of my communication with my friends is that we link each other lots of things, and we can always tell when someone is linking something that we've already seen. So I decided an interesting bit of information would be to see who is the worst offender.

So we want to write a function that returns the number of items each person has relinked, excluding items they themselves linked. I started off with the simplest implementation I could, and this was the end result:


from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(int)
    for message in Message.objects.all().order_by('-time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if Message.objects.filter(time__lt=message.time).filter(message__contains=word).exclude(speaker=message.speaker).count():
                    links[message.speaker] += 1
    links = sorted(links.iteritems(), key=itemgetter(1), reverse=True)
    return links

Here I iterated over the messages, and for each one I went through each of its words. If a word started with http (the definition of a link for my purposes), I checked whether it had ever been linked before by someone other than the author of the current message.

This took about 4 minutes to execute on my dataset, and it executed about 10,000 SQL queries. This is clearly unacceptable: you can't have a view that takes that long to render or hits your DB that hard. Even with aggressive caching this would have been unworkable. Further, this algorithm is O(n**2) or thereabouts, so as my dataset grows the running time would have grown quadratically.

By changing this around, however, I was able to achieve far better results:


from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message

def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(set)
    counts = defaultdict(int)
    for message in Message.objects.all().filter(message__contains="http").order_by('time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                if any([word in links[speaker] for speaker in links if speaker != message.speaker]):
                    counts[message.speaker] += 1
                links[message.speaker].add(word)
    counts = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)
    return counts

Here I go through only the messages that contain the string "http" (already a huge advantage, since we process about 1/6 as many messages in Python as we originally did). For each message we go through each of its words, and for each word that is a link we check whether any other person has already posted it by looking in the per-speaker caches we maintain in Python. If someone has, we increment the current speaker's count. Finally we add the link to the current speaker's cache.
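The single-pass idea can be tried without a database. The following is a rough sketch under a few assumptions: count_relinks and the sample messages are made up for illustration, messages arrive as (speaker, text) tuples already in chronological order, and a plain whitespace split stands in for word_split_re.

```python
import re
from collections import defaultdict
from operator import itemgetter

# Assumption: a simplified stand-in for django.utils.html.word_split_re.
WORD_SPLIT_RE = re.compile(r'\s+')


def count_relinks(messages):
    """
    messages: iterable of (speaker, text) tuples in chronological order.
    Returns a list of (speaker, relink_count) sorted by count, descending.
    """
    links = defaultdict(set)   # speaker -> links that speaker has posted
    counts = defaultdict(int)  # speaker -> number of relinks by that speaker
    for speaker, text in messages:
        for word in WORD_SPLIT_RE.split(text):
            if not word.startswith('http'):
                continue
            # Has anyone other than the current speaker posted this link?
            if any(word in seen
                   for other, seen in links.items() if other != speaker):
                counts[speaker] += 1
            links[speaker].add(word)
    return sorted(counts.items(), key=itemgetter(1), reverse=True)


sample = [
    ("alice", "look at http://example.com/a"),
    ("bob", "seen this? http://example.com/a"),
    ("carol", "new: http://example.com/b"),
    ("bob", "http://example.com/b and http://example.com/a"),
    ("alice", "http://example.com/b"),
]
print(count_relinks(sample))  # [('bob', 3), ('alice', 1)]
```

Each message is visited once and every membership test is a set lookup, so the work per link depends on the number of speakers rather than on the number of earlier messages.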

By comparison this executes in 0.3 seconds, executes only 1 SQL query, and will scale linearly (as well as is possible). For reference, both of these functions are compiled using Cython. Compiling with Cython takes almost no work, and for computationally heavy operations it can be a huge boon.

Wednesday, January 14, 2009

New Admin URLs

I'm very happy to announce that as of revision 9739 of Django the admin uses normal URL resolvers, so its URLs can be reversed. This is a tremendous improvement over the previous ad hoc system. To make this work, a new feature went into the URL resolving system that any user can use in their own code.

Specifically users can now have objects which provide URLs. Basically this is because include() can now accept any iterable, rather than just a string which points to a urlconf. To get an idea of what this looks like you can look at the way the admin now does it.

There are going to be a few more great additions to Django going in as we move towards the 1.1 alpha so keep an eye out for them.

Saturday, January 3, 2009

2008 and 2009

Eric Florenzano recently wrote an excellent post titled "2008 in Review & 2009 Goals". I am now going to blatantly steal his idea.

2008:

Joined Twitter
Voted for the first time
Had a blast at PyCon
Wrote a reusable Django application and open sourced it
Learned git
Got accepted at RPI and began studying Computer Science there
Wrote a blog post every day for 30 days in a row
Started work on my programming language, Al
Contributed heavily to Django
Started a blog
Proposed a panel for PyCon 2009 and was accepted

And probably quite a bit more.

2009:

Continue work on Al, bring it up to a usable state
Continue writing on my blog
Propose a talk for another conference
Continue contributing to Django
Become a committer on one of the projects I use
Do something interesting over the summer, hopefully a cool internship or Google Summer of Code
Do well in class

In light of all the stuff that happened in 2008 I hope these are reasonable goals.
