Saturday, November 29, 2008

Building a simple identity map in Django

In Django's ticket tracker lies ticket 17, the second oldest open ticket. It proposes an optimisation: instances of the same database row should be represented by the same object in Python. Essentially, that means for this code:

a = Model.objects.get(pk=3)
b = Model.objects.get(pk=3)

a and b would be the same object at the memory level. This can be a large optimisation in memory usage if your application has the potential to hold duplicate objects (for example, related objects). It is possible to implement a very simple identity map without touching the Django source at all.

The first step is to set up some very basic infrastructure; this is going to be almost identical to what Eric Florenzano does in his post, "Drop-dead simple Django caching".

We start with a few helper functions:

_CACHE = {}

def key_for_instance(obj, pk=None):
    if pk is None:
        pk = obj.pk
    return "%s-%s-%s" % (obj._meta.app_label, obj._meta.module_name, pk)

def get_from_cache(klass, pk):
    return _CACHE[key_for_instance(klass, pk)]

def cache_instance(instance):
    _CACHE[key_for_instance(instance)] = instance


We create our cache, which is a Python dictionary, a function to generate the cache key for an object, a function to get an item from the cache, and a function to cache an item. How these work should be pretty simple. Next we need to create some functions to make sure objects get updated in the cache.

from django.db.models.signals import post_save, pre_delete

def post_save_cache(sender, instance, **kwargs):
    cache_instance(instance)
post_save.connect(post_save_cache)

def pre_delete_uncache(sender, instance, **kwargs):
    try:
        del _CACHE[key_for_instance(instance)]
    except KeyError:
        pass
pre_delete.connect(pre_delete_uncache)


Here we set up two signal receivers: when an object is saved we cache it, and when one is deleted we remove it from the cache.

Now we want a way to use our cache the way we already use our connection to the database. This means implementing some sort of hook in a QuerySet, which looks like:

from django.db.models.query import QuerySet

class CachingQuerySet(QuerySet):
    def __iter__(self):
        pks = self.values_list('pk', flat=True)
        for pk in pks:
            try:
                yield get_from_cache(self.model, pk)
            except KeyError:
                instance = QuerySet(self.model).get(pk=pk)
                cache_instance(instance)
                yield instance

    def get(self, *args, **kwargs):
        clone = self.filter(*args, **kwargs)
        objs = list(clone[:2])
        if len(objs) == 1:
            return objs[0]
        if not objs:
            raise self.model.DoesNotExist("%s matching query does not exist."
                % self.model._meta.object_name)
        raise self.model.MultipleObjectsReturned("get() returned more than one %s -- it returned %s! Lookup parameters were %s"
            % (self.model._meta.object_name, len(objs), kwargs))



We create a subclass of QuerySet and override its __iter__() and get() methods. By default __iter__ does a fair bit of heavy lifting to internally cache the results and allow the proper use of multiple iterators. We override this to do something simpler: we get the primary keys of each item in the queryset and iterate over them; if an object is in the cache we yield it, otherwise we execute a database query to get it, and then cache it. We also override get() to make sure it makes use of the caching we just set up.

To use this on a model we need to create a simple manager:

from django.db.models import Manager

class CachingManager(Manager):
    def get_query_set(self):
        return CachingQuerySet(self.model)


And then we can use this with our models:

from django.db import models

class Post(models.Model):
    title = models.CharField(max_length=100)

    objects = CachingManager()

Post.objects.all()


Now all Posts accessed within the same thread will be cached using the strategy we've implemented.
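
For example (a hypothetical usage sketch; it assumes a Post row with pk=1 already exists in the database), two separate lookups now hand back the very same object:

a = Post.objects.get(pk=1)
b = Post.objects.get(pk=1)
assert a is b  # both names point at the single cached instance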

This strategy will not save us database queries; indeed, in some cases it can result in many more queries. It is designed to save memory usage (and to be implemented as simply as possible). It can also be made far more useful by having related objects use this strategy as well (if Post had a foreign key to an author it would be nice to have all post authors share the same instances, since even if you have a large queryset of Posts where all the Posts are unique, they are likely to have duplicate authors).

Other ORM Goodies

In addition to the aggregate work, the GSOC student had time to finish ticket 7210, which adds support for expressions to filter() and update(). This means you'll be able to execute queries in the form of:

SELECT * FROM table WHERE height > width;

or similar UPDATE queries. This has a syntax similar to that of Q objects, using a new F object. So the above query would look like:

Model.objects.filter(height__gt=F('width'))

or an update query could look like:

Employee.objects.update(salary=F('salary')*1.1)

These objects support the full range of arithmetic operations. These are slated to be a part of Django 1.1.

Thursday, November 27, 2008

What aggregates are going to look like

Prior to Django 1.0 there was a lot of discussion of what the syntax for doing aggregate queries would look like. Eventually a syntax was more or less agreed upon, and over the summer Nicolas Lara implemented this for his Google Summer of Code project, mentored by Russell Keith-Magee. This feature is considered a blocker for Django 1.1, so I'm going to outline what the syntax for these aggregates will be.

To facilitate aggregates, two new methods are being added to the queryset: aggregate and annotate. Aggregate is used to perform basic aggregation on the queryset itself, for example getting the MAX, MIN, AVG, COUNT, and SUM for a given field on the model. Annotate is used for getting information about a related model.

For example, if we had a product model with a price field, we could get the maximum and minimum price for a product by doing the following:

Product.objects.aggregate(Min('price'), Max('price'))


This will return something like this:

{'price__min': 23.45,
'price__max': 47.89,
}


We can also give the results aliases, so it's easier to read (if no alias is provided it falls back to using fieldname__aggregate):

Product.objects.aggregate(max_price = Max('price'), min_price = Min('price'))
{'min_price': 23.45,
'max_price': 47.89,
}


You can also do aggregate queries on related fields, but the idea is the same: return a single value for each aggregate.
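
For instance (a purely hypothetical sketch; it assumes a Book model with a price field and a foreign key from Book to Publisher), an aggregate across the relation might look like:

Publisher.objects.aggregate(Avg('book__price'))

which would return a dictionary keyed, by default, on something like book__price__avg.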

In my opinion, annotate queries are far more interesting. Annotate queries let us represent queries such as, "give me all of the Tags that more than 3 objects have been tagged with", which would look like:

Tag.objects.annotate(num_items=Count('tagged')).filter(num_items__gt=3)


This would return a normal queryset where each Tag object has an attribute named num_items, which is the Count() of all of tagged for it (I'm assuming tagged is a reverse foreign key to a model that represents a tagged relationship). Another query we might want to execute would be to see how many awards the authors at each author's publisher had won, which would look like:

Author.objects.annotate(num_publisher_awards=Count('publisher__authors__awards')).order_by('num_publisher_awards')


This is a little more complicated, but just like when using filter() we can chain this __ syntax. Also, as you've probably noticed, we can filter and order_by these annotated attributes the same as we can with regular fields.

If you're interested in seeing more of how this works, Nicolas Lara has written some documentation and doc tests that you can see here. For now none of this is in the Django source tree yet, but there is a patch with the latest work on ticket 366.

Happy thanksgiving!

Some thoughts on Blogging

As we come to the close of November, I thought I'd take a moment to reflect on blogging. For one, this is by far more writing than I've ever done. Also, who thought blog-every-day-month should be November? Thanksgiving at the end really makes it hard to finish strong. Right now I'm pretty brain dry; hopefully I'll think of something good to finish off the month. For now I'm just trying to enjoy the long weekend.

I should probably take a chance to say thanks to Brian Rosner, Michael Trier, and anyone else who wouldn't stop pestering me to get a blog.

Wednesday, November 26, 2008

Home Sweet Home

In the interest of keeping up with post-a-day, I figured at a minimum I'd have a post explaining why I didn't have a real post for today. I just got home from college today for Thanksgiving, so I've been busy today, sorry :( .

Monday, November 24, 2008

A timeline view in Django

One thing a lot of people want to do in Django is to have a timeline view that shows all the objects of a given set of models ordered by a common key. Unfortunately the Django ORM doesn't have a way of representing this type of query. There are a few techniques people use to solve this. One is to have all of the models inherit from a common base class that stores all the common information, and has a method to get the actual object. The problem with this is that it could execute either O(N) or O(N*k) queries, where N is the number of items and k is the number of models. It's N if your base class stores the subtype on it, in which case you can grab the actual object directly; otherwise it's N*k since you have to try each type. Another approach is to use a generic relation; this will also need O(N) queries since you need to get the related object for each generic one. However, there's a better solution.

What we can do is get a queryset for each of the models we want to display (O(k) queries), sorted on the correct key, and then use a simple merge to combine all of these querysets into a single list, comparing on a given key. While this technically may do more operations than the other methods, it does fewer database queries, and that is often the most difficult portion of your application to scale.

Let's say we have 3 models: new tickets, changesets, and wiki page edits (what you see in a typical Trac install). We can get our querysets and then merge them like so:

def my_view(request):
    tickets = Ticket.objects.order_by('create_date')
    wikis = WikiEdit.objects.order_by('create_date')
    changesets = Changeset.objects.order_by('create_date')
    objs = merge(tickets, wikis, changesets, field='create_date')
    return render_to_response('my_app/template.html', {'objects': objs})


Now we just need to write our merge function:

def merge_lists(left, right, field=None):
    i, j = 0, 0
    result = []
    while i < len(left) and j < len(right):
        if getattr(left[i], field) <= getattr(right[j], field):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge(*querysets, **kwargs):
    field = kwargs.pop('field', None)
    if field is None:
        raise TypeError('you need to provide a key to do comparisons on')
    if len(querysets) == 1:
        return querysets[0]

    qs = [list(x) for x in querysets]
    q1, q2 = qs.pop(), qs.pop()
    result = merge_lists(q1, q2, field)
    for q in qs:
        result = merge_lists(result, q, field)
    return result

There might be a more efficient way to write our merge function, but for now it merges together an arbitrary number of querysets on a given key.

And that's all there is to it. If you see a good way to make the merge function more efficient, let me know; I would have liked to use Python's included heapq module, but it doesn't have a way to use a custom comparison function that I saw.
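
One possible workaround, sketched here on the assumption that you have Python 2.6's heapq.merge available, is to decorate each object with its sort key so that plain tuple comparison does the work; this would be a drop-in replacement for the merge function above:

import heapq
from operator import attrgetter

def merge(*querysets, **kwargs):
    field = kwargs.pop('field', None)
    if field is None:
        raise TypeError('you need to provide a key to do comparisons on')
    key = attrgetter(field)
    # decorate every object as (sort key, queryset index, object) so the
    # tuples compare on the key first, then merge the already-sorted streams
    decorated = [((key(obj), i, obj) for obj in qs)
                 for i, qs in enumerate(querysets)]
    return [obj for sort_key, i, obj in heapq.merge(*decorated)]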

Sunday, November 23, 2008

Thinking about netbooks

At present I use a fairly powerful laptop as my all-in-one machine: it's my development environment, it's my gaming machine, and I use it for class. Before I came to college I used my desktop machine for everything. However, I'm beginning to think neither of these is the best solution. My laptop can do everything my desktop used to do, however it isn't as good at being super portable as a laptop could be, nor does it have the potential to be a powerhouse like a desktop machine can be. Even though my laptop is a relatively balanced machine with a 15 inch screen, I still feel like I'm making compromises on both ends, and I'm wondering if there's a better solution.

On the laptop side, my laptop isn't as lightweight as one could be, it weighs six or seven pounds. It also doesn't have as good a battery life as I'd like, 2 to 3 hours. On the desktop side it's not particularly upgradeable, meaning once it's out of date the whole thing needs to be replaced. On the other hand it does have some advantages over each, right now it's as powerful as it needs to be for anything I throw at it, and it is mobile enough that I can take it to my classes without having to worry.

The compromise I'm considering is whether to get a small netbook, such as the Asus EEE PC, for my mobile needs, and use my desktop for the heavy lifting. This has a few advantages: the EEE PC has a seven hour battery life, and weighs about three pounds. Plus my desktop runs things fine now, and I can upgrade individual components as needed. It has some drawbacks though; when I go places I can't bring my gaming machine with me (making going home for the holidays a big pain).

For now I'm not planning on changing anything about my setup; this is just thinking ahead for when my processing needs eclipse my current laptop (hopefully not for another year or two). Does anyone use a setup like this, or just have a netbook? What are your thoughts?

A quick update

I've now set up Al to be using GMP for all integers, and I'll be doing the same for floats once they get implemented. I haven't started benchmarking yet, but it can compile and calculate the factorial of 50000 pretty quickly, whereas in vanilla Python that would result in a RuntimeError due to a stack overflow, so it's a good starting point. Sorry for such a short post, I'm pretty tired today.

Friday, November 21, 2008

My Programming Language - Status Update

Over the past few weeks I've been working on compiling my programming language. At present it works by translating the source into C++, and then you compile that with your compiler of choice. It's garbage collected, using the excellent Boehm GC library. At present it can only compile a limited subset of what it can actually parse, or what the interpreter supports. As of today though, it can compile and run a factorial function, however it can't calculate any factorial greater than 12, due to integer overflow issues. To solve this I'm either going to use GMP or roll my own bignum library, and I'm not sure which yet. On the whole though, progress is good. The generated C++ is about as good as it could be considering the limitations inherent in turning an interpreted language into a compiled one. I haven't started benchmarking it yet; that was originally going to be the point of today's post before I ran into the integer overflow issues. However, this is an example of the C++ code that is generated.

Given this Al (also valid Python):

def fact(n):
    if n == 1 or n == 0:
        return 1
    return n * fact(n-1)

print(fact(1))
print(fact(12))


It generated the following C++:

#include "src/base.h"

AlObj *fact;
class f0:public AlFunction
{
public:
virtual AlObj * operator () (ARG_TYPE args, KWARG_TYPE kwargs)
{
AlObj *n = args.back ();
args.pop_back ();
if (*
((*((*(n)) == (AlObj *) (new AlInt (1))))
|| (*(n)) == (AlObj *) (new AlInt (0))))
{
return (AlObj *) (new AlInt (1));;
}
ARG_TYPE t0;
t0.push_back ((*(n)) - (AlObj *) (new AlInt (1)));
return (*(n)) * (*fact) (t0, KWARG_TYPE ());
}
};

int
main ()
{
fact = new f0 ();
ARG_TYPE t1;
ARG_TYPE t2;
t2.push_back ((AlObj *) (new AlInt (1)));
t1.push_back ((*fact) (t2, KWARG_TYPE ()));
(*print) (t1, KWARG_TYPE ());
ARG_TYPE t3;
ARG_TYPE t4;
t4.push_back ((AlObj *) (new AlInt (12)));
t3.push_back ((*fact) (t4, KWARG_TYPE ()));
(*print) (t3, KWARG_TYPE ());
}


All said and done, I'm pretty impressed! You can get all the code here, all the compilation work is in the code-generation branch.

Thursday, November 20, 2008

Why I don't use easy_install

First things first, this post is not meant as a flame, nor should it indicate to you that you shouldn't use easy_install, unless of course your priorities are perfectly aligned with my own. That being said, here are the reasons why I don't use easy_install, and how I'd fix them.
  1. No easy_uninstall. Zed mentioned this in his PyCon '08 lightning talk, and it's still true. Yes, I can simply remove these files, and yeah, I could write a script to do it for me. But I shouldn't have to: if I can install packages, I should be able to uninstall packages without doing any work.
  2. I can't update all of my currently installed packages. For any packages I don't have explicitly locked at a particular version (which, to its credit, easy_install makes very easy to do), it should be very easy to upgrade all of these, because I probably want to have them up to date, and I can always lock them at a specific version if I want.
  3. I don't want to have two package managers on my machine. I run Ubuntu, so I already have apt-get, which I find to be a really good system (and it doesn't suffer from either of the aforementioned problems). Having two package managers inherently brings additional confusion: if a package is available in both, which do I install it from? It's an extra thing to remember to keep up to date (assuming #2 is fixed), and it's, in general, an extra thing to think about every time I go to update anything on my machine.

So what's my solution? PyPI is a tremendous resource for Python libraries, and there are great tools in Python for working with it; for example, using a setup.py file makes it incredibly easy to get your package up on PyPI and keep it up to date. So there's no reason to throw all that stuff out the window. My solution would be for someone to set up a server that mirrored all the data from PyPI regularly, and then offered the packages as .debs (for Debian/Ubuntu users, and as RPMs for Fedora users, etc.). That way a user of a given package manager can just add the URL to their sources list and then install everything that's available from PyPI, plus they derive all of the benefits of their given package manager (for me personally, the ability to uninstall and batch upgrade).

Note: I'm not suggesting everyone use apt-get, I'm merely suggesting everyone use their native package manager, and there's no reason easy_install/pip/virtualenv can't also be used.

Wednesday, November 19, 2008

Uncoupled code is good, but doesn't exist

Code should try to be as decoupled from the code it depends on as possible: I want my C++ to work with any compiler, I want my web framework to work with any ORM, I want my ORM to work with any database. While all of these are achievable goals, some of the decoupling people are searching for is simply not possible. At DjangoCon 2008 Mark Ramm made the argument that the Django community was too segregated from the Python community, both in terms of the community itself and the code; Django, for example, doesn't take enough advantage of WSGI-level middleware, and has an ORM unto itself. I believe some of these claims to be true, but I ultimately think the level of uncoupling some people want is simply impossible.

One of Django's biggest selling features has always been its automatically generated admin. The admin requires you to be using Django's models. Some people would like it to be decoupled. To them I ask, how? It's not as if Django's admin has a big if not isinstance(obj, models.Model): raise Exception; it simply expects whatever is passed to it to define the same API as it uses. And this is the larger concern: the Django admin is simply an application, it has no hooks within Django itself, it just happens to live in that namespace. The moment any application does Model.objects.all(), it's no longer ORM agnostic; it has already assumed the usage of the Django ORM. However, all this means is that applications themselves are inextricably tied to a given ORM, templating language, and any other module they import; you quite simply can't write reasonable code that works just as well with two different modules unless they both define the same API.

Eric Florenzano wrote a great blog post yesterday about how Django could take better advantage of WSGI middleware, and he's absolutely correct. It makes no sense for a Django project to have its own special middleware for using Python's profiling modules, when it can be done more generically a level up; all the code is in Python after all. However, there are also things that you can't abstract out like that, because they require a knowledge of what components you are using: SQLAlchemy has one transaction model, Django has another.

The fact that an application is tied to the modules it uses is not an argument against it. A Django application is no more tightly coupled to Django's ORM and template system than a TurboGears application is to SQLAlchemy and Mako, which is to say of course they're tied to them: they import those modules, they use them, and unless another module implements the same API you can't just swap them out. And that's not a bad thing.

Tuesday, November 18, 2008

What Python learned from economics

I find economics to be a fairly interesting subject; mind you, I'm bored out of my mind hearing about the stock markets, derivatives, and whatever else is on CNBC, but I find what guys like Steven Levitt and Steven E. Landsburg do to be fascinating. A lot of what they write about is why people do what they do, and how to incentivise people to do the right thing. Yesterday I was reading through David Goodger's Code Like a Pythonista when I got to this portion:

LUKE: Is from module import * better than explicit imports?

YODA: No, not better. Quicker, easier, more seductive.

LUKE: But how will I know why explicit imports are better than the wild-card form?

YODA: Know you will when your code you try to read six months from now.

And I realized that Python had learned a lot from these economists.

It's often difficult for a programmer to see the advantage of doing something the right way, which will be beneficial in six months, over just getting something done now. However, Python enforces doing things the right way, and when doing things the right way is just as easy as doing them the wrong way, you make the intuitive decision of doing things the right way. Almost every code base I've worked with (outside of Python) had some basic indentation rules that the code observed; Python just encodes this into the language, which requires all code to have a certain level of readability.

Django has also learned this lesson. For example, the template language flat out prevents you from putting your business logic inside of it without doing some real work; you don't want to do that work, so you do things the right way and put your business logic in your views. Another example would be database queries: in Django it would be harder to write a query that injected unescaped data into your SQL than it would be to do the right thing and use parameterized queries.

Ultimately, this is why I like Python. The belief that best practices shouldn't be optional, and that they shouldn't be difficult, creates a community where you actively want to go and learn from people's code. Newcomers to the language aren't encouraged to "just get something working, and then clean it up later," the community encourages them to do it right in the first place, and save themselves the time later.

Monday, November 17, 2008

Running the Django Test Suite

This question came up in IRC yesterday, so I figured I'd run through it today. Django has a very extensive test suite that tests all of the components of Django itself. If you've ever written a patch for Django you probably know that tests are a requirement, for both new features and bug fixes. I'm going to try to run down how to set up your own testing environment.

First you need to have Django installed somewhere; for now I'll assume you have it in ~/django_src/. Somewhere on your Python path, go ahead and use django-admin.py to create a new project. I've named this project django_test. Next, inside of that project create a folder named settings, move settings.py into that folder, and rename it __init__.py. The reason we're going to have a settings directory is so that we can have subsettings for individual scenarios. Put some default settings for the various fields in there now; for example, my default settings provide the necessary options for SQLite. Now if there are any other subsettings you wish to set up, create a file for them in the settings directory, and at the top of this file put from django_test.settings import *, followed by whatever settings you wish to override. For example, I have a mysql.py that overrides my default SQLite database settings with MySQL values.
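
As an illustration (a minimal sketch; the values are whatever your local database needs, and I'm assuming the Django 1.0-era database settings), the mysql.py subsettings file might look like:

# django_test/settings/mysql.py
from django_test.settings import *

DATABASE_ENGINE = 'mysql'
DATABASE_NAME = 'django_test'
DATABASE_USER = 'root'
DATABASE_PASSWORD = ''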

Now that we have our test settings, go to the directory where you have Django installed. To run the tests do ./tests/runtests.py --settings=django_test.settings. You can also use the DJANGO_SETTINGS_MODULE environment variable in place of passing in the settings like this. You can also provide a verbosity argument; the default is 0, but I prefer to run with verbosity=1 because this keeps you up to date on the progress of the tests, without too much extraneous output.

Usually when you are working on a patch you don't need to run the entire test suite, your patch only affects a few tests. Therefore, you can provide a list of tests you'd like to run, so for example you could do ./tests/runtests.py --settings=django_test.settings.mysql -v 1 model_forms model_formsets. This will run the model_forms and model_formsets tests, with your mysql settings, at verbosity level 1.

And that's all it takes to run the Django test suite.

Sunday, November 16, 2008

What I'm excited about in Django 1.1

This past week, Jacob Kaplan-Moss put together the list of all of the features proposed for Django 1.1, and began to solicit comments on them. This is going to be a list of features I'm excited about.
  • Making admin URLs reversible. Currently the Django admin uses a bit of a convoluted scheme to route URLs. The proposal is to have them work using the normal URL resolution system. This is something I've been working on for a while, and I'm hoping to see it through to completion.
  • Comment-utils inclusion. The proposal is to include the moderation level features from comment-utils in Django. I think this is a great idea, and can't wait to see what sort of spam check schemes people implement once moderation facilities are included.
  • Message passing for anonymous users. This is basically session-level message passing. This is something I've had to implement in the past, so I'm looking forward to this.
  • ORM aggregation. As part of the Google Summer of Code, Nicolas Lara, mentored by Russell Keith-Magee, implemented this. I love the API design, and it's a hugely requested feature; I can't wait to point new users to the docs, rather than explaining to them that it's coming soon.
  • ORM expression support, this work was also done by Nicolas Lara, and will let you do things like Model.objects.filter(height__gt=F('width')), or Model.objects.update(salary = F('salary')*1.2).
  • Model Validation, before 1.0 I implemented unique, and unique_together checks for model forms. That's pretty much a massive special case of this.
  • Class-based generic views. Often one of the first things a new user to Django will learn to do is create a view that simply wraps a generic view, in order to do some custom filtering on a queryset. This is a great solution for that use case, however as users want to inject more and more flexibility into a generic view it can lead to a huge number of settings. Rather than this, subclassing a generic view could provide a nice clean solution.

These will all be great features, and there are many more proposed (you can see them all here), however these features only happen because people write code for them. If there's a feature you're excited about, or interested in making a reality, try to contribute to it, even if it's just writing some unit tests.

Saturday, November 15, 2008

Python Things

I wasn't really sure what to name today's post, but it's basically going to be nifty things you can do in Python, and general tips.

  • SystemExit: sys.exit() raises SystemExit, so if you actually want to keep going, you can just catch this exception; there's nothing special about it.
  • iter(callable, terminal): if you use iter in this way, it will keep calling the callable until the callable returns terminal, then it stops (see the sketch after this list).
  • a < x < b , in Python you can chain comparison operators like this. That's the same as writing a < x and x < b.
  • dict(): amongst the other ways to instantiate a dictionary in Python, you can give it a list of two-tuples, so for example [('a', 2), ('b', 3)] becomes {'a': 2, 'b': 3}.
  • open(filename) is an iterable; each iteration yields another line.
  • If you don't need ordering, use set() instead of list(). set() has better runtime for just about every operation, so if you don't need the ordering, use it.
  • Python comes with turtle graphics. This probably doesn't matter to most people, but if you want to help get a kid into programming, import turtle can be a great way.
  • pdb, the Python debugger, is simply invaluable; try: code that isn't working, except ExceptionThatGetsRaised: import pdb; pdb.set_trace() is all it takes to get started with the interactive debugger.
  • webbrowser.open(url), this module is just cool, it opens up the users browser to the desired URL.

And those are my tips! Please share yours.
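
To make a couple of these concrete, here's a small sketch (the file name and numbers are made up) showing the iter(callable, terminal) form, chained comparisons, and dict() built from pairs:

import functools

# read a file in fixed-size chunks; iter() keeps calling the callable
# until it returns the terminal value '' (end of file)
f = open('data.txt')
for chunk in iter(functools.partial(f.read, 64), ''):
    print chunk
f.close()

x = 5
print 0 < x < 10                    # same as (0 < x) and (x < 10)
print dict([('a', 2), ('b', 3)])    # {'a': 2, 'b': 3}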

Friday, November 14, 2008

And now for the disclaimer

I've been discussing portions of how the Django internals work, and this is powerful knowledge for a Django user. However, it's also internals, and unless they are documented, internals are not guaranteed to continue to work. That doesn't mean they break very frequently (they don't), but you should be aware that they have no guarantee of compatibility going forward.

Having said that, I've already discussed ways you can use these internals to do powerful things, and you've probably seen other ways to use them in your own code. In my development I don't really balk at the idea of using the internals, because I track Django's development very aggressively and I can update my code as necessary, but for a lot of developers that isn't really an option. Once you deploy something you need it to work, so your options are to either lock your code at a specific version of Django, or not use these internals. What happens if you want to update to Django 1.1 for aggregation support, but 1.1 also removed some internal helper function you were using? Something similar to this happened to django-tagging: before the queryset-refactor branch was merged into trunk there was a helper function to parse portions of the query, and django-tagging made use of it. However, queryset-refactor obsoleted this function and removed it, and so django-tagging had to update in order to work going forward, needing to either handle this situation in the code itself, or maintain two separate branches.

In my opinion, while these things may break, they are worth using if you need them, because they let you do very powerful things. This may not be the answer for everyone though. In any event I'm going to continue writing about them, and if they interest you Marty Alchin has a book coming out, named Pro Django, that looks like it will cover a lot of these.

Thursday, November 13, 2008

Django Models - Digging a Little Deeper

For those of you who read my last post on Django models, you probably noticed that I skirted over a few details; specifically, for quite a few items I said we "added them to the new class". But what exactly does that entail? Here I'm going to look at the add_to_class method that's present on the ModelBase metaclass we looked at earlier, and the contribute_to_class method that's present on a number of classes throughout Django.

So first, the add_to_class method. This is called for each item we add to the new class, and what it does is: if that item has a contribute_to_class method, then we call it with the new class and its name (the name it should attach itself to the new class as) as arguments. Otherwise we simply set that attribute to that value on the new class. So for example with new_class.add_to_class('abc', 3), 3 doesn't have a contribute_to_class method, so we just do setattr(new_class, 'abc', 3).
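
In other words, the logic boils down to something like this (a simplified sketch, not the actual Django source):

def add_to_class(cls, name, value):
    # objects that know how to install themselves get to do it their own way
    if hasattr(value, 'contribute_to_class'):
        value.contribute_to_class(cls, name)
    else:
        # everything else becomes a plain class attribute
        setattr(cls, name, value)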

The contribute_to_class method is more common for things you set on your class, like Fields or Managers. The contribute_to_class method on these objects is responsible for doing whatever is necessary to add the object to the new class and do its setup. If you remember from my first blog post about User foreign keys, we used the contribute_to_class method to add a new manager to our class. Here we're going to look at what a few of the built-in contribute_to_class methods do.

The first case is a manager. The manager sets its model attribute to be the model it's added to. Then it checks to see whether or not the model already has a _default_manager attribute; if it doesn't, or if the manager's creation counter is lower than that of the current default manager, it sets itself as the default manager on the new class. The creation counter is essentially a way for Django to keep track of which manager was added to the model first. Lastly, if this is an abstract model, it adds itself to the abstract_managers list in _meta on the model.

The next case is if the object is a field; different fields actually do slightly different things, but first we'll cover the general field case. It also, first, sets a few of its internal attributes, so it knows what its name is on the new model, additionally calculating its column name in the db, and its verbose_name if one isn't explicitly provided. Next it calls add_field on the model's _meta to add itself to _meta. Lastly, if the field has choices, it sets the get_FIELD_display method on the class.

Another case is for file fields. They do everything a normal field does, plus some more stuff. They also add a FileDescriptor to the new class, and they also add a signal receiver so that when an instance of the model is deleted the file also gets deleted.

The final case is for related fields. This is also the most complicated case. I won't describe exactly what this code does, but its biggest responsibility is to set up the reverse descriptors on the related model, those are the nice things that let you do author_obj.books.all().

Hopefully this gives you a good idea of what to do if you wanted to create a new field like object in Django. For another example of using these techniques, take a look at the generic foreign key field in django.contrib.contenttypes, here.
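
To tie it together, here's a toy sketch of a field-like object using contribute_to_class (the names are made up, and a real field would do far more):

class AuditedAttribute(object):
    """A toy object that knows how to install itself on a model class."""
    def __init__(self, default=None):
        self.default = default

    def contribute_to_class(self, cls, name):
        # remember the name we were attached under
        self.name = name
        # install a plain attribute holding the default value
        setattr(cls, name, self.default)
        # and add a helper method named after the attribute
        setattr(cls, 'describe_%s' % name,
                lambda instance: '%s=%r' % (name, getattr(instance, name)))

Anything assigned on a model class that defines contribute_to_class gets the chance to hook itself up this way, exactly as the manager and field cases above do.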

Wednesday, November 12, 2008

What software do I use?

Taking a page from Brian Rosner's book, today I'm going to overview the software I use day to day. I'm only going to cover stuff I use under Ubuntu; I keep Windows XP on my system for gaming, but I'm not going to cover it here.

  • Ubuntu, I've been using the current version, Intrepid Ibex, since Alpha 4, and I love it. You quite simply couldn't get me to go back to Windows.
  • Python, it's my go to language, I fell in love about 14 months ago and I'm never going to leave it.
  • Django, it's my framework of choice, it's simple, clean, and well designed.
  • g++, C++ is the language used in my CS class, so I use my favorite free compiler.
  • gnome-do, this is an incredibly handy application, similar to Quicksilver for OS X, it makes simple things super fast, stuff like spawning the terminal, posting a tweet, searching for a file, or calling on Google's awesome calculator.
  • Firefox, the tried and true free browser, I also have to thank, Gmail Notifier, Firebug, Download them All, and Reload Every.
  • Chatzilla, I figured this extension deserved its own mention, I use it almost 24/7 and couldn't live without it.
  • Gedit, who would think that the text editor that came with my OS would be so great?
  • VLC and Totem, you guys are both great, VLC is a bit nicer for playing flvs, but I love Totem's ability to search and play movies from Youtube.
  • Skype, makes it easy to get conference calls going with 5 friends, couldn't live without it.

As you can see, most of the software I use is open source. I don't imagine anything I use is very outside the mainstream, but all of these projects deserve a round of applause for being great.

Monday, November 10, 2008

How the Heck do Django Models Work

Anyone who has used Django for just about any length of time has probably used a Django model, and possibly wondered how it works. The key to the whole thing is what's known as a metaclass; a metaclass is essentially a class that defines how a class is created. All the code for this is here. And without further ado, let's see what it does.

So the first thing to look at is the method __new__. __new__ is sort of like __init__, except instead of returning an instance of the class, it returns a new class. You can sort of see this in the argument signature: it takes cls, name, bases, and attrs. Where __init__ takes self, __new__ takes cls. name is a string which is the name of the class, bases is a tuple of the classes the new class subclasses, and attrs is a dictionary mapping names to class attributes.
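
For reference, the general shape of such a metaclass (a bare-bones sketch, not Django's actual code) looks like this:

class SimpleBase(type):
    def __new__(cls, name, bases, attrs):
        # bail out for classes that aren't subclasses of our base,
        # mirroring the check described below
        if not any(isinstance(b, SimpleBase) for b in bases):
            return super(SimpleBase, cls).__new__(cls, name, bases, attrs)
        # otherwise build the class empty, then attach each attribute one at
        # a time, which is where Django's add_to_class dance happens
        module = attrs.pop('__module__')
        new_class = super(SimpleBase, cls).__new__(cls, name, bases, {'__module__': module})
        for attr_name, value in attrs.items():
            setattr(new_class, attr_name, value)
        return new_class

class Base(object):
    __metaclass__ = SimpleBase

class Article(Base):
    title = "default"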

The first thing the __new__ method does is check if the new class is a subclass of ModelBase, and if it's not, it bails out and returns a normal class. The next thing is it gets the module of the class, and sets that attribute on the new class (this is going to be a recurring theme: getting something from the original class, and putting it in the right place on the new class). Then it checks if it has a Meta class (where you define your model-level options); it has to look in two places for this, first in the attrs dictionary, which is where it will be if you stick your class Meta inside your class. However, because of inheritance, we also have to check if the class has a _meta attribute already (this is where Django ultimately stores a bunch of internal information), and handle that scenario as well.

Next we get the app_label attribute; for this we either use the app_label attribute in the Meta class, or we pull it out of sys.modules. Lastly (at least for Meta), we build an instance of the Options class (which lives at django.db.models.options.Options) and add it to the new class as _meta. Next, if this class isn't an abstract base class we add the DoesNotExist and MultipleObjectsReturned exceptions to the class, and also inherit the ordering and get_latest_by attributes if we are a subclass.

Now we start getting to adding fields and such. First we check if we have a _default_manager attribute, and if not, we set it to None. Next we check if we've already defined the class, and if we have, we just return the class we already created. Now we go through each item that's left in the attrs dictionary and call the add_to_class method with it on the new class. add_to_class is a piece of internals that you may recognize from my first two blog posts; I'll explain exactly what it does in another post, but at its most basic level it adds each item in the dictionary to the new class, and each item knows where exactly it needs to get put.

Now we do a bunch of stuff to deal with inherited models. We iterate through every item in bases that's also a subclass of models.Model, and do the following: if it doesn't have a _meta attribute, we ignore it. If the parent isn't an abstract base class, then if we already have a OneToOne field to it we set that up as a primary key, otherwise we create a new OneToOne field and install it as a primary key for the model. If it is an abstract class, we iterate through its fields; if any of these fields has a name that is already defined on our class, we raise an error, otherwise we add that field to our class. And now we move managers from the parents down to the new class. Essentially we just copy them over, and we also copy over virtual fields (these are things like GenericForeignKeys, which don't actually have a database field, but we still need to pass down and set up appropriately).

And then we do a few final pieces of cleanup. We make sure our new class doesn't have abstract=True in its _meta, even if it's inherited from an abstract class. We add a few methods (get_next_in_order, and others), we inherit the docstring or set a new one, and we send the class_prepared signal. Finally, we register the model with Django's model loading system, and return the instance in Django's model cache; this is to make sure we don't have duplicate copies of the class floating around.

And that's it! Obviously I've skirted over how exactly some things occur, but you should have a basic idea of what happens. As always with Django, the source is an excellent resource. Hopefully you have a better idea of what exactly happens when you subclass models.Model now.

Getting Started With PLY - Part 3

As promised, today we'll be looking at implementing additional arithmetic operations, dealing with order of operations, and adding variables to our languages, so without further ado, let's jump into the code.

We can replace our old addition rule with this:

import operator

def p_expression_arithmetic(p):
    '''
    expression : expression PLUS expression
               | expression MINUS expression
               | expression TIMES expression
               | expression DIVIDE expression
    '''
    OPS = {
        '+': operator.add,
        '-': operator.sub,
        '*': operator.mul,
        '/': operator.div,
    }
    p[0] = OPS[p[2]](p[1], p[3])


Hopefully what this code does is pretty clear: the | in the rule is an or option. So if we match any of these, we get the correct function out of our OPS dictionary (if you aren't familiar with the operator module, check it out, it's awesome), and then call it with the two arguments.

This handles the arithmetic correctly, but doesn't handle order of operations, so lets add that in:

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)


What this says is that all these operations are left-associative, and TIMES and DIVIDE have a higher precedence than PLUS and MINUS (within each grouping the operators have equal precedence, and thus read left to right).

Now that we have a fully functioning calculator, let's add in variables. First we need to add a token for NAMEs (variables) and for the assignment operator:

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    return t

t_EQ = r'='


And of course add NAME and EQ to the list of tokens, and now a few parsing rules:

names = {}

def p_expression_name(p):
    '''
    expression : NAME
    '''
    p[0] = names[p[1]]

def p_assignment(p):
    '''
    assignment : NAME EQ expression
    '''
    names[p[1]] = p[3]


So here we define a names dictionary; it will map variables to values. Hopefully the parse rules are fairly obvious, and everything makes sense.
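
One wrinkle worth calling out: yacc treats the first rule it sees as the grammar's start symbol, so for assignments to actually be reachable you either need a start declaration or a rule that ties the two together. A minimal sketch of the latter, assuming the rules above, might be:

start = 'statement'  # tell yacc which rule is the root of the grammar

def p_statement(p):
    '''
    statement : assignment
              | expression
    '''
    p[0] = p[1]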

Sunday, November 9, 2008

Getting Started With PLY - Part 2

Yesterday we created our tokens, and using these we can tokenize our language (which right now is a calculator) into a stream of tokens. Unfortunately this isn't very useful. So today we are going to start writing a grammar, and building an interpreter around it.

In PLY, grammar rules are defined similarly to tokens, that is, using docstrings. Here's what a few grammar rules for our language might look like:


def p_expression_plus(p):
    '''
    expression : expression PLUS expression
    '''
    p[0] = p[1] + p[3]

def p_expression_number(p):
    '''
    expression : NUMBER
    '''
    p[0] = p[1]


So the way the first docstring works is: an expression is defined as expression PLUS expression. Here PLUS is the token we defined earlier, and expression is any other way we've defined expression, so an expression is also a NUMBER (which is the other token we defined earlier). The way the code works is essentially that p[0] is the result, and each piece of the definition is its own subscript, so p[1] and p[3] refer to the two expressions in the plus expression we defined.

To actually use this parser we've defined we do:

import ply.yacc as yacc

parser = yacc.yacc()

if __name__ == '__main__':
    while True:
        try:
            s = raw_input('calc > ')
        except EOFError:
            break
        if not s:
            continue
        result = parser.parse(s)
        print result


Try it out! As an exercise, the reader can implement other operations (remember the order of operations!), and perhaps variables. Tomorrow, I'll be discussing implementing these. As always, the PLY documentation is excellent, and available here.

Saturday, November 8, 2008

Getting Started With PLY

The other day I mentioned I was using PLY in my post about building a language, so today I'm going to describe getting started with PLY, specifically the tokenization phase. For those who don't know much about parsing a language, the tokenization phase is where we take the source file and turn it into a series of tokens. For example, turning a = 3 + 4 into NAME EQUALS NUMBER PLUS NUMBER. As you can see, that simple assignment becomes 5 tokens: a is a NAME token, = and + are the EQUALS and PLUS tokens, and both numbers are NUMBER tokens. So how do we do this in PLY?

PLY's method for defining tokenization rules is very creative. First you define a list of tokens, for example:

tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
)


Here we have declared the types of tokens we will use; what each of these is should be self-explanatory. Then we define some rules, which look like this:

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        t.value = 0
    return t


This is probably less obvious. There are 2 ways to define the rules for a token: either as a string, or as a function. Either way they are named t_TOKEN_NAME. For a lot of tokens you can just do the string; those are the ones that don't require processing, and the string is just a regex that matches the token. For things that do need processing, we can define a function. The function takes 1 parameter, which is a token object; as you can see in our example, we take in t, and since we are defining a number token we set the value to be the integer for the string representation from the source code. The interesting thing here is how we define the regex for a function rule: PLY uses the function's docstring to get the regex for it.

Now that we have all of our rules set up we need to actually build the lexer object:

lexer = lex.lex()


And then we can use the input() function on a lexer to provide the source code, and the token function to pop the next token off the lexer.
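
Putting it together, a small usage sketch (the input string is just an example) looks like:

lexer.input("3 + 4 * 10")
while True:
    tok = lexer.token()
    if not tok:
        break  # no more input
    print tok.type, tok.value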

That's all for today, in the future we'll take a look at the other components of building the grammar of a language, and at how we implement it. For more information now, PLY has excellent documentation, available here.

Friday, November 7, 2008

That's not change we can believe in

Yesterday president-elect Obama's campaign unveiled their transitional website, change.gov. So, as someone who's interested in these things, I immediately began to look at what language, framework, or software package they were using. The first thing I saw was that they were using Apache, however beyond that there were no distinctive headers. None of the pages had tell-tale extensions like .php or .aspx. However, one thing that struck me was that most pages were at a URL in the form of /page/*/, which is the same format as the Obama campaign website, which I knew was powered by Blue State Digital's CMS. On the Obama campaign's site, however, there were a few pages with those tell-tale .php extensions, so I've come to the conclusion that the new site also uses PHP. And to that I say, that's not change we can believe in.

PHP has been something of a powerhouse in web development for the last few years; noted for its ease of deployment and quick startup times, it's drawn in legions of new users. However, PHP has several notable flaws. Firstly, it doesn't encourage best practices, ranging from things like code organization (PHP currently has no concept of namespaces) to database security (the included mysql database adapter doesn't feature parameterized queries), and beyond. However, this isn't just another post to bash PHP (as much as I'd like to write one); there are already plenty of those out there. This post is instead to offer some of the benefits of switching, to Python, or Ruby, or whatever else.

  • You develop faster. Using a framework like Django, or Rails, or TurboGears lets you do things very quickly.
  • You get the benefits of the community: with Django you get all the reusable applications, Rails has plugins, TurboGears has middleware. Things like these quite simply don't exist in the PHP world.
  • You get a philosophy. As far as I can tell, PHP has no philosophy, however both Python and Ruby do, and so do their respective frameworks. Working within a consistent philosophy makes development remarkably more sane.

If you currently are a user of PHP, I beg of you, take a chance, try out Ruby or Python, or whatever else. Give Django, or TurboGears, or Rails a shot. Even if you don't end up liking it, or switching, it's worth giving it a shot.

Thursday, November 6, 2008

Building a Programming Language with Python

One of my side projects of late has been building a programming language in Python, using the PLY library. PLY is essentially a Python implementation of the classic Lex and Yacc tools. The language, at present, has a syntax almost exactly the same as Python's (the notable difference, in so far as features that have been implemented, is that you are not allowed to have multiple statements on the same line, or to put anything following a colon on the same line). The language (currently called 'Al', although that's more of a working name) is a dynamic language that builds up a syntax tree for the code, and then executes it. However, the long term goal is to have it actually be a compiled language, similar to Lisp or C. Essentially the mechanism for doing this will be the same as how a C++ compiler handles multiple dispatch, which is dynamically at run time.

At present however, this mythical fully compiled language is far from complete; I haven't even begun to think about the assembly generation, mostly because I don't know assembly at all, and one of the courses I will be taking next semester covers assembly code. However, the questions that have to be asked are: what are the advantages of a compiled language, and what are the costs?

First the benefits:

  1. It's faster, even a worst case C++ program that fully utilizes multiple dispatch at runtime will go faster than a program using the same algorithms in Python.
  2. You get an executable at the end. This is a huge advantage for distribution, you don't need to distribute the source code, and you have an exe to give to people.

There are probably others, but I'm assuming the semantics of a language similar to Python, so I haven't included things like compile time type checking. And now the disadvantages:

  1. You lose some of the dynamism. Doing things like eval(), or dynamic imports, is inherently harder, if not impossible.
  2. You lose the REPL(interactive interpreter).

So can we overcome those? As far as I can tell the first should be doable. eval() necessitates the inclusion of an interpreter with the language; the thought of this already has to be making people think this is just going to end up as a VM. But I think this can be overcome: we can know, at compile time, whether or not a user will be using eval, and decide then whether or not to compile the interpreter and link against it. Dynamic imports are, if anything, harder; I think this is just an issue of doing run-time linking, but I'm not sure. As for the issue of the REPL, this is a non-issue as far as I'm concerned; there is no inherent reason a compiled language can't have a REPL, we just often don't build one, and languages like Common Lisp have long had both.

So now, let's see some code. I hope to have some code to show off that handles at least a subset of Python for PyCon 2009; as work begins on assembly generation I will post here. For anyone interested in the code at present, you can see it here.

Wednesday, November 5, 2008

PyGTK and Multiprocessing

Yesterday was election day, and for many people that meant long nights following the results, waiting to see who would be declared the next president of the United States of America. Politics is a game of numbers, and it's nice to offload the crunching to our computers. I had written up a simple application for projecting win likelihood for the candidates based on the likelihood of a win in an individual state. If you are interested in the application itself you can see it here. However this post is going to look at the new multiprocessing library, and how I used it with PyGTK.

Part of my application is that whenever you update a probability for a given candidate in a given state it recomputes their win percentage for the election as a whole. To make this as accurate as possible it runs multiple simulations of the scenario to compute the win percentage. Originally I was running these computations in the same thread as the GUI work and I found that I could only do about 250 simulations before it had a drastically negative impact on usability. So the next step was to offload these calculations into another process.

To go about this I created an Updater class which is a subclass of multiprocessing.Process. It takes a pipe as its only argument, and its run method just loops forever, polling the pipe for new projection requests, tabulating them, and then sending the projection back through the pipe.

In the main process the application starts by creating a duplex pipe and spawning the second process (giving it the pipe). Then, using the facilities of the gobject library, it sets up a method that checks for new projection results and updates the GUI, to be executed whenever the main thread is idle (gobject.idle_add). And lastly, the signal responder that gets called whenever the user changes some data simply marshals up the necessary data and sends it through the pipe to the other process.
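
The skeleton of that arrangement might look something like this (a rough sketch with made-up helper names, not the actual election-sim code):

import multiprocessing
import gobject

class Updater(multiprocessing.Process):
    def __init__(self, conn):
        super(Updater, self).__init__()
        self.conn = conn

    def run(self):
        # loop forever: read inputs, crunch the numbers, send a result back
        while True:
            data = self.conn.recv()
            self.conn.send(run_simulations(data))  # hypothetical helper

parent_conn, child_conn = multiprocessing.Pipe()
Updater(child_conn).start()

def check_for_results():
    # runs whenever the GUI thread is idle; poll so we never block the GUI
    if parent_conn.poll():
        update_gui(parent_conn.recv())  # hypothetical helper
    return True  # returning True keeps the idle callback installed

gobject.idle_add(check_for_results)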

And that's all, in total I believe it was under 25 lines of code changed to make my application use a separate process for calculation.

EDIT: Upon request, this is the diff where I made the original changes; several subsequent commits will better reflect what is described here though.

Tuesday, November 4, 2008

More Laziness with Foreign Keys

Yesterday we looked at building a field to make the process of getting a ForeignKey to the User model more simple, and to provide us with some useful tools, like the manager. But this process can be generalized, and made more robust. First we want to have a lazy ForeignKey field for all models (be careful not to confuse the term lazy; here I use it to refer to the fact that I am a lazy person, not the fact that foreign keys are lazily loaded).

A more generic lazy foreign key field might look like:

from django.db.models import ForeignKey, Manager

class LazyForeignKey(ForeignKey):
    def __init__(self, *args, **kwargs):
        model = kwargs.get('to')
        if model is None:
            model = args[0]
        try:
            name = model._meta.object_name.lower()
        except AttributeError:
            name = model.split('.')[-1].lower()
        self.manager_name = kwargs.pop('manager_name', 'for_%s' % name)
        super(ForeignKey, self).__init__(*args, **kwargs)

    def contribute_to_class(self, cls, name):
        super(ForeignKey, self).contribute_to_class(cls, name)

        class MyManager(Manager):
            def __call__(self2, obj):
                return cls._default_manager.filter(**{self.name: obj})

        cls.add_to_class(self.manager_name, MyManager())


As you can see, a lot of the code is the same as before. Most of the new code is in getting the model's name, either through _meta, or through the last part of the string (i.e. User in "auth.User"). And now you will have a manager on your class, named either for_X, where X is the lowercased name of the model the foreign key points to, or whatever the manager_name kwarg is.

So if your model has this:

teacher = LazyForeignKey(Teacher)


You would be able to do:

MyModel.for_teacher(Teacher.objects.get(id=3))


That's all for today. Since tonight is election night, tomorrow I'll probably post about my application, election-sim, and about PyGTK and PyProcessing (aka multiprocessing).

Monday, November 3, 2008

Lazy User Foreign Keys(this is a double entendre)

A *very* common pattern in Django is for models to have a foreign key to django.contrib.auth.User for the owner (or submitter, or whatever other relation with User) and then to have views that filter this down to the related objects for a specific user (often the currently logged in user). If we think ahead, we can make a manager with a method to filter down to a specific user. But since we are really lazy, we are going to make a field that automatically generates the foreign key to User, and automatically gives us a manager to filter for a specific User, and we can reuse this for all types of models.

So what does the code look like:

from django.db.models import ForeignKey, Manager

from django.contrib.auth.models import User

class LazyUserForeignKey(ForeignKey):
    def __init__(self, **kwargs):
        kwargs['to'] = User
        self.manager_name = kwargs.pop('manager_name', 'for_user')
        super(ForeignKey, self).__init__(**kwargs)

    def contribute_to_class(self, cls, name):
        super(ForeignKey, self).contribute_to_class(cls, name)

        class MyManager(Manager):
            def __call__(self2, user):
                return cls._default_manager.filter(**{self.name: user})

        cls.add_to_class(self.manager_name, MyManager())


So now, what does this do?

We are subclassing ForeignKey. In __init__ we make sure to is set to User, and we also set self.manager_name equal to either the manager_name kwarg, if provided, or 'for_user'. contribute_to_class gets called by the model metaclass to add each item to the model itself. So here we call the parent method, to get the ForeignKey itself set on the model, and then we create a new subclass of Manager. We define a __call__ method on it, which lets us call an instance as if it were a function. __call__ returns the QuerySet you would get by filtering the class's default manager for rows where the user field is equal to the given user. And then we add it to the class with the name provided earlier.

And that's all. Now we can do things like:

MyModel.for_user(request.user)

Next post we'll probably look at making this more generic.