Tuesday, November 10, 2009

The State of MultiDB (in Django)

As you, the reader, may know this summer I worked for the Django Software Foundation via the Google Summer of Code program. My task was to implement multiple database support for Django. Assisting me in this task were my mentors Russell Keith-Magee and Nicolas Lara (you may recognize them as the people responsible for aggregates in Django). By the standards of the Google Summer of Code project my work was considered a success, however, it's not yet merged into Django's trunk, so I'm going to outline what happened, and what needs to happen before this work is considered complete.

Most of the major things happened, settings were changed from a series of DATABASE_* to a DATABASES setting that's keyed by DB aliases and who's values are dictionaries containing the usual DATABASE* options, QuerySets grew a using() method which takes a DB alias and says what DB the QuerySet should be evaluated against, save() and delete() grew similar using keyword arguments, a using option was added to the inner Meta class for models, transaction support was expanded to include support for multiple databases, as did the testing framework. In terms of internals almost every internal DB related function grew explicit passing of the connection or DB alias around, rather than assuming the global connection object as they used to. As I blogged previously ManyToMany relations were completely refactored. If it sounds like an awful lot got done, that's because it did, I knew going in that multi-db was a big project and it might not all happen within the confines of the summer.

So if all of that stuff got done, what's left? Right before the end of the GSOC time frame Russ and I decided that a fairly radical rearchitecting of the Query class (the internal datastructure that both tracks the state of an operation and forms its SQL) was needed. Specifically the issue was that database backends come in two varieties. One is something like a backend for App Engine, or CouchDB. These have a totally different design than SQL, they need different datastructures to track the relevant information, and they need different code generation. The second type of database backend is one for a SQL database. By contrast these all share the same philosophies and basic structure, in most cases their implementation just involves changing the names of database column types or the law LIMIT/OFFSET is handled. The problem is Django treated all the backends equally. For SQL backends this meant that they got their own Query classes even though they only needed to overide half of the Query functionality, the SQL generation half, as the datastructure part was identical since the underlying model is the same. What this means is that if you make a call to using() on a QuerySet half way through it's construction you need to change the class of the Query representation if you switch to a database with a different backend. This is obviously a poor architecture since the Query class doesn't need to be changed, just the bit at the end that actually constructs the SQL. To solve this problem Russ and I decided that the Query class should be split into two parts, a Query class that stores bits about the current query, and a SQLCompiler which generated SQL at the end of the process. And this is the refactoring that's holding up the merger of my multi-db work primarily.

This work is largely done, however the API needs to be finalized and the Oracle backend ported to the new system. In terms of other work that needs to be done, GeoDjango needs to be shown to shown to still work (or fixed). In my opinion everything else on the TODO list (available here, please don't deface) is optional for multi-db to be merge ready, with the exception of more example documentation.

There are already people using the multi-db branch (some even in production), so I'm confident about it's stability. For the next 6 weeks or so (until the 1.2 feature deadline), my biggest priority is going to be getting this branch into a merge ready state. If this is something that interests you please feel free to get in contact with me (although if you don't come bearing a patch I might tell you that I'll see you in 6 weeks ;)), if you happen to find bugs they can be filed on the Django trac, with version "soc2009/multidb". As always contributors are welcome, you can find the absolute latest work on my Github and a relatively stable version in my SVN branch (this doesn't contain the latest, in progress, refactoring). Have fun.

7 comments:

  1. thank you so much for working on this.

    ReplyDelete
  2. Sounds very exciting! Thanks for your hard work on this feature. In my opinion, this addresses one of the biggest problems with Django. Best of luck getting this code merged in by the 1.2 deadline. =)

    ReplyDelete
  3. Nice work! I Look forward to the final product....

    ReplyDelete
  4. Good luck getting it to 1.2, can't wait to play with this.

    Awesome work

    ReplyDelete
  5. Thank you again for working on this! We can't wait until it's merged in!

    ReplyDelete
  6. Good luck, Alex! 1.2 without multidb is not going to be the same 1.2 it could be... We will be testing your branch next week with a fresh integration project, so hopefully something interesting will come up.

    ReplyDelete

Note: Only a member of this blog may post a comment.