Monday, November 24, 2008

A timeline view in Django

One thing a lot of people want to do in Django is to have a timeline view, that shows all the objects of a given set of models ordered by a common key. Unfortunately the Django ORM doesn't have a way of representing this type of query. There are a few techniques people use to solve this. One is to have all of the models inherit from a common baseclass that stores all the common information, and has a method to get the actual object. The problem with this is that it could execute either O(N) or O(N*k) queries, where N is the number of items and k is the number of models. It's N if your baseclass has the subtype it is stored on it, in which case you can directly grab it, else it's N*k since you have to try each type. Another approach is to use a generic relation, this will also need O(N) queries since you need to get the related object for each generic one. However, there's a better solution.

What we can do is use get a queryset for each of the models we want to display(O(k) queries), sorted on the correct key, and then use a simple merge to combine all of these querysets into a single list, comparing on a given key. While this technically may do more operations than the other methods, it does fewer database queries, and this is often the most difficult portion of your application to scale.

Let's say we have 3 models, new tickets, changesets, and wikipage edits(what you see in a typical Trac install). We can get our querysets and then merge them like so:

def my_view(request):
tickets = Ticket.objects.order_by('create_date')
wikis = WikiEdit.objects.order_by('create_date')
changesets = Changeset.objects.order_by('create_date')
objs = merge(tickets, wikis, changesets, field='create_date')
return render_to_response('my_app/template.html', {'objects': objs})


Now we just need to write our merge function:

def merge_lists(left, right, field=None):
i, j = 0, 0
result = []
while (i < len(left) and j < len(right)):
if getattr(left[i], field) <= getattr(right[j], field):
result.append(left[i])
i += 1
else:
result.append(right[j])
j += 1
result.extend(left[i:])
result.extend(right[j:])
return result

def merge(*querysets, **kwargs):
field = kwargs.pop('field')
if field is None:
raise TypeError('you need to provide a key to do comparisons on')
if len(querysets) == 1:
return querysets[0]

qs = [list(x) for x in querysets]
q1, q2 = qs.pop(), qs.pop()
result = merge_lists(q1, q2, field)
for q in qs:
result = merge_lists(result, q)
return result

There might be a more efficient way to write our merge function, but for now it merges together an arbitrary number of querysets on a given key.

And that's all their is too it. If you see a good way to make the merge function more efficient let me know, I would have liked to use Python's included heapq module, but it doesn't have a way to use a custom comparison function that I saw.

12 comments:

  1. How would you approach the issue of displaying different objects in the template depending on the type?

    ReplyDelete
  2. David: I would either use either the ducktype approach(give them all methods by the same name that output the correct information), or restrict yourself to just outputting information that is common to all the models.

    ReplyDelete
  3. You can make your merge function a more efficient by not turning it into a list. Particularly useful if you're not likely to consume all the input or the results are large (so, e.g, you're using qs.iterator()).

    Merge sort doesn't require knowing the length of the list. It only requires knowing which of the various "head" elements to use next. For N iterators, have a pool of N items that are the heads of each iterator. Storing them as a sorted list of pairs (head, iterator) works well. Then your merging iterator yields the head element of the first tuple in the list, then pops off the next head from that iterator and reinserts the (new_head, iterator) pair back into the sorted list. When popping off the next head throws StopIteration, you just throw that iterator away. Eventually the list will be exhausted and you're done.

    Given that your initial iterators are already sorted, the final merge will be O(# of results) if you do it in the above fashion (each result is handled only a very few times). Your version is still linear in the number of results, but has a worse constant factor than that because you iterate over "results" many times in the process (for N iterators, you essentially process the each result of iterator 1, N times instead of once).

    ReplyDelete
  4. This is why SQL has a UNION operator, right? It'd be nice to take advantage of that, and push the data handling off onto the database server.

    ReplyDelete
  5. For the PyCon proposal system I used the admin.models.LogEntry system which uses generic relations. This limited me to additions, changes, and deletions per model, but that was fine. The forms which modified the models would create log entries. This way it would capture all the events.

    I can then filter on the type of event and type of model. This is how the proposal system change history graph is generated:
    http://us.pycon.org/2009/conference/proposals/stats/

    I would say using a generic relations for event markers is the best way to go. Granted this means you need a greater level of integration, but the end speed and easy of the queries more than makes up for it. A little bit of work in .extra() can allow you to do some cool tricks as well, like return all unique dates with a count on the number of events of each type for that date (if that is all you want).

    ReplyDelete
  6. It looks to be very complex, here is what I have done for my merged feed (posts+thoughts):

    self.items = list(Post.published.all()[:5]) + list(Thought.published.all()[:10])
    self.items.sort(key=lambda item: item.publication_date, reverse=True)

    Isn't it easier? (or I probably missed something...)

    ReplyDelete
  7. @David: your method in general doesn't produce the 15 most recent items.

    ReplyDelete
  8. @Luke: thanks, you're absolutely right, but note that the example neither restrict to a number of items.

    ReplyDelete
  9. Cool I have done something like this. . have a couple of models on my blog where I sort by datetime field. The only limitation I found was all the models should have the same field name for datetime ( in your case create_date).

    I have used this snippet (http://www.djangosnippets.org/snippets/65/).

    which works fine for me. For cases where models have different datetime field names create a method like

    def create_date(self):
    self.create_date = self.pub_date

    which should hopefully work.

    ReplyDelete
  10. What abotu paginating using this approach? It's easy if you "allow" variable amounts of items on a page, but if you want to be consistent about the number of items on each page it's harder.

    Another approach is to use generic relations and use separate routines to poulate the content object of said relation.

    def populate(queryset, model):
    qs_pks = [i.pk for i in queryset]
    content_objects = {}
    for co in model._default_manager.filter(pk__in=qs_pks):
    content_objects[co.pk] = co
    for o in queryset:
    o.content_object = content_objects[o.content_object_id]

    ... or something to that effect. I haven't actually tested this as it just popped into my head, but it should work with 1 + N queries, as opposed to the more limited N-query version.

    I don't know :)

    ReplyDelete
  11. Maybe it was just me, but to get this to work correctly I had to change the merge_lists() function. Specifically, reverse the comparison portion (>= instead of <=). No biggy and it's working great, I appreciate the idea. I was just going to do separate RSS feeds, this is much nicer.

    ReplyDelete
  12. I'd rewrite the merge() like this:

    from operator import itemgetter

    def merge(*querysets, **kwargs):
    field = kwargs.pop('field')
    if field is None:
    raise TypeError('you need to provide a key to do comparisons on')
    key = itemgetter(field)
    qs = [sorted(x, key=key) for x in querysets]
    while len(qs) > 1:
    q1, q2 = qs.pop(0), qs.pop(0)
    q3 = merge_lists(q1, q2, key)
    qs.append(q3)
    # Or: qs.append(merge_lists(qs.pop(0), qs.pop(0), key))
    return qs.pop()

    def merge_lists(left, right, key):
    result = []
    while (len(left) and len(right)):
    which_list = (left if key(left[0]) <= key(right[0]) else right)
    result.append(which_list.pop(0))
    return result + left + right

    You can have a look at:

    http://wordaligned.org/articles/merging-sorted-streams-in-python

    for other ideas on sorting streams.

    ReplyDelete

Note: Only a member of this blog may post a comment.