We want to write a function that returns the number of items each person has relinked, excluding items they themselves linked. I started off with the simplest implementation I could, and this was the end result:
from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message


def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(int)
    for message in Message.objects.all().order_by('-time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                # One SQL query per link: was this word linked earlier by
                # someone other than the current speaker?
                if Message.objects.filter(time__lt=message.time).filter(message__contains=word).exclude(speaker=message.speaker).count():
                    links[message.speaker] += 1
    links = sorted(links.iteritems(), key=itemgetter(1), reverse=True)
    return links
Here I iterated over the messages, and for each one I went through each of its words; if any of them started with "http" (the definition of a link for my purposes), I checked whether it had ever been linked before by someone other than the author of the current message.
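To make the link detection concrete, here is a quick sketch of what that inner check operates on, using a made-up message (the sample text is hypothetical; the import is the same one used above):

from django.utils.html import word_split_re

sample = "did you see http://example.com/foo yesterday?"
print([w for w in word_split_re.split(sample) if w.startswith('http')])
# ['http://example.com/foo']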
This took about 4 minutes to execute on my dataset and ran about 10,000 SQL queries. That is clearly unacceptable: you can't have a view that takes that long to render, or that hits your DB that hard. Even with aggressive caching this would have been unmaintainable. Further, this algorithm is O(n**2) or thereabouts, so as my dataset grows this would have gotten worse exponentially.
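For anyone who wants to reproduce those numbers, a rough way to measure them is something like the following (this assumes DEBUG=True, since Django only records executed queries in debug mode):

import time

from django.db import connection, reset_queries

reset_queries()
start = time.time()
calculate_relinks()
print("%.1f seconds, %d queries" % (time.time() - start, len(connection.queries)))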
By changing this around, however, I was able to achieve far better results:
from collections import defaultdict
from operator import itemgetter

from django.utils.html import word_split_re

from logger.models import Message


def calculate_relinks():
    """
    Calculate the number of times each individual has linked something that was
    linked previously in the course of the chat.
    """
    links = defaultdict(set)
    counts = defaultdict(int)
    for message in Message.objects.all().filter(message__contains="http").order_by('time').iterator():
        words = word_split_re.split(message.message)
        for word in words:
            if word.startswith('http'):
                # Check the in-memory caches instead of hitting the DB.
                if any([word in links[speaker] for speaker in links if speaker != message.speaker]):
                    counts[message.speaker] += 1
                links[message.speaker].add(word)
    counts = sorted(counts.iteritems(), key=itemgetter(1), reverse=True)
    return counts
Here what I do is go through each of the messages that contain the string "http" (this is already a huge advantage, since it means we process in Python only about 1/6 of the messages we originally did). For each message we go through each of its words, and for each word that is a link we check whether any other person has said it by looking in the caches we maintain in Python; if someone has, we increment the current speaker's count. Finally, we add the link to that person's cache.
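Stripped of the Django parts, the cache logic looks like this on toy data (speakers and links are made up for illustration):

from collections import defaultdict

links = defaultdict(set)
counts = defaultdict(int)
for speaker, word in [("alice", "http://a"), ("bob", "http://a"), ("alice", "http://b")]:
    # Has anyone other than the current speaker linked this word before?
    if any([word in links[s] for s in links if s != speaker]):
        counts[speaker] += 1
    links[speaker].add(word)
print(dict(counts))
# {'bob': 1} -- bob relinked something alice had already linked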
By comparison, this executes in 0.3 seconds and runs only 1 SQL query, and it will scale linearly (as well as is possible). For reference, both of these functions are compiled using Cython. That ultimately takes almost no work to do, and for computationally heavy operations it can provide a huge boon.
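For the curious, compiling these with Cython really is minimal work; a sketch of the setup.py involved (the module name relinks is just for illustration):

from distutils.core import setup
from distutils.extension import Extension

from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    # Compile relinks.pyx into a C extension module named "relinks".
    ext_modules=[Extension("relinks", ["relinks.pyx"])],
)

Running python setup.py build_ext --inplace then produces the compiled extension.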
Brian: The list comp there is redundant; just use a genexp (which can short-circuit the searching also).
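For concreteness, the generator-expression version of that check would read:

if any(word in links[speaker] for speaker in links if speaker != message.speaker):
    counts[message.speaker] += 1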
William: You're applying a filter() on all() when you could skip the call to all() and filter() the Manager directly.
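That is, the loop's first line would become:

for message in Message.objects.filter(message__contains="http").order_by('time').iterator():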
Brian:
The reason I did a list comprehension in place of a generator expression is that I am compiling these as C extensions using Cython, which doesn't support generator expressions.
William:
This is true, but it has almost no overhead whatsoever; I tend to find it slightly more readable, although that is also completely negligible.
"Further this algorithm is O(n**2) or thereabouts so as my dataset grows this would have gotten worse exponentially."
ReplyDeleteNo, 2**n would be exponentially. This is not a context where you can use the colloquial meaning.