PyPy recently posted some interesting benchmarks from the computer language shootout, and in my last post about Unladen Swallow I described a patch that would hopefully be landing soon. I decided it would be interesting to benchmarks something with this patch. For this I used James Tauber's Mandelbulb application, at both 100x100 and 200x200. I tested CPython, Unladen Swallow Trunk, Unladen Swallow Trunk with the patch, and a recent PyPy trunk (compiled with the JIT). My results were as follows:
100
CPython 2.6.4 17s
Unladen Swallow Trunk 16s
Unladen Swallow Trunk + Patch 13s
PyPy Trunk 10s
200
CPython 2.6.4 64s
Unladen Swallow Trunk 52s
Unladen Swallow Trunk + Patch 49s
PyPy 46s
Interesting results. At 100x100 PyPy smokes everything else, and the patch shows a clear benefit for Unladen. However, at 200x200 both PyPy and the patch show diminishing returns. I'm not clear on why this is, but my guess is that something about the increased size causes a change in the parameters that makes the generated code less efficient for some reason.
It's important to note that Unladen Swallow has been far less focussed on numeric benchmarks than PyPy, instead focusing on more web app concerns (like template languages). I plan to benchmark some of these as time goes on, particularly after PyPy merges their "faster-raise" branch, which I'm told improves PyPy's performance on Django's template language dramatically.
Sunday, November 22, 2009
Saturday, November 21, 2009
Things College Taught me that the "Real World" Didn't
A while ago Eric Holscher blogged about things he didn't learn in college. I'm going to take a different spin on it, looking at both things that I did learn in school that I wouldn't have learned else where (henceforth defined as my job, or open source programming), as well as thinks I learned else where instead of at college.
Things I learned in college:
Big O notation, and algorithm analysis. This is the biggest one, I've had little cause to consider this in my open source or professional work, stuff is either fast or slow and that's usually enough. Learning rigorous algorithm analysis doesn't come up all the time, but every once in a while it pops up, and it's handy.
C++. I imagine that I eventually would have learned it myself, but my impetus to learn it was that's what was used for my CS2 class, so I started learning with the class then dove in head first. Left to my own devices I may very well have stayed in Python/Javascript land.
Finite automaton and push down automaton. I actually did lexing and parsing before I ever started looking at these in class (see my blog posts from a year ago) using PLY, however, this semester I've actually been learning about the implementation of these things (although sadly for class projects we've been using Lex/Yacc).
Things I learned in the real world:
Compilers. I've learned everything I know about compilers from reading my papers from my own interest and hanging around communities like Unladen Swallow and PyPy (and even contributing a little).
Scalability. Interesting this is a concept related to algorithm analysis/big O, however this is something I've really learned from talking about this stuff with guys like Mike Malone and Joe Stump.
APIs, Documentation. These are the core of software development (in my opinion), and I've definitely learned these skills in the open source world. You don't know what a good API or documentation is until it's been used by someone you've never met and it just works for them, and they can understand it perfectly. One of the few required, advanced courses at my school is titled, "Software Design and Documentation" and I'm deathly afraid it's going to waste my time with stuff like UML, instead of focusing on how to write APIs that people want to use and documentation that people want to read.
So these are my short lists. I've tried to highlight items that cross the boundaries between what people traditionally expect are topics for school and topics for the real world. I'd be curious to hear what other people's experience with topics like these are.
Thursday, November 19, 2009
Another Pair of Unladen Swallow Optimizations
Today a patch of mine was committed to Unladen Swallow. In the past weeks I've described some of the optimizations that have gone into Unladen Swallow, in specific I looked at removing the allocation of an argument tuple for C functions. One of the "on the horizon" things I mentioned was extending this to functions with a variable arity (that is the number of arguments they take can change). This has been implemented for functions that take a finite range of argument numbers (that is, they don't take *args, they just have a few arguments with defaults). This support was used to optimize a number of builtin functions (dict.get, list.pop, getattr for example).
However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization, it was currently designated as a METH_VARARGS, but in fact it's argument count was a finite range. After about of dozen lines of code, to change the argument parsing, I ran the benchmark again, comparing it a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.
Another optimization I want to look at, which hasn't landed yet, is one of optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop, however for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example) then optimized code is emitted. This optimized code checks the types of both the arguments that they are the expected ones, if they are then the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithm code this optimization can show huge improvements.
The PyPy project has recently blogged about the results of the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work, hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves, I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.
However, there were still a number of functions that weren't updated for this support. I initially started porting any functions I saw, but it wasn't a totally mechanical translation so I decided to do a little profiling to better direct my efforts. I started by using the cProfile module to see what functions were called most frequently in Unladen Swallow's Django template benchmark. Imagine my surprise when I saw that unicode.encode was called over 300,000 times! A quick look at that function showed that it was a perfect contender for this optimization, it was currently designated as a METH_VARARGS, but in fact it's argument count was a finite range. After about of dozen lines of code, to change the argument parsing, I ran the benchmark again, comparing it a control version of Unladen Swallow, and it showed a consistent 3-6% speedup on the Django benchmark. Not bad for 30 minutes of work.
Another optimization I want to look at, which hasn't landed yet, is one of optimize various operations. Right now Unladen Swallow tracks various data about the types seen in the interpreter loop, however for various operators this data isn't actually used. What this patch does is check at JIT compilation time whether the operator site is monomorphic (that is there is only one pair of types ever seen there), and if it is, and it is one of a few pairings that we have optimizations for (int + int, list[int], float - float for example) then optimized code is emitted. This optimized code checks the types of both the arguments that they are the expected ones, if they are then the optimized code is executed, otherwise the VM bails back to the interpreter (various literature has shown that a single compiled optimized path is better than compiling both the fast and slow paths). For simple algorithm code this optimization can show huge improvements.
The PyPy project has recently blogged about the results of the results of some benchmarks from the Computer Language Shootout run on PyPy, Unladen Swallow, and CPython. In these benchmarks Unladen Swallow showed that for highly algorithmic code (read: mathy) it could use some work, hopefully patches like this can help improve the situation markedly. Once this patch lands I'm going to rerun these benchmarks to see how Unladen Swallow improves, I'm also going to add in some of the more macro benchmarks Unladen Swallow uses to see how it compares with PyPy in those. Either way, seeing the tremendous improvements PyPy and Unladen Swallow have over CPython gives me tremendous hope for the future.
Labels:
django,
programming-languages,
pypy,
python,
unladen-swallow
Announcing django-admin-histograms
This is just a quick post because it's already past midnight here. Last week I realized some potentially useful code that I extracted from the DjangoDose and typewar codebases. Basically this code let's you get simple histogram reports for models in the admin, grouped by a date field. After I released it David Cramer did some work to make the code slightly more flexible, and to provide a generic templatetag for creating these histograms anywhere. The code can be found on github, and if you're curious what it looks like there's a screenshot on the wiki. Enjoy.
Tuesday, November 17, 2009
Writing a Lexer
People who have been reading this blog since last year (good lord) may recall that once upon a time I did a short series of posts on lexing and parsing using PLY. Back then I was working on a language named Al. This past week or so I've started working on another personal project (not public yet) and I've once again had the need to lex things, but this time I wrote my lexer by hand, instead of using any sort of generator. This has been an exceptional learning experience, so I'd like to pass some of that on to you.
The first thing to note is that writing a lexer is a great place to TDD (test driven development), I've rewritten various parts of my lexer five or more times, I've needed my tests to keep me sane. Got your tests written? Ok it's time to dive right into our lexer.
I've structured my lexer as a single class that takes an input string, and has a parse method which returns a generator that yields tokens (tokens are just a namedtuple with a name and value field). The parser has two important attributes, state which is a string that says what state the lexer is in (this is used for tokens that are more than one character long), and current_val which is a list containing characters that will eventually become the value for the current token being found.
The parse method iterates through characters in the text and then it checks, if the parser has a state (self.state is not None) it does getattr(self, self.state)(character). Otherwise it calls self.generic(character). Then the various "state methods" are responsible for mutating self.current_val and self.state and returning a Token. So for example the string state looks like this:
If the character is a quote then we're closing our string so we return our string Symbol, reset the current_val and state. If the character is a \ then we switch into a string_escaped state which knows to handle the character as a literal and then go back to string state. If the character is anything else then we just append it to the current_val, it will get handled at the end of the string.
I've found this to be an exceptionally powerful method, and it makes my end result code very readable. Hopefully I'll be able to reveal my project in the coming weeks, as I'm very excited about it, even if it's not ready I'll continue to share these lessons learned as I go.
The first thing to note is that writing a lexer is a great place to TDD (test driven development), I've rewritten various parts of my lexer five or more times, I've needed my tests to keep me sane. Got your tests written? Ok it's time to dive right into our lexer.
I've structured my lexer as a single class that takes an input string, and has a parse method which returns a generator that yields tokens (tokens are just a namedtuple with a name and value field). The parser has two important attributes, state which is a string that says what state the lexer is in (this is used for tokens that are more than one character long), and current_val which is a list containing characters that will eventually become the value for the current token being found.
The parse method iterates through characters in the text and then it checks, if the parser has a state (self.state is not None) it does getattr(self, self.state)(character). Otherwise it calls self.generic(character). Then the various "state methods" are responsible for mutating self.current_val and self.state and returning a Token. So for example the string state looks like this:
def string(self, ch):
if ch == '"':
sym = Symbol("STRING", "".join(self.current_val))
self.current_val = []
self.state = None
return sym
elif ch == "\\":
self.state = "string_escaped"
else:
self.current_val.append(ch)
If the character is a quote then we're closing our string so we return our string Symbol, reset the current_val and state. If the character is a \ then we switch into a string_escaped state which knows to handle the character as a literal and then go back to string state. If the character is anything else then we just append it to the current_val, it will get handled at the end of the string.
I've found this to be an exceptionally powerful method, and it makes my end result code very readable. Hopefully I'll be able to reveal my project in the coming weeks, as I'm very excited about it, even if it's not ready I'll continue to share these lessons learned as I go.
Monday, November 16, 2009
My Next Blog
I've been using this Blogspot powered blog for over a year now, and it is starting to get a little long in the tooth. Therefore I've been planning on moving to a new, shinny, blog of my own in the coming months. Specifically it's going to be my goal to get myself 100% migrated over my winter break. Therefore I've started building myself a list of features I want:
- Able to migrate all my old posts. This isn't going to be a huge deal, but it has to happen as I have quite a few posts I don't want to lose.
- Accepts restructured text. I'm sick of writing my posts in reST and then converting it into HTML for Blogspot.
- Pretty code highlighting.
- Disqus for comments. I don't want to have to deal with spam or anything else, let users post with Disqus and they can deal with all the trouble.
- Looks decent. Design is a huge weak point for me, so most of my time is going to be dedicated to this I think.
- Django powered!
That's my pony list thus far. There's a good bet I'll use django-mingus since I've hear such good things about it, but for now I'm just dreaming of being able to write these posts in reST.
Sunday, November 15, 2009
Initial Review: Python Essential Reference
Disclosure: I received a free review copy of Python Essential Reference, Fourth Edition.
I've never really used reference material, I've always loved tutorials, howtos, and guides for learning things, but I've usually shunned reference material in favor of reading the source. Therefore, I didn't think I'd have a huge use for this book. However, so far (I've read about half the book so far) I've found it to be an exceptional resource, and I definitely plan on keeping it on my bookshelf.
The first third or so of the book is a reference on the syntax and other basic constructs of Python, it's probably not the part of the book you'll be consulting very frequently if you're an experienced Python programmer, however the end of this section is a bit of "Testing, Debugging, Profiling, and Tuning", this I can see myself flipping back to, as it extensively documents the doctests, unittests, pdb, cProfile, and dis modules.
The next third of the book is all about the Python library, including both the builtins and the standard library. This section is organized by functionality and I can definitely see myself using it. For example it has sections on "Python Runtime Services" (like atexit, gc, marshal, and weakref), "Data Structures, Algorithms, and Code Simplification" (bisect, collections, heapq for example), "String and Text Handling" (codecs, re, struct), and "Python Database Access" (PEP249, sqlite, and dbm). There's more, but this is as far as I've read. Reading through like a novel each of these sections has exposed me to things I wasn't aware of or don't use as frequently as I should, and I plan on using this book as a resource for exploring them. David Beazley has painstakingly documented the details of these modules, paying particular attention to the functions and classes you are likely to need most.
All in all I've found the Python Essential Reference to be a good book, especially for people who like reference documentation. Depending on how you use Python this book can serve as an excellent eye opener into other parts of the language and standard library, and for me I think that's where a ton of value will come from, as a day to day Python user I don't need a reference for most of the language, but for the bits it's introducing me to, having it handy will be a leg up.
I've never really used reference material, I've always loved tutorials, howtos, and guides for learning things, but I've usually shunned reference material in favor of reading the source. Therefore, I didn't think I'd have a huge use for this book. However, so far (I've read about half the book so far) I've found it to be an exceptional resource, and I definitely plan on keeping it on my bookshelf.
The first third or so of the book is a reference on the syntax and other basic constructs of Python, it's probably not the part of the book you'll be consulting very frequently if you're an experienced Python programmer, however the end of this section is a bit of "Testing, Debugging, Profiling, and Tuning", this I can see myself flipping back to, as it extensively documents the doctests, unittests, pdb, cProfile, and dis modules.
The next third of the book is all about the Python library, including both the builtins and the standard library. This section is organized by functionality and I can definitely see myself using it. For example it has sections on "Python Runtime Services" (like atexit, gc, marshal, and weakref), "Data Structures, Algorithms, and Code Simplification" (bisect, collections, heapq for example), "String and Text Handling" (codecs, re, struct), and "Python Database Access" (PEP249, sqlite, and dbm). There's more, but this is as far as I've read. Reading through like a novel each of these sections has exposed me to things I wasn't aware of or don't use as frequently as I should, and I plan on using this book as a resource for exploring them. David Beazley has painstakingly documented the details of these modules, paying particular attention to the functions and classes you are likely to need most.
All in all I've found the Python Essential Reference to be a good book, especially for people who like reference documentation. Depending on how you use Python this book can serve as an excellent eye opener into other parts of the language and standard library, and for me I think that's where a ton of value will come from, as a day to day Python user I don't need a reference for most of the language, but for the bits it's introducing me to, having it handy will be a leg up.
Subscribe to:
Posts (Atom)