Sunday, November 22, 2009

A Bit of Benchmarking

PyPy recently posted some interesting benchmarks from the computer language shootout, and in my last post about Unladen Swallow I described a patch that would hopefully be landing soon. I decided it would be interesting to benchmark something with this patch. For this I used James Tauber's Mandelbulb application, at both 100x100 and 200x200. I tested CPython, Unladen Swallow trunk, Unladen Swallow trunk with the patch, and a recent PyPy trunk (compiled with the JIT). My results were as follows:
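For reference, the shape of such a benchmark run might look like the sketch below. James Tauber's actual Mandelbulb code isn't reproduced here, so the grid computation is a simplified escape-time Mandelbrot standing in for it; only the timing harness around it matters.

```python
import time

def mandel_grid(size, max_iter=80):
    # Simplified escape-time Mandelbrot over a size x size grid.
    # This is a stand-in for the real Mandelbulb workload, not its code.
    counts = []
    for y in range(size):
        for x in range(size):
            c = complex(-2.0 + 3.0 * x / size, -1.5 + 3.0 * y / size)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            counts.append(n)
    return counts

def bench(size):
    # Wall-clock time for one run at the given grid size.
    start = time.time()
    mandel_grid(size)
    return time.time() - start
```

Running `bench(100)` and `bench(200)` under each interpreter is the whole experiment: same script, same sizes, only the Python implementation changes.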

100x100:

CPython 2.6.4: 17s
Unladen Swallow trunk: 16s
Unladen Swallow trunk + patch: 13s
PyPy trunk: 10s

200x200:

CPython 2.6.4: 64s
Unladen Swallow trunk: 52s
Unladen Swallow trunk + patch: 49s
PyPy trunk: 46s

Interesting results. At 100x100 PyPy smokes everything else, and the patch shows a clear benefit for Unladen Swallow. However, at 200x200 both PyPy and the patch show diminishing returns. I'm not sure why, but my guess is that the larger problem size changes the runtime's parameters in a way that makes the generated code less efficient.

It's important to note that Unladen Swallow has been far less focused on numeric benchmarks than PyPy, concentrating instead on web-app concerns (like template languages). I plan to benchmark some of these as time goes on, particularly after PyPy merges their "faster-raise" branch, which I'm told improves PyPy's performance on Django's template language dramatically.


  1. Hey.

    PyPy currently has fairly poor string-operation performance (things like ''.join(lst) are 2x slower than on CPython). I'm already working on this in a branch, so it's better to wait until that gets merged. Right now PyPy is very likely to be even slower than CPython for templating languages (as it is for Spitfire, for example).
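    A minimal sketch of the kind of operation this comment is describing: timing ''.join over a list of small strings, the pattern templating engines use to assemble output. The function name and sizes here are arbitrary, chosen just for illustration.

    ```python
    import time

    def join_benchmark(num_strings=100000, repeats=10):
        # Build a list of short string fragments and repeatedly join them,
        # mimicking how a template engine assembles its rendered output.
        parts = ["chunk%d" % i for i in range(num_strings)]
        start = time.time()
        for _ in range(repeats):
            result = "".join(parts)
        elapsed = time.time() - start
        return result, elapsed
    ```

    Comparing the `elapsed` value across interpreters for the same inputs shows the 2x gap the commenter mentions.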

  2. Holy crap.

    If I change array("B") into a plain list, PyPy goes a couple of times faster. I guess we should move array from pure Python to RPython at some point. (Until then, it barely makes sense to compare.)
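    To illustrate the swap this comment describes: the inner loop below is identical whether the buffer is an array("B") or a plain list, which is what makes the reported speed difference purely an implementation artifact. The fill function and sizes are made up for illustration.

    ```python
    from array import array
    import time

    def fill(buf, size):
        # Same byte-by-byte writes regardless of the buffer's type.
        for i in range(size):
            buf[i] = i % 256
        return buf

    def bench(make_buf, size=100000):
        # Time the fill loop over a freshly built buffer.
        buf = make_buf(size)
        start = time.time()
        fill(buf, size)
        return time.time() - start

    # array-backed buffer vs. plain-list buffer, same workload
    t_array = bench(lambda n: array("B", [0] * n))
    t_list = bench(lambda n: [0] * n)
    ```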


  3. There are a lot of 3-second differences between programs that stay at 3 seconds across tests. Are those 3 seconds representative of real performance differences, or of something about how each setup is initialized internally?

