Last week Norman Nunley got me
interested in the ICFP challenge 2006, despite it
being over. It involved implementing a Universal Machine following a (hilarious) spec.
This was needed to run a program that you were given; this program would then give you
instructions on what to do from there.
I had never written a virtual machine before: I always imagined it too hard to even try. I
was wrong. It took me around 7-8 hours, spread over an evening, the following morning and
a couple of hours at work (sorry boss, it was too addictive; I blame Norman), to get a
working machine. I did most of the work alone, but I needed Norman's help to clarify parts
of the spec and with some of the debugging.
I was amazed when it turned out this “huge scary thing”, which I’d imagined a VM to be,
clocked in at less than 300 lines of C. It wasn’t fast though. A self-test and
benchmarking program for the UM was available from the ICFP
website. Norman's UM ran this benchmark in minutes, but mine, distressingly, took over 10
hours.
Last Saturday, after eliminating a few variables and assignments in the hot path and using
macros instead, I managed to get the time for the benchmark program down to just over two
hours. A whole lot better, but still way slower than others’ VMs. From basic
instrumentation I knew that the benchmark program does a lot of memory allocation and
deallocation, and I surmised that this was the cause of my performance problems. During the
course of the weekend I tried various ways of speeding my VM up further by caching
previously allocated arrays. Whatever I tried, however, simply made it
slower. It was a humbling experience that reminded me that I should have faith in the
library implementors; they're most likely smarter than me.
The performance of my VM continued to bug me into Monday morning. My VM has a parent
array of pointers to unsigned integer arrays, and the benchmark program allocates a whole
bunch of these leaf arrays. The running program identifies a leaf array by its index into
the parent array, so I cannot move a leaf to a different slot; when a leaf is freed I
therefore end up with a sparse parent array. My mistake was using a sequential scan to
find the first free slot in the parent array when allocating a new leaf; as it turned out,
this was really expensive. When I instead allocated new leaf arrays at the end of the
parent array, simply allowing it to become very sparse at the front, my VM suddenly
ran the benchmark in just over three minutes. A considerable speedup, but it didn't come
for free. As I no longer reused any of the slots in the parent array, memory usage grew
steeply to about 140 MB, up from around 3 MB throughout the entire run previously.
The moral is clear: find your bottlenecks, then eliminate them; don't try to eliminate
imagined bottlenecks (I already knew this, but it doesn't hurt to be reminded once in a
while). I originally assumed that my performance issues were related to memory allocation.
I was partly right, but it was not due to
free() being slow, as I had
assumed. It was down to my stupid algorithm for finding somewhere to stash the newly
allocated arrays.
Update: By making use of C99’s flexible array member and introducing an
extremely simple free list I’ve now managed to get the memory usage for my VM down to 2.16
MB (as reported by top on my Powerbook) for the benchmark run, at a cost of adding about
40–50 seconds to the run time. The flexible array member feature allows me to do only one
allocation instead of two when allocating an array wrapped in a struct; this can yield
large memory savings if you’re allocating a lot of structs containing short arrays. The
free list simply holds indices to free slots in the parent array.
Update, 2007/09/03: Here’s my Virtual
Machine, should you wish to take a look.