
Rise of the Machine

Last week Norman Nunley got me interested in the ICFP challenge 2006, despite it being over. The challenge involved implementing a Universal Machine following a (hilarious) spec; the machine was then used to run a program you were given, which in turn told you what to do from there.

I had never written a virtual machine before: I always imagined it too hard to even try. I was wrong. It took me around 7–8 hours, spread over an evening, the following morning, and a couple of hours at work (sorry boss, it was too addictive; I blame Norman), to get a working machine. I did most of the work alone, but I needed Norman’s help to clarify parts of the spec and with some of the debugging.

I was amazed when it turned out this “huge scary thing”, which I’d imagined a VM to be, clocked in at less than 300 lines of C. It wasn’t fast, though. A self-test and benchmarking program for the UM was available from the ICFP website. Norman’s UM ran this benchmark in minutes, but mine, distressingly, took over 10 hours.

Last Saturday, after eliminating a few variables and assignments in the hot path and using macros instead, I managed to get the time for the benchmark program down to just over two hours. A whole lot better, but still way slower than others’ VMs. From basic instrumentation I knew that the benchmark program does a lot of memory allocation and deallocation, and I surmised that this was the cause of my performance problems. Over the course of the weekend I tried to speed my VM up further by caching previously allocated arrays in various ways. Whatever I tried, however, simply made it slower. It was a humbling experience that reminded me that I should have faith in the library implementors; they’re most likely smarter than me.
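To give a flavour of the macro trick, here is a rough sketch of the kind of decode macros I mean. The names are hypothetical rather than lifted from my source, but the field layout follows the UM spec: the operator number sits in the top four bits of each 32-bit platter, the three register indices in the low nine bits, and the orthography (load-immediate) operator uses its own layout.

```c
#include <stdint.h>

/* Standard operators: opcode in bits 28..31, registers A, B, C in
 * bits 6..8, 3..5 and 0..2 respectively. */
#define OP(insn)    ((uint32_t)(insn) >> 28)
#define REG_A(insn) (((insn) >> 6) & 7)
#define REG_B(insn) (((insn) >> 3) & 7)
#define REG_C(insn) ((insn) & 7)

/* Orthography (operator 13): register A in bits 25..27 and a 25-bit
 * literal value in bits 0..24. */
#define ORTHO_A(insn)   (((insn) >> 25) & 7)
#define ORTHO_VAL(insn) ((insn) & 0x1FFFFFFu)
```

Compared with calling small decode functions, macros like these keep the dispatch loop free of call overhead without relying on the compiler to inline everything.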

The performance of my VM continued to bug me into Monday morning. My VM keeps an array of pointers to unsigned-integer arrays, and the benchmark program allocates a whole bunch of these leaf arrays. A leaf array is identified by its index into the parent array, so I can never move a leaf to a different index, and when a leaf is freed I’m left with a hole; the parent array therefore becomes sparse. My mistake was to use a sequential scan to find the first free slot in the parent array when allocating a new leaf; as it turned out, this was really expensive. When I instead allocated new leaf arrays at the end of the parent array, simply letting it become very sparse at the front, my VM suddenly ran the benchmark in just over three minutes (see the sketch below). A considerable speedup, but it didn’t come for free: since I never reused any of the slots in the parent array, memory usage grew steeply to about 140 MB, up from around 3 MB throughout the entire run previously.
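The following sketch contrasts the two allocation strategies. The names and the growth policy are illustrative rather than copied from my VM, but the idea is the same: the scanning version walks the parent array looking for a hole, while the appending version just hands out the next slot at the end.

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t **arrays;      /* parent array of leaf pointers */
static size_t     capacity;    /* allocated slots in arrays[]   */
static size_t     next_free;   /* first never-used slot         */

/* Fast version: always take a fresh slot at the end and let the front
 * of the parent array stay sparse. O(1), but freed slots are never
 * reused (hence the memory growth). Error handling omitted. */
static size_t alloc_slot_append(void)
{
    if (next_free == capacity) {
        capacity = capacity ? capacity * 2 : 1024;
        arrays   = realloc(arrays, capacity * sizeof *arrays);
    }
    return next_free++;
}

/* Slow version: scan from the start for a hole left by a freed leaf.
 * O(n) per allocation, and n keeps growing as the benchmark churns
 * through arrays. */
static size_t alloc_slot_scan(void)
{
    for (size_t i = 1; i < next_free; i++)   /* slot 0 holds the program */
        if (arrays[i] == NULL)
            return i;
    return alloc_slot_append();              /* no hole found: append */
}
```

The appending version is what brought the benchmark down to a few minutes; never reclaiming slots is also exactly where the 140 MB figure comes from.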

The moral is clear: find your bottlenecks, then eliminate them; don’t try to eliminate imagined bottlenecks (I already knew this, but it doesn’t hurt to be reminded once in a while). I originally assumed that my performance issues were related to memory allocation. I was partly right, but the problem was not malloc() or free() being slow, as I had assumed. It was down to my stupid algorithm for finding somewhere to stash the newly allocated array.

Update: By making use of C99’s flexible array member and introducing an extremely simple free list, I’ve now managed to get the memory usage for my VM down to 2.16 MB (as reported by top on my Powerbook) for the benchmark run, at a cost of adding about 40–50 seconds to the run time. The flexible array member lets me do only one allocation instead of two when allocating an array wrapped in a struct; this can yield large memory savings if you’re allocating a lot of structs containing short arrays. The free list simply holds indices to free slots in the parent array.
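For illustration, a minimal sketch of both ideas; the struct and function names are hypothetical, not taken from my VM:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* With a C99 flexible array member the header and the platters share
 * a single malloc'd block, instead of a struct plus a separately
 * allocated data pointer. */
struct um_array {
    uint32_t size;
    uint32_t platters[];            /* flexible array member */
};

static struct um_array *array_new(uint32_t size)
{
    struct um_array *a = malloc(sizeof *a + size * sizeof (uint32_t));
    a->size = size;
    memset(a->platters, 0, size * sizeof (uint32_t));
    return a;
}

/* The free list is nothing more than a stack of parent-array indices
 * whose leaves have been abandoned. */
static size_t *free_list;
static size_t  free_count, free_cap;

static void free_list_push(size_t idx)
{
    if (free_count == free_cap) {
        free_cap  = free_cap ? free_cap * 2 : 256;
        free_list = realloc(free_list, free_cap * sizeof *free_list);
    }
    free_list[free_count++] = idx;
}

/* Returns a reusable index, or (size_t)-1 when the caller should take
 * a fresh slot at the end of the parent array instead. */
static size_t free_list_pop(void)
{
    return free_count ? free_list[--free_count] : (size_t)-1;
}
```

Popping from the stack is O(1), so slot reuse comes back without reintroducing the sequential scan.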

Update, 2007/09/03: Here’s my Virtual Machine, should you wish to take a look.