November 16, 2006

The game of life

Okay, so, I'm unimaginative, but I like the game of life. It's nifty, looks cool, captures the attention of people (like, say, me) who're attracted to shiny things going "Pling!", and stresses interesting parts of a system. It was one of the first real things we did for Parrot, and for the longest time it was one of the benchmark programs. Who knows, maybe it still is, I dunno.

Anyway, I've been working on a GOL for Tornado. It works too, which is cool, and it's a little surreal to do a full generation with no explicit loops in it anywhere. And, as expected, it's shaken out a number of bugs in Tornado, both in the ops that're used by the program itself, but also in the error handling (which is, alas, pretty close to nonexistent at the moment), the assembler, and most importantly the garbage collector. Or something the GC tickles, at least, as memory usage grows until there isn't any more, and performance grinds to a halt.

It may sound a little strange, but I like programs that kill my VMs, especially if they're small and viciously abusive. Those are the best kind, as they wrap up all sorts of nasty edge cases and stressors in one handy package. Makes debugging easier, makes testing easier, and shakes out lots of bugs in one go so you don't have to spend all that much time writing test cases, which leaves more time for debugging the code.

There's a second handiness here, in that life, if you look at it right, is a vector program. (Hence the being able to write it with no loops) If you play it with a torroidal board, as is traditional, and you have a vector rotation op (which may not be traditional, but is pretty darned simple), it's just a series of vector rotates, adds, and compares. Turns out that a generation takes 34 Tornado ops to run, regardless of the board size. And yeah, 34 lines of code for life isn't all that great, but as assembly goes, not at all bad, really.

This sorted out all sorts of GC issues, some minor flow control issues, an off-by-one error in the assembler, and a variety of bits'n'pieces.

The runtime isn't too bad either, though I can't say as I'm thrilled with it. On my spiffy new MacBook Pro (2GHz Core 2 Duo processor), for 500,000 generations, I get:

10x10: 12 sec
10x20: 16 sec
10x40: 28 sec
10x80: 50 sec

which, honestly, isn't that swell, though you can see that dealing with larger quantities of data scales pretty nicely -- the 12 seconds for a 10x10 board was pretty bad, honestly, but the steps as the board gets larger make it clear that there's a lot of overhead in the 500K generations, but the overhead doesn't grow linearly with the amount of data.

Still, you might think that 10K generations/sec on a 10x80 board still isn't that great on a 2 GHz processor, and you'd be right. It isn't, and I don't much like it. (Bleah!)

Luckily for me, there's a rabbit and hat handy. Those numbers are with GCC with no optimizations on. With -O3, the time for a 10x80 board drops to 17 seconds, which is a more respectable 29K generations. Still not as nice as I'd like, but definitely better. (This is the place where I question the algorithm I use, as well as considering Duff's Devicing the loops in the vector ops, but I'm putting that off for now, because I want to write a templating preprocessor to do that, so I can do 'em all in one go)

I should be cutting a 0.06 release in the next day or two, time permitting.

Update: Now that I go do the math, a 10x80 board has 800 cells, and at 29k gen/sec (okay, 29411.76 gen/sec) that's 23M cells/sec considered. Or about 85 cycles/cell to do the figuring on this beastie. Okay, maybe it's not all that bad, unless I screwed up my numbers.

Posted by Dan at 07:12 AM | Comments (0) | TrackBack