One thing that's really useful to have, when experimenting with new things, is an existing thing that does something like what you're trying to build. I don't mean something to copy, but rather something that you can judge your performance against as you work on an alternate implementation. There are many different ways to do a task, but it's silly to spend much time working on a way that's slow and clunky.
As a for example, as I dive back into working on Tornado, I dusted off the implementation of Conway's Game of Life. It's a nice piece to work with -- it's small and easy to get your head around, but it exercises much of the core of a VM, including array work, basic math, conditional expressions, resource allocation, and looping. It's also very easy to throw together a reference implementation of, as a way to get a handle on how the VM version runs relative to a compiled version.
Not, mind, that the VM version is likely to run as fast as the compiled version. There's overhead involved in most VM designs (without JITs, at least, and even having a JIT involves the overhead of on-the-fly compilation) so we're not looking for parity, but that doesn't mean we should settle for bad performance. So... I tossed together a quick GOL implementation and gave it a run. It's a standard version, one that calculates each cell completely sequentially. This is different from the Tornado version, which does nine offset passes, but that's fine, the results are the same.
The Tornado version does 50,000 generations of a 10x80 board in 1.45 seconds. That's with a perl wrapper, but there's only 0.045s of overhead there, which I think we can reasonably ignore.
Surprisingly, the C version does 50,000 generations of a 10x80 board in 1.58 seconds with unoptimized C. That I didn't expect, but judging against this isn't appropriate, since Tornado is compiled optimized. (Still... woo! :)
The C version, when built optimized, runs 50,000 generations of a 10x80 board in 1.28 seconds.
The C version is faster than the Tornado version, but only 12%. That... is quite nice, actually. While I expected it to be reasonably performant (the main loop to calculate a generation is only 34 VM instructions, regardless of the board size) I was expecting something more like a 40% penalty at this point in the implementation, and this is where having a benchmark implementation is important.
If the VM was running 70% or more slower than the base version, then it was time to toss out a lot of the core work. It wouldn't matter how much I liked the design, if it doesn't perform it's got to go, and the sooner that non-performant design gets thrown out the better -- the longer it lasts the more things depend on it and the harder it is to get rid of it.
So... yay for benchmark implementations. Make them and test your new implementations against them, and don't be afraid to throw out code no matter how much you like it.
Okay, so, I'm unimaginative, but I like the game of life. It's nifty, looks cool, captures the attention of people (like, say, me) who're attracted to shiny things going "Pling!", and stresses interesting parts of a system. It was one of the first real things we did for Parrot, and for the longest time it was one of the benchmark programs. Who knows, maybe it still is, I dunno.
Anyway, I've been working on a GOL for Tornado. It works too, which is cool, and it's a little surreal to do a full generation with no explicit loops in it anywhere. And, as expected, it's shaken out a number of bugs in Tornado, both in the ops that're used by the program itself, but also in the error handling (which is, alas, pretty close to nonexistent at the moment), the assembler, and most importantly the garbage collector. Or something the GC tickles, at least, as memory usage grows until there isn't any more, and performance grinds to a halt.
It may sound a little strange, but I like programs that kill my VMs, especially if they're small and viciously abusive. Those are the best kind, as they wrap up all sorts of nasty edge cases and stressors in one handy package. Makes debugging easier, makes testing easier, and shakes out lots of bugs in one go so you don't have to spend all that much time writing test cases, which leaves more time for debugging the code.
There's a second handiness here, in that life, if you look at it right, is a vector program. (Hence the being able to write it with no loops) If you play it with a torroidal board, as is traditional, and you have a vector rotation op (which may not be traditional, but is pretty darned simple), it's just a series of vector rotates, adds, and compares. Turns out that a generation takes 34 Tornado ops to run, regardless of the board size. And yeah, 34 lines of code for life isn't all that great, but as assembly goes, not at all bad, really.
This sorted out all sorts of GC issues, some minor flow control issues, an off-by-one error in the assembler, and a variety of bits'n'pieces.
The runtime isn't too bad either, though I can't say as I'm thrilled with it. On my spiffy new MacBook Pro (2GHz Core 2 Duo processor), for 500,000 generations, I get:
10x10: 12 sec
10x20: 16 sec
10x40: 28 sec
10x80: 50 sec
which, honestly, isn't that swell, though you can see that dealing with larger quantities of data scales pretty nicely -- the 12 seconds for a 10x10 board was pretty bad, honestly, but the steps as the board gets larger make it clear that there's a lot of overhead in the 500K generations, but the overhead doesn't grow linearly with the amount of data.
Still, you might think that 10K generations/sec on a 10x80 board still isn't that great on a 2 GHz processor, and you'd be right. It isn't, and I don't much like it. (Bleah!)
Luckily for me, there's a rabbit and hat handy. Those numbers are with GCC with no optimizations on. With -O3, the time for a 10x80 board drops to 17 seconds, which is a more respectable 29K generations. Still not as nice as I'd like, but definitely better. (This is the place where I question the algorithm I use, as well as considering Duff's Devicing the loops in the vector ops, but I'm putting that off for now, because I want to write a templating preprocessor to do that, so I can do 'em all in one go)
I should be cutting a 0.06 release in the next day or two, time permitting.
Update: Now that I go do the math, a 10x80 board has 800 cells, and at 29k gen/sec (okay, 29411.76 gen/sec) that's 23M cells/sec considered. Or about 85 cycles/cell to do the figuring on this beastie. Okay, maybe it's not all that bad, unless I screwed up my numbers.
So, I've released Tornado 0.05. This release, in addition to a few bug fixes, now properly handles callbacks into the embedding application. The perl module's been rev'd to handle this as well. This means you can call tornado code from your perl, you can get results back from that tornado code, and you can register perl subs as callbacks for that tornado code. Nifty!
There's one big limitation, which I'm unhappy with. For reasons that annoy me, when waiting on a handle for results, only callbacks posted in the function the handle represents will be processed. I'd hoped to have it so that callbacks from any function in that particular instance of Tornado would be processed, but I wasn't thinking clearly and put together a design that required waiting on multiple condition variables simultaneously, which just doesn't work with pthreads.
Of course, now after spending a few hours hacking that out so that each handle was independent, I realized I could have them all coordinate on a single instance-wide condition variable without much trouble. I'd rejected it for performance reasons -- it'd mean all the waiting threads would wake up and have to check to see if their individual wait condition was satisfied -- but on further reflection that was kind of silly, as there's unlikely to be many host threads waiting on the instance mutex at any one time, so the CPU time spent waking them all up and having them go back to sleep just isn't likely to be a big deal.
Anyway, this works, which is cool. Next step is either to get the Windows build working (I need to get Win-specific thread stuff in) or to get the threads-spawning-threads and messaging code in. Not sure which. I think I ought to get win-threads working first, but message-passing and thread-spawning is a lot more interesting.
I know it's an act of supreme, if likely wrongheaded, laziness, but rather than trying to figure out how to build libraries on Windows, or get libraries loading via relative paths right on OS X, I just hacked up the makefiles for the perl module and the main tornado distribution to directly reference the objects. That means you get the whole wad just linked into your perl module, and at some point I'll get library building done in a proper cross-platform way so you can have your .so / .dll / .dylib again.
In the mean time, I get to leverage perl's wad 'o platform-specific knowledge and just bodge all the objects together into whatever perl needs, and if someone twiddles with a ruby or python version they can do the same thing.
I'd feel guilty, except the module library still weighs in at under 140K. I think I can live with that, 'specially as the library built for Devel::Size is 57K and there's not nearly so much code in it.
Windows still doesn't quite build right, but if all goes well it will in the next day or so. Assuming you've got the same development environment I do (which'd be activestate perl and the .NET 2003 dev kit), and if you don't then I probably ought to make PPMs to go with the source tarball. 'Cause, what the heck, having the perl module makes testing this stuff a lot easier. Yay for dynamism and all.
(No, there's no particularly interesting insights here. This is all grunt work, though I am so looking forward to learning Windows threading model enough to get its thread stuff into Tornado...)
Just cut a release of Tornado 0.04 -- http://www.sidhe.org/~dan/things/tornado-0.04.tar.gz. More fixed bugs and whatnot, but the fun new thing now is that the perl module (in the perl subdirectory) builds and tests, and that includes assembling a sample program, invoking it, and getting the results back. This is, I think, a pretty cool thing.
Unfortunately there's still some manual bits, like you have to copy the tornado shared library into the lib/ directory in the perl module's build directory when you're building and testing, but that's only because I've forgotten the Weird Magic needed to get MakeMaker to add in library paths and so I cheat. It's probably dead-simple, but for now this is easy enough, and I can write test driver programs in perl rather than having to abuse the main.c in the main tornado source directory. (Which is terribly inflexible)
That does work out, in its own way. Use the flexible language to do flexible things, the fast (I hope) engine to do the fast things, and let the glue code sort out the interfacing. The way things really ought to be, I think.
I should know by now -- whenever I'm facing large masses of mostly-identical code it's a sign that it's time to do some templating.
I did this in tornado with the opcodes, in large part because I fully planned on having a simple oploop to start with, and more complex oploops later. I also knew I was going to need to yank out opcode info for the assembler and whatnot, so pre-processing the source made a lot of sense. (This was something that we did quite successfully with Parrot)
I didn't do it with the various MMD math ops. This is something I'm coming to regret.
Firstly, there's just a whole wad of cut'n'paste code in because of it, and that's always really error prone. Yech for errors. Fixing bugs in the boilerplate is always a pain when it's boiled over, and you know that you always find the bug in the code that's been copied everywhere but miss one or two spots in the cut'n'paste fix.
Second it really sucks when you realize that you want to roll out a functionality change, but you've got a half-zillion places you have to wedge it in because it's in the boilerplate.
Feh. Should'a done it in the first place. Now I have to go back and rip out the guts of math.c and replace it with code that handles shorter vectors on the RHS of an op for looping. And possibly the LHS too, I'm not sure. (There is the question of what you should do -- loop back to the beginning, or pad with zeroes, or just run the length of the shorter vector)
Ah, well, it's only code. No reason to not rip it out and make it better if there's an actual win to doing so, and it'll make the source code smaller too. Smaller is usually more comprehensible, and I like comprehensible.
(I really ought to cut an 0.04 release -- fixed a bug on the train and added in vector/scalar versions of the math ops. Maybe tomorrow)
Yeah, so I blew it. A required file was missing, one I hadn't added to the repository. Ouch.
Anyway, it's in, and as a bonus there are some (very) minimal docs in the docs/ directory that I've cleaned up a little, the README file is mildly useful, and the garbage collector won't collect up and kill global variables. I regenerated the ops table too, so opcodes will have constant numbers. Not a huge deal, but stability of opcode numbers is nice. I'll probably toss and redo the table and bump the version number before release, since there's a lot of crud in there that I think I'll redo when I have time. (Like there are _vv _ss and _sv versions of opcodes, which made sense until I built the MMD system into tornado, and are now mostly just annoying fiddly things)
I probably ought to dig out some subversion plugin or other that'll automatically generate a changes file.
New version at http://www.sidhe.org/~dan/things/tornado-0.03.tar.gz. Feel free to have at it, and I think the comments on the blog are even fixed. (A mixed blessing, since that means blogspam again)
(And if you snagged the tarball before 10:30PM Eastern (I think we're GMT -500) re-snag -- it now builds on linux, and I tossed the damn redundant ops and went full MMD for dispatch. We have it, so we might as well use it, and I can add in static analysis later if I want)
So, I added in rudimentary garbage collection to Tornado. Nothing fancy, but enough to clean up after itself as its running. (Not as it exits yet, though -- still leaks then)
It only builds as packaged on OS X, and requires perl (but, then, if you're on OS X you have perl) and the docs are abysmal to non-existant, but if you want to play it's at http://www.sidhe.org/~dan/things/tornado-0.01.tar.gz. It's under the BSD license, so it's pretty much a go for it, good luck sort of deal.
It ought to work on Linux if someone kicks the config system a little. I'm working next on getting it to build on Windows, and that should be interesting only because the threading's so different there.
Don't expect much threading, though. This is, right now, entirely vector oriented, with the sole exception that calling into the interpreter to execute a function spawns off a thread and that function runs in the background. The cool tornado threading stuff comes next, but one thing at a time. Good chunks of vector functionality are still missing, and It's also not too friendly to program in either, but that's OK too. It does enough for me to hang more stuff onto it, and I'm fine with that.
If you play with it, expect to make friends with gdb, though. :)
And before anyone decides to mention it, yes, it does have a big function pointer table and do indirect function calls for opcode dispatch. There's enough indirection in place to fix that later, but opcode dispatch time is, bluntly, noise for Tornado right now if it's operating on any real data. Adding together two 100K element vectors (which can be done with a single op) eats more time in memory wait time than all the opcode dispatch overhead in your entire program.
This may change as the threading ops get added in and I start doing more work with threads rather than vector data, but then I go and massage the opcode function code generator to emit a switch statement or computed goto core or something. Or bite the bullet and write a JIT. Something like that.
Yeah, yeah, I know, it's been ages. That's OK, that's what RSS feeds are for -- to remind you that those of us who've gone missing have occasionally resurfaced.
Anyway, in my Copious Free Time (which is to say barely ever) I've been abusing Tornado, trying to get the basics of the thing to work. And... now it does. Woo! I'll grant, that's the sort of Woo! that comes from spending an order of magnitude more time on a dead-simple problem than the problem actually warrants (which means, of course, I was doing threads work) but I don't care. Woo! it is.
Okay, so Tornado doesn't as yet have garbage collection, and while it has most of the vector ops in there's no internal threading yet (merely external threading; Sending a function request to an instantiated Tornado instance spawns a new thread to run it), and some of the error checking's a bit dodgy. Which means it leaks like a sieve and falls over if you stare at it funny, but it works anyway. Besides, adding a GC's only an hour or two of work, so that's no big deal.
I've gotta admit, it's kind of cool to write a program that does fairly complex work on megabytes of data, when that program's all of twenty or thirty likes of assembly code and has no explicit loops in it.
I think this beastie's nearly ready for a release for folks to fiddle with, if they're so inclined. Or not, that's OK too.
One thing I have noticed that's really odd is that printf and its ilk don't work from inside Tornado on my OS X box. This is not a huge problem (I've made friends with GDB) but it is puzzling. They work fine from the driver program but not inside the engine code that lives in the .dylib. I have no idea why, and in practice it doesn't matter (since the engine can't do any IO itself by definition) but it's weird. I expect there's some odd flag or library I need to link in, or I should've waved the chicken widdershins around my laptop. Something technical like that.
The nice thing is that there's more than just a skeleton in place. Now there's a good chunk of meat on the bones, and a framework that makes it straightforward and pretty easy (courtesy of some of the lessons I learned from getting Parrot up and going) to extend things. This makes me... happy.
Of course, I expect getting this running on Win32 will make me Unhappy again, but that's OK, I can live with that. The code I need's in example form inside the perl 5 sources (and yeah, I know, that is a scary thought). Just gotta handle thread creation and joining, mutex creation, locking, unlocking, and condition waiting/signalling/broadcasting. (Which, I realize, is done a little differently under windows, but I've already abstracted it away a lot, so I'm fine there, I think)
While I don't know that I've explicitly documented it yet, the Tornado engine has no built-in way to do disk IO, network IO, send signals, or otherwise interact with the outside world. This is on top of the engine monitoring memory use, thread creation, and potentially CPU use, with it being able to nuke a thread safely at a moment's notice. A nice sandbox, as secure and manageable as I can reasonably design in.
That's all great, since it helps provide you with a safe engine, and since we're really shooting to be able to run untrusted code that we get from someplace we don't entirely trust. Agents and calculation engines and macro programs and whatnot, and it's nice to have some assurance you're not going to have your system compromised because you downloaded and installed the WhizBang Dive Control plugin for the Aerial Deuces flight simulator.
Of course, the bad part about making it impossible to talk to the outside world is that you can't talk to the outside world. Which is, sometimes, a problem. Sometimes getting data at function start and returning data at function exit isn't enough -- you really want to call out into your environment or send back intermediate results. And while that's possible to do this way, you have to have the environment essentially send in seeker threads looking for data to return. (Which is actually kind of a cool concept, if you think about it. Damned awkward to actually write code for, though)
To get around this, Tornado has a mechanism where the environment can register callback functions that threads can invoke to do things. What the callbacks do is up to the environment, so it's a good idea to make sure that your Tornado programs match up with what your environment exports, but since we tag things that need verification we can at least check function signatures for callbacks. This way the code can have some access, and the host environment can regulate that access and enforce any sort of policy it might want. (Rate limiting, destination filtering, or whatever) When in doubt, punt, right?
This does have some issues, of course. The host does have to register and manage things for the engine, but something's gotta do that, so too bad there, especially since in the simple case it doesn't have to actually do much, just proxy the request. That's the simple case.
The complex case is when the host program actually needs to do some work to satisfy the request, and the things that makes that difficult are, of course, threads. Tornado code probably can't just invoke the callback function -- Tornado's heavily threaded and most host programs... aren't. I fully expect one of the common cases is for a single-threaded main program to embed a Tornado engine to do its thing, and even in a multithreaded case you don't generally want to have random threads calling into yourself. (Yes, I know, if you're embedding a threaded engine and providing callbacks you should be careful. But you should floss after every meal and exercise regularly too, and we all know how often people do that)
For that, then, Tornado allows you to note that a callback isn't thread-safe. If you do, Tornado gets mildly clever.
While it's not made clear to the calling program, each Tornado engine instance has a master thread attached to it. When you start an instance the thread is created, and when you make a call to have the engine do something the call's made to that thread, which then dispatches the call. When any thread running under that engine makes a callback, the callback request is routed to that master thread, which then makes the call into the embedding program but... only when the embedding program has said it's ready for the engine to execute its callbacks.
Basically the embedding program calls a Tornado function that drains the queue of waiting callback functions, so those functions get called from within the context of whatever host thread is draining the queue. (This way the embedding program can do the things that have to be done in a specific thread, like GUI or database calls, from the right thread)
Alternately the embedding program can note that a callback is always safe (because, for example, the embedding program is multithreaded and has properly protected the code), in which case Tornado code will just call them from whatever thread in the engine that happens to need to make it.
How the embedding program waits, and how it manages callbacks, is a matter for another post.
So, this morning when I got off the train it hit me that making message passing really really lightweight is a Good Thing. Yeah, I know, withhold your "Duh!"s for a moment.
For the most part, up until now, I'd been looking at the communication between threads as involving a relatively large quantity of data and, therefore, as relatively heavyweight. In that case a few extra cycles isn't that big a deal, since you're making few inter-thread calls and slinging great masses of data (well, 4K or more) between them to be processed. And that's a fine way to view things, since it matches a number of use cases well. Large quantities of data matched with large amounts of processing on that data.
But...
It doesn't work at all well when your threads are communicating a lot but not exchanging all that much data. For example, if you're running a simulation where each 'object' in your simulation is represented by a separate thread, running mostly independently, and interactions between the objects is done via passed messages. You know, like when you've got a few hundred objects moving  at different rates, with different collision avoidance algorithms, and different preferred distances with other objects.
Looking at threads like this, as producers and consumers of messages, rather than as potentially parallelized functions, has some other ramifications as well. As a for example it means that we don't want to look at messages as if they were remote function calls, since they're really not. A function may well get a message somewhere other than at the beginning of the code, and may well send a message out somewhere other than at the end.
Now, both ways are perfectly valid, and each of them has some benefits and drawbacks.
An RPC-style way of looking at threads makes for more easily auditable code. This is a good thing, since it means it's easier to do machine inspection, and it's easier for programmers to write correct code. (Not actually easy, of course. Nor likely, I suppose, but one can always hope) It also makes the implementation relatively simple, since you can make function calls and message passing share code, and that code doesn't have to be hugely efficient. The downside there is that it makes message passing less handy as a tool for communication, both because of its awkwardness and its mild inefficiency. (Though whether anyone but me cares about the inefficiency's always an open question) Coding this way tends to reduce the number of deadlocks you run into as well, which is a good thing.
A communicating agents way of looking at threads makes for more interesting code, which is good. Your code can be more flexible (or at least it's easier to be flexible) and it conforms to a relatively poorly served programming paradigm. The downside is that auditing and writing the code's potentially a lot more difficult.
Given how I want Tornado to behave and one of the places I want it well-suited to, we're heading to the lightweight message passing scheme. More of a pain for me, but I think the wins are worth it.
It seems like a truism, but no matter what you do, if you're working at a low level you're going to find parameter passing to be a major pain in the neck. (by which I mean it's easy to find a common scenario where your parameter passing scheme is going to be annoying and/or slower than it ought to be) Tornado, of course, is no different.
If you're going to make function calls, you need to pass in parameters. That's pretty much a given. And parameters, well... they're a problem.
Basically, how do you pass in parameters? And how do you get them in when you're a function? Heck, what exactly counts as a function anyway? (That is, where's your boundary for a 'real' function, as opposed to a spot you just jump to?)
There are a few ways to pass them in. If you're running on a stack-based system you can push your parameters onto the stack and jump to your function, but there are problems there if you're putting other things onto the stack like the return address. It's also potentially kind of slow, since you've got the cost of putting something onto the stack, which involves checking for stack depth as well as actually putting the data onto the stack (which can be pricey if you need to extend the stack in a segmented stack system, or runs the risk of throwing an exception on stack exhaustion in a fixed-sized stack system)
If you're in a register-based you can stuff the parameters into registers, but then there's the issue of what happens if you have too many parameters to fit into registers? Where do they overflow?
And there's always the possibility of packaging up the parameters somewhere else, in some out-of-bad location regardless of the existence of a stack. That's got the mild benefit of a register machine, since it's out of band and presumably not going to give you stack sizing issues, but you need yet another structure to store your parameters in, and getting to them requires special ops. (Presumably register access in a register machine, or stack access in either a register or stack machine already works)
Tornado's got an extra problem in that it's heavily threaded and built for speed and safety. Now, most of Tornado's data structures are write-once (which makes life ever so much easier in a threaded environment, though GC is always rearing its ugly head) which means that if we hand-wave on the GC issues we don't have to make copies of things to send to remote threads -- we can send the things themselves. And since Tornado doesn't have any global writable data (remember, we can store data into global 'slots', but it's explicitly an obliteration of the existing entry, and an atomic operation) there's precious little shared state to deal with.
I'm also just fine putting a hard limit on parameter counts for function calls. Want to call with thirty parameters? Well... too bad, you can't. Or I'm at least fine saying you can't.
The interesting thing here (well, one of them at least) is that the speed of your function calls can have a significant effect on the style of programming people use on your system, and the style of programming people use can, in turn, alter how optimizations are made for call speed.
For example, in OO programming you want function (well, method, but...) calls to be fast, because people seem to insist on having a huge number of tiny little methods. Personally I blame the lack of automatic accessors as part of the language design, but that's neither here nor there. Anyway, if you look at OO code you'll see a half-gazillion methods that are one or two lines long. Method calls have to be fast, since if they're not your code just runs horribly slowly. And yeah, it means that you will go to some lengths for this, and sacrifice speed in other places to make your methods go faster. (In this case, it means probably going for a stack-based system, since stack systems are almost universally faster for small-count calls, even if being stack based costs you performance elsewhere)
On the other hand, if you're writing code in a language that doesn't do much in the way of function/procedure calling, it doesn't make nearly as much sense to optimize for calls. Does it really matter if your call takes 10 msec instead of 5 if the language or coding paradigm you're working towards will make maybe a dozen function calls over the lifetime of a block of code? Probably not, and that's fine.
For Tornado, if you think about it, subroutines are pretty heavyweight. Yeah, sure, your sub might just be:
a = b *c
a = a % 4
a = a ^ d
but if a, b, c, and d are 1024 element arrays, well... that could take a while. And no, I've no clue if that does anything useful or not, but it's not like the guts of a gaussian blur or RSA encoding function are all that fancy if you've got automatic vector operations.
Then there's the question "what are all these subs going to do?" I mean, we've got a relatively specialized processing engine, one with a programming style that's likely going to annoy most people, so is it too out of line to figure that most of the subroutines are going to be reasonably large computationally, if not in actual program text?
This also brings message pools into the equation. (No, really, it does) Should fetching a message be the same as getting called with parameters? If so, you could arguably have a bunch of threads waiting on a thread pool's message queue and another thread posting a message is the same as making a 'remote' procedure call into a thread in the pool. If your global data's read-only, who cares where you live outside of the obvious speed and latency issues?
It's very tempting to treat all function calls as if they were messages being sent to other threads. More to the point, exposing it so that functions create messages to make a call, wait on their return, extract the return values from the return message, and get messages in when they're being called.
Tempting, but... no, at least not user-visibly. Might do it under the hood but that's just an implementation detail, and for right now we're not letting those get exposed.
Anyway, what will we do? Again, looking over the list 'o features:
Given all that, it seems clear we want to restrict messages and parameters to the same count of things. I'm thinking 16 is a good number -- certainly no more, as that's 64 bytes of data to copy for each function call. Each function activation gets its own activation record, which can store the function's pad, its registers, and the return address, and with that we won't need a stack. We're also going to strictly enforce declared in and out parameters, and leave the rest untouched, so if a call takes two parameters and returns one, only R0 and R1 go in, and only R0 is touched in the caller on return.
I think a small piece of Clever is in order here too -- when making a call the parameters go in the low registers (starting with 0) but when you're activated your parameters go in your high registers. That is, if function X takes one parameter, I put it in R0 when I make a call to X. When I'm in X, the parameter is in R16. Why? Well, assuming functions that don't have all that much code in them it means there's a good chance that most of your operations will be on the data you're passed in, and you'd rather not have to shuffle that around to keep it from being stomped on by a function call.
Messages will be limited to 16 elements as well, when they're constructed. This may end up being a bigger hassle, but for now I think it's OK. That's for another time, though.
This isn't, I'm sure, anyone's ideal scenario, but it seems like it'll handle the common case the way it ought to -- quickly, and with as little fuss as reasonable. Now to get it all implemented so it can be tested...
It's always useful to frown at your assumptions when you're doing design work, since they so often go unexamined, and often force you to build in limitations you hadn't expected in your software. (And sometimes they force you to build in features you hadn't expected, which is nice) Those assumptions also force tradeoffs, which are inevitable -- nothing's free, alas.
One of the base assumptions that Parrot had was extreme dynamism, in part because the languages it was targeted for are extremely dynamic. Code could change at runtime, variable types could change at runtime, and array sizes could change at runtime. (Indeed, the structure of your program could change while the code was inside a single opcode) That brought in all sorts of nifty things, but it also made life somewhat slow and annoying. Or relatively slow, at least, since if you embrace your assumptions rather than fighting against them you generally do pretty well.
Tornado, on the other hand, doesn't have the same sorts of assumptions that Parrot has, because it's not targeted at the same sorts of problems Parrot was. Parrot needed flexibility, while Tornado needs speed, security, and concurrency. Tornado's really more of a co-processor than a CPU -- something you call from your main program to perform some relatively specialized tasks, while you leave the general purpose code to your mainline code.
Because of that we can make some assumptions. They shoot the sort of full dynamism that perl has way down, but that's OK -- you'd call Tornado from perl, rather than running perl on Tornado. (And yeah, there's a perl interface module for it, though the engine itself doesn't yet function. Go figure)
So here are the ground rules for Tornado.
#1 means that we can check vector lengths on variable fetch and be happy. Vectors may be of indeterminate length until their first assignment or use (so we can mark a parameter as being a vector but not knowing how long the vector might be at compiletime) but once it has a size the size is fixed.
#2 means that attempts to turn an integer variable to a bigint will fail
#3 means we don't deal with mixed-type vectors. All the elements are ints, or floats, or whatever, and we can store them as efficiently as possible
#4 means that we don't carry around metadata in the bytecode. No variable names, no attributes beyond type, no annotations, nothing. Maybe we'll lift that later, but probably not. The only thing we keep around is the minimum amount of information to allow potential bytecode verification.
#5 is one of those 'duh' things -- if it's a vector it's only got one dimension. (Otherwise it'd be a tensor) This limitation'll probably be lifted later, though it affects #6.
#6 is a flat-out dodge of a lot of hassle. If you add two vectors together and one's longer than the other, what do you do? Repeat? Zero fill? One fill? Well, Tornado pitches an exception. When (well, if) vectors become tensors, then we'll probably have to do something fancier, what with the way multidimensional things operate, but that's for later.
#7 is because, well, we're not a text processing engine. That's really not the point, though I can certainly see doing Clever Things with parallel algorithms and text searching, but... maybe another time. As far as Tornado's concerned there is no text at all, and if you want to do text-y things you can do them elsewhere. (And yeah, I know, Tornado's well-suited for some specialized text processing applications, but let's face it, crypto doesn't deal with text, it deals with streams of numbers that happen to be also text. I don't think anyone's got a crypto algorithm that does Special Things with multi-codepoint characters and if they do I really don't want to know)
#8 means that if we want to turn a float to an Int4 you have to do it explicitly. None of Perl's mighty morphing power scalars. They're good for some things, but not what we do. We might allow you to do operations with two types where one type is a subset of the other (Int8 and BigInt operations, or Int4 and Float operations), but the destination assignment had darned well better be lossless. Try and store a BigInt into an Int4, or a BigNum into a Float and you lose unconditionally.
#9 is there to allow code to get around the rules for #8 -- sometimes you can know that your results will fit into an Int4, even though there are floats and bigints involved, but if you do (or don't care) then you have to force an explicit conversion.
Finally #10 means that promotion rules depend on the type of the variables involved, not on their values. If you multiply an Int4 and a BigInt you'll get a BigInt, even if the Int4 has a value of 3 and the BigInt a value of 2.
The one potential caveat here is that we may allow for temporary casting of byte vectors to other types. That is, while a vector may hold a wad of bytes, you may want to treat it as a vector of Int2, Int4, or Int8s. (No Int3s -- If you're doing video data transforms, go for a 4-element format like CMYK or RGB with an alpha channel)
Yeah, not having dynamism, and code that changes based on circumstance, and dynamically loadable code, and a handy language compiler built in are all limitations, but they're limitations I'm fine with, since they serve a purpose, and Tornado's very much not a general-purpose VM.
So, I've been busy with other things, and Tornado's been ignored of late. That's done with, and it's time to get back to the machine. The delay has let some interesting ideas kind of percolate, though, which I'm happy about. Specifically dealing with distributed computing, which actually dovetails nicely with some of the other things that Tornado does.
Tornado's basic architecture -- a strongly sandboxed VM with isolated threads communicating via passed messages -- extends nicely to a distributed system. If the system's sandboxed well it's safe to fling code around to remote machines (and safe to receive it). If individual threads are mostly isolated there's not that much time lost synchronizing threads and their data. If you have thread pools that share a message port then it's easy to fling a lot of data at a remotely hosted pool so it's ready when the threads need some.
Tornado, handily, is strongly sandboxed, its threads are pretty damn isolated, and it's got thread pools. Woohoo! :)
Silly exuberance aside, it's not really that easy. While multithreaded programming and distributed systems have some things in common, you can't really treat remote and local consumers and producers identically. Bringing in remote threads adds in a whole new set of potential failure modes (most of which are actually recoverable, which is nifty if you're building reliable systems) and of course there's the issue of latency to take into account. Sending a message to another thread in the same process may take a microsecond, sending it to a thread in another process on the same machine may take a dozen or more microseconds, and sending it to a remote thread could take milliseconds (or seconds, or possibly minutes if you're really unlucky and have lots of data). Sure you could treat it all the same, but... I dunno, I think it's a bad idea to ignore potential sources of latency.
There are also quite a number of applications where you really do want to know. If you're doing crypto work (some of the algorithms parallelize and vectorize really nicely) you'd certainly not want to be slinging plaintext or keys across the network, or if you do it's because you explicitly chose to do so. If you're doing game programming you probably don't want the latency in communicating with a remote process. On the other hand, if you're doing numerical simulation work you may well want to take advantage of remote systems -- the extra millisecond or three to send data to a remote system's trivial if you've a computation that takes minutes, hours, or days, and being able to easily bring online extra computing resources is a very useful thing.
So, anyway, that argues for designing in the capability but making it more explicit -- programs can specify that a spawned thread can be remote, or a spawned threadpool can be remote -- and the VM can manage distributing them around as need be. (Which brings up the interesting idea of welding ZeroConf into the VM to allow it to track what other execution engines are around, but that's for another time) Possibly also allowing for message ports to be specified as external resources, though there's always the issue there of malice -- you can't necessarily tell what the remote system's running, and allowing it opens up a potential network port with the possibilities for malicious bytecode to attack remote systems, which is something that Tornado's trying to prevent.
Dunno if it'll make it in. It's an interesting thing to think about, though.
Y'know, there are days I realize I should just give up trying to avoid clever things I don't want to do and just do them and get it over with.
This is one of those days. I'm working basic math into the tornado engine this afternoon. Nothing fancy, scalars only, but I've got four basic data types: 4 and 8 byte integers, 8 byte floats, and bignums. (Currently bignums are punted to the GNU MP library, but that'll go away in a while because of licensing issues) Results are the largest data type, so an Int4+Int8 produces an Int8, and an Int4+Float8 gives a Float8. (Yes, I know, there are issues of dropping bits with floats so it's not quite the same, but if you've got one float in the math already you've pretty much guaranteed things are fuzzy)
So I'm poking around in the files here and start throwing the code in. Data types all have to be figured out at runtime, since I don't want to annotate the bytecode enough to make sure that compilers aren't lying about the code that's produced (and I'm not comfortable designing an annotation system that's secure in the face of people trying to be evil), which means... well, it means switch statements. Lots of 'em. Or if ladders, but pretty much the same difference. And that's just nasty.
I really didn't want to put in MMD. I'd decided not to, actually, because of the basic type system that tornado has. (Not even any objects here) But even in the face of just a handful of types, it's easier to bite the bullet and go MMD. Which I didn't want to do.
Dammit, this sucks. (Even if it is ultimately handy) It'll probably be a lot more interesting when it comes time to dealing with the vectors, though.
One thing that always caused me no end of headaches was dealing with data and threads with Parrot. Parrot was designed to be a mostly single-threaded system with the option to spawn off multiple threads, mostly because the languages that parrot was dealing with were, for the most part, single-threaded, and where they were multithreaded they were multithreaded really badly. Tornado's different -- it's supposed to be massively multithreaded, which definitely changes the way you want to look at data.
In a mostly single-threaded case you want mutable data, since that's fastest. Having to create a new X every time you modify the old X gets slow. Lots of time's spent allocating data, copying data, and cleaning up the old copies of data. Bleah. (The Java folks are well aware of this) The downside there is that if you've data that multiple threads share you need to lock that data even if you're not going to modify it, since you can't be sure that another thread's going to modify it.
In the multithreaded case, you want immutable data, at least as much as possible. Multiple threads can access the data safely with no synchronization, since there's no way the data can change, and that's fine.
The choice of mutable vs immutable data really is one of relative threadedness -- the more multithreaded you are, the more the synchronization costs outweigh the copying and GC costs. Basically you've got to take an educated guess about how things are going to work (or trot out your particular dogma) and choose. Then you get to deal with the repercussions of that choice.
For tornado, we're going with basically immutable data. Machine operations generate new data elements, rather than reusing old ones, and variable actions are all explicitly stateless. (That is, you fetch data out of the variable store, but once you do the data you've fetched has no attachment to the variable name any more, and another thread could store something different into that variable and you'd not know, which is different than what Parrot did) That simplifies the engine design a lot, and removes many potential race conditions, which is a Good Thing.
Interestingly, this also means that the engine's pretty inherently SSA, which makes a number of bytecode optimizations possible that weren't with parrot, because parrot had such a massive number of potential side-effects. That's a separate topic, though.
Tornado's also going with a generic by-reference data scheme, that is all data is accessed by reference, and a data element has a type and payload component. That makes things a little simpler, since there's only one 'real' data type, a pointer to a data element. Going for variable-sized data elements (4 byte int, 8 byte int, float, string, whatever) would allow things to be a smidge faster, but we're constrained there by our guarantees of safety as an embedded engine, which means we've got to be a little paranoid for now, since we can't guarantee that a bytecode program is correctly generated, and we don't have the tools we need to allow for unfettered code running that'd allow us to trap illegal operations. (Damn that lack of MMU and kernel mode access!)
That's OK, we can still manage.
Here's an interesting question I've been pondering.
What if function and subroutine calls all spawned off their own threads and ran independently?
That is, what if the code:
foo = bar(1, 2, 3)
baz = xyzzy(3, 2, 1)
actually spawned off a thread to run the bar function, and a thread to run the xyzzy function? And both ran independently until they were complete?
There's a sequencing problem, of course. If the functions take a while, then this code:
a = foo()
b = bar()
c = baz(a,b)
could be rather problematic if we call baz before foo or bar have finished. On the other hand, if we know this can happen then we can design the VM or language to have tentative variables -- that is, variables that will get a value at some point, but don't yet have one, and any attempt to read from them pauses a thread until the value arrives. We'd get free sequencing that way, which is interesting and useful.
A second problem is one of global data. If any function call could spawn a new thread, and we could have a lot of threads running at once, we can't guarantee that there's any sort of synchronized access to global information -- we've got this big pot of asynchrony, after all. The easy answer there is just no global data, but that's a touch restrictive, even for me in these limited contexts. (Just barely, though)
Strongly discouraging mutable global data is a help here. If the global data is essentially static, then sequencing is irrelevant -- doesn't matter what order you read it in, the stuff's not going to change. It is very tempting to design Tornado's global data system such that you can read global data, or add a new piece of global data, but you can't change global data once it's been created. I really, really want to do this, since it'd make life ever so much easier of the only global 'changeable' things were sequence points, and I think there's a big difference between a global thing you coordinate activity on (basically a kind of POSIX condition variable) and a generic shareable wad of data.
There's a good reason for doing this thread-per-function-call thing, at least in special circumstances. Remember how Tornado's supposed to handle great wads of data in addition to heavy threading? Well, ponder this (ignore the horrid language syntax)
Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar
Declare Function Scalar xyzzy(Scalar)
bar = xyzzy(foo):iterate
What I'm trying for, awkwardly, is to note we've got a function xyzzy which takes a scalar value and returns a scalar value. We've called it with a vector, and stuffed the return into a vector. That's look wrong except for that :iterate thing there. That's supposed to indicate that, since the parameter's a vector, we actually call the function once for each element of the vector and aggregate the results. If we spawn a thread per function call... well, that could be interesting there. Yeah, in a single core single CPU system it's probably not much of a win, but... in a system with more than one CPU core handy, that could be really interesting. (Like if you've an SMP system, or one of the dual-core systems that are out, or better yet a multi-chip multi-core system -- one of those dual-CPU quad-core desktop systems people are making noises about for 2007, say)
This is where things get interesting, definitely. There's more fun to be had, too, if you want fancy -- like, for example:
Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar = [1, 2, 3]
Declare Variable Vector baz
Declare Variable Vector plugh
Declare Function Scalar xyzzy(Scalar, Scalar)
baz = xyzzy(foo, bar):iterate
plugh = xyzzy(foo, bar):permute
The first call to xyzzy would iterate through the two vectors -- first call would have (1,1), the second (2,2), the third (3,3). The second call, the one marked permute, would instead call xyzzy nine times, with all the different permutations of the elements of the two vectors. Yeah, permutation explodes out the thread set pretty badly with large vectors, but of course these are all virtual threads, since we don't have to actually spawn them, just make it look like we did.
Permutation does make a good case for going up to full-blown tensors rather than just vectors, since that permutation example could reasonably return a 3x3 matrix rather than a 9-element vector, but for right now I'm still dodging full tensors.
Now, while this is an Interesting Thing with plain integers as the scalars inside our vectors, it gets actually exciting if the scalars can be large blobs of data. If, for example, you had a crypto system written this way, each scalar in your vector could be a block of data to be encrypted or decrypted. Or if you had a rendering system written in it, each scalar could be an image map for an object to be rendered, or a chunk of text to be searched for some particular substring (or that has some regex done across it, or something). Again, on a uniprocessor single-core system, not so big a deal. On a system with multiple cores... a bigger deal.
Like I said, I wouldn't do all my general-purpose programming in this environment, since really static global data is a pain for a lot of larger applications. As a core for a subset of operations, though...
And yeah, I understand the multiple layers of humor in the title.
The VM stuff I've been thinking about's now started to coalesce, and as such I've started doing serious design work and coding. (Yes, I know, design then code, but boilerplate and general infrastructure's pretty straightforward) The engine's reasonably special-purpose, but that's fine, I knew that going in.
It's now got a name too -- the engine'll be called Tornado, since we're taking large chunks of data, whirling them around in mostly similar ways, and later flinging them out. Causing a potentially large amount of havok in the process, of course. I do have a direct application for the engine right now, so there's reason to bang this out, and when it's ready it'll be available both as a standalone embeddable library and as a perl module. If I'm feeling enthusiastic I may make a ruby library too, but we'll see how that goes.
This engine will share a few features with Parrot, though honestly not that many. Some of the techniques in the parrot build are really useful (like the separate files with a specialized mini-language for opcode functions) and I'll use those. Architecturally there are some significant differences in the problem space, and since Tornado's space is much smaller and only partially overlapping with parrot, some of the compromises that went into parrot I just don't have to make. That's kinda cool, actually. Tornado programs will also be smaller on the whole (Dunno about you, but I sure don't want to be writing whole apps in an environment geared very much towards massively threaded processing of mostly vector data) which means we can make some cheating tradeoffs for speed and safety too.
More details'll go up soon as I hash out some of the still-fuzzy bits, and the subversion repository'll likely get opened up for anonymous checkout if I can't get a timely CPAN module release.