I was pointed at the mono list (something I generally avoid through sheer lack of time, and the rise in blood pressure it usually evokes) the other day and went reading through some of the discussion on getting ruby ported over to .NET. There's a whole lot of that discussion that I'm going to skip, but one of the things that came up was the eternal "register vs stack machine--which is better" question.
Now, for Mono, of course, there was no question. It's a stack machine and they had absolutely no choice in that. Doesn't make any difference what they wanted, they were copying .NET, and .NET is a stack machine. Which doesn't change the validity of the question, though it does shift who you should really ask about it.
There is no real doubt, it's easier to generate code for a stack machine. Most freshman compiler students can do that. Generating code for a register machine is a bit tougher, unless you're treating it as a stack machine with an accumulator. (Which is doable, albeit somewhat less than ideal from a performance standpoint) Simplicity of targeting isn't that big a deal, at least not for me, in part because so few people are actually going to directly target it--I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about? The numbers are small. The other issue there is that many of the folks with compiler knowledge already are comfortable targeting register machines, as that's what all hardware CPUs in common use are.
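To make the "freshman exercise" point concrete, here's a toy sketch (Python, with invented op names--nothing from Parrot or .NET) of why stack code generation is so easy: it's just a post-order walk of the expression tree.

```python
def gen_stack(node, out):
    # node is either a plain operand, or a tuple ("opname", left, right).
    # Post-order walk: emit code for both operands, then the operator.
    if isinstance(node, tuple):
        op, left, right = node
        gen_stack(left, out)        # code to push the left operand
        gen_stack(right, out)       # code to push the right operand
        out.append(op)              # operands are now the top two stack slots
    else:
        out.append(("push", node))  # leaf: just push it
    return out
```

No temporaries to allocate, no register numbers to hand out--the stack does the bookkeeping for you, which is exactly why it's the easy target.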
But let's, for a moment, consider the two main cases of execution. First, there's interpretation. This requires very little in the way of knowledge of the target machine, and is something that Parrot must do. (We have a very wide range of machines we must run on and, no offense, we're probably never going to write a JIT for the Cray Y-MP) There's no real runtime cheating that one can do, as almost all the runtime cheating requires generating native machine code, and for this case we're not doing that.
In both the stack and register case, accessing a register requires an indirect memory access and at least one math operation, either addition or subtraction. For a register machine you're adding the register number to the register file base address to get an effective memory address, and for the stack machine you're loading indirectly from the stack pointer address, then incrementing or decrementing that pointer and storing the difference. The register case saves in memory activity a bit, as there's no need to store the result, while the stack machine has to store the result, but on the other hand the register machine needs more bandwidth to get the register numbers into it, something that's implicit for the stack machine. Whether the additional bandwidth offsets the write is up in the air and system dependent, as it depends on cache access size (the registers and register pointer for the RM, and the stack pointer and top few stack entries for the SM, will be in L1 cache) and sync time--memory writes, in an SMP system, require a bit of cache coherence maintenance, which will sometimes burn a nanosecond or two.
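Here's a toy illustration of the two access patterns just described--Python standing in for the C an interpreter would really use, so the memory traffic is only mimicked, not measured.

```python
def stack_add(stack):
    # Stack machine: two indirect loads through the stack pointer, with a
    # pointer decrement for each, then a store of the result back to the
    # stack (and a pointer increment).  Python's list does the pointer
    # bookkeeping for us, but the writes are the point.
    b = stack.pop()        # load via stack pointer, adjust pointer
    a = stack.pop()        # load via stack pointer, adjust pointer
    stack.append(a + b)    # store result, adjust pointer
    return stack

def register_add(regs, dest, src1, src2):
    # Register machine: the register numbers arrive as constants in the
    # instruction stream, so each access is register-file base address
    # plus a constant offset, and no pointer needs updating afterward.
    regs[dest] = regs[src1] + regs[src2]
    return regs
```

The extra operands in `register_add`'s signature are the extra instruction-stream bandwidth the paragraph mentions; the `pop`/`append` pair is the stack machine's extra memory write.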
The interpreted case isn't one that .NET really cares about, which is fine, as .NET bytecode is very obviously designed to be JITted. It's the only way to get anything close to adequate performance from it. This isn't a dig at the .NET folks--I actually think it's a very sensible design decision, given the very limited range of machines (x86 and maybe Itanium) they personally were going to care about. Parrot does need to consider it, which arguably hampers us some, but on the other hand it means we don't need a team of Highly Paid Professionals to get us up and running at reasonable speed on some random processor. (And on the third hand, there just aren't that many different processors any more. Unfortunately)
In the JITted case, things are a little different. And better for parrot, oddly enough. (Surprised me) Now, when we JIT (and I should do a WTHI entry on JITting) we turn the interpreter's bytecode into native machine code. With JITting, we need to consider both the naive case, where we're doing a mechanical translation, and the clever case, where we're doing some analysis of the incoming bytecode.
There is, first and foremost, a general win with JITting. Just inlining the code for the different machine ops buys a fair amount, as you cut out the function pre- and post-ambles on the called op, and the parameter setup and teardown on the function call. If you have small opcode bodies this can be a considerable win. (Factor of 10 type win, which is a lot)
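A sketch of that general win, with Python functions standing in both for the op bodies and for the native code a JIT would emit (the ops and register layout here are made up for illustration):

```python
def op_add(regs):
    # One opcode body.  In the interpreted case, reaching this costs a
    # dispatch, a function call, prologue/epilogue, and argument setup.
    regs[0] = regs[1] + regs[2]

def op_mul(regs):
    regs[3] = regs[0] * regs[0]

def interpret(program, regs):
    # The interpreter's dispatch loop: per-op call overhead every time.
    for op in program:
        op(regs)
    return regs

def jitted(regs):
    # The "JITted" version: the same two op bodies pasted inline, so the
    # dispatch loop and all the call overhead simply vanish.
    regs[0] = regs[1] + regs[2]
    regs[3] = regs[0] * regs[0]
    return regs
```

Same answer either way; the difference is purely the per-op overhead, which is why small opcode bodies see the biggest win.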
In the naive case, the stack machine still twiddles its stack pointer and accesses all its data indirectly. It has no real option, as we're being naive. That pointer's probably moved from the interpreter structure to a register, which is good, though it might've been in one already. (Probably was) It does get the general JIT win, which is a Good Thing, and quite the speedup. Still, indirect memory access takes a bit of time, though far less than it used to. (These days it might put a stall entry in the pipeline, which may be used by other ops in the stream, where in the old days an indirect memory access might cost you an extra cycle or two)
The register machine, though... The register machine has suddenly gone from indirect access to direct access to the memory, and the instruction stream has gotten denser. Remember, the register number is a constant. Also remember that the registers themselves don't move. And consider that in the common case you have only one interpreter, so there's no reason to not JIT the code just for it. Which adds up to turning register numbers into absolute memory addresses. Not even indirect ones--no "base pointer plus 4" or anything, but direct "load or store from absolute address X". The register file doesn't move for an interpreter, so we know when JITting where each register is, so we just encode it in the instruction stream. Yeah, it does mean that we need to re-JIT for each interpreter, but as I said the non-threaded case is the common one for us. Still, it's all memory access.
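A minimal sketch of that address folding--the base address and register size here are invented, since the point is just that "register N" collapses to a constant absolute address at JIT time:

```python
REG_SIZE = 8  # assume 8-byte register slots, purely for illustration

def jit_operand(reg_file_base, reg_num):
    # At JIT time the register file's base address is known and fixed for
    # this interpreter, so base + (register number * slot size) folds to a
    # single constant that gets encoded right into the emitted instruction.
    return reg_file_base + reg_num * REG_SIZE
```

Do that once per operand while emitting code and every later execution is a direct "load from absolute address X"--no base-plus-offset arithmetic left at runtime. The cost, as noted above, is that the addresses are only good for the one interpreter you JITted for.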
In the clever case the overhead for a stack machine is lessened some. The incoming instruction stream can be analyzed, so a sequence like:

push a
push b
add
can be turned into "put a in a native register, b in a native register, add 'em, and put the result back on the stack". Which is what a basic optimizing compiler does, and what a JIT in the clever case is--an optimizing compiler. There's less stack pointer twiddlidge in this case, as you can skip twiddling it for everything but the final push, and if that intermediate result is used somewhere else and not stored you might be able to dodge that stack access too.
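A minimal sketch of that kind of cleverness: a peephole pass over an invented push/push/add opcode triple, collapsing it into a single register-style op so only the final result touches the stack.

```python
def peephole(ops):
    # ops is a list of tuples like ("push", operand) or ("add",).
    # These opcode names are invented for illustration, not .NET CIL.
    out, i = [], 0
    while i < len(ops):
        if (i + 2 < len(ops) and ops[i][0] == "push"
                and ops[i + 1][0] == "push" and ops[i + 2] == ("add",)):
            # Collapse the triple: load both operands into native
            # registers, add, push only the result--no stack pointer
            # twiddling for the intermediates.
            out.append(("reg_add_push", ops[i][1], ops[i + 1][1]))
            i += 3
        else:
            out.append(ops[i])
            i += 1
    return out
```

This is the "basic optimizing compiler" move in miniature--and the catch, as the text says, is that it runs at JIT time, every time.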
There's a good win for the clever case for register machines, though. (Unfortunately we lack the personnel to do it for Parrot--we have the people who can, just not the cash to actually let them do it and still eat.) To get that win for the stack machine you need to analyze the instruction stream and pick out what the temporaries are. In the register machine case, those temporaries are explicit--that's what the registers are, a set of temporary values. More to the point, it's easier to track register usage in the common cases than it is intermediate stack results. (Not, mind, that either is tough, but since JITting is done at runtime, every millisecond counts)
Finally, some of the compile-time optimizations can be done at bytecode generation time for the register machine, where they have to be done at JIT time for the stack machine. Consider this:
c = a + b
d = a + 4;
e = b * 3;
Five variables, three statements. We'll assume, for the moment, that all the variables are 'objects' of some sort, and not just plain ints. What's the stack machine instruction stream?
19 ops, 9 parameters. And for the register stream?
getvar R1, "a"
getvar R2, "b"
getvar R3, "c"
getvar R4, "d"
getvar R5, "e"
add R3, R1, R2
add R4, R1, 4
mul R5, R2, 3
8 ops, 19 parameters. (oddly enough, one fewer total thing in the instruction stream than in the stack machine case, though not a win as such for parrot as each thing for us is 32 bits, where each op for .NET is 8 bits, so we're using 108 bytes to .NET's 55 (assuming parameters are 32 bits, which may be inaccurate for .NET--the small ints might be smaller, but we can make them all 2 million to make sure they're 32 bits if you really want)) And note that we only fetch the pointers for a, b, and c once, and don't have to do an explicit "put the stack top into variable" op that a stack machine does. (Though you could certainly put the address to store the result on the stack as part of the add, rather than putting the result of the add on the stack. .NET doesn't do that, though)
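Spelling out that byte-count arithmetic, with the same assumptions as above (everything 32 bits for Parrot; 8-bit ops and 32-bit parameters for .NET):

```python
# Parrot: 8 ops + 19 parameters, each a 32-bit (4-byte) thing.
parrot_bytes = (8 + 19) * 4

# .NET: 19 one-byte ops plus 9 parameters at 4 bytes each.
dotnet_bytes = 19 * 1 + 9 * 4
```

So 108 bytes against 55--fewer things in the stream for the register machine, but fatter ones.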
In the naive case, we JIT and execute less than half the ops that the stack machine does, and have a denser instruction stream for it. In the clever case, the JIT for the stack machine can figure out that a, b, and c are used multiple times and skip the multiple fetches, certainly, but that figuring needs to be done at runtime, every time the program is run, where in the register machine it's done at compile time, and every time the program is compiled. Generally we don't compile more often than we run, and usually less. (Often much less)
Now, there are some vagaries of Parrot and .NET specifically to contend with--with enough effort the fairly obvious inadequacies of either system can be dealt with, and I'm sure someone ("Hi, Miguel!") will chime in with specifics. Parrot's got a good win over .NET for interpretation, as that's what we're designed for, while .NET's got a simplicity win of sorts in the JIT case, and, on Windows at least, has a huge personnel win over parrot. (Rag on Microsoft all you want, their compiler guys are first rate) Mono's got a smaller win, though still a win there.
Still, it ought to be fairly obvious that register machines aren't the vile and horrible things people make them out to be.

Posted by Dan at May 14, 2003 09:01 AM