May 14, 2003

Registers vs stacks for interpreter design

I was pointed at the mono list (something I generally avoid through sheer lack of time, and the rise in blood pressure it usually evokes) the other day and went reading through some of the discussion on getting ruby ported over to .NET. There's a whole lot of that discussion that I'm going to skip, but one of the things that came up was the eternal "register vs stack machine--which is better" question.

Now, for Mono, of course, there was no question. It's a stack machine and they had absolutely no choice in that. Doesn't make any difference what they wanted, they were copying .NET, and .NET is a stack machine. Which doesn't change the validity of the question, though it does shift who you should really ask about it.

There is no real doubt: it's easier to generate code for a stack machine. Most freshman compiler students can do that. Generating code for a register machine is a bit tougher, unless you're treating it as a stack machine with an accumulator. (Which is doable, albeit somewhat less than ideal from a performance standpoint) Simplicity of targeting isn't that big a deal, at least not for me, in part because so few people are actually going to directly target it--I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about? The numbers are small. The other issue is that many of the folks with compiler knowledge are already comfortable targeting register machines, as that's what all hardware CPUs in common use are.

But let's, for a moment, consider the two main cases of execution. First, there's interpretation. This requires very little in the way of knowledge of the target machine, and it's something that Parrot must do. (We have a very wide range of machines we must run on and, no offense, we're probably never going to write a JIT for the Cray Y-MP) There's no real runtime cheating that one can do, as almost all the runtime cheating requires generating native machine code, and for this case we're not doing that.

In both the stack and register case, accessing a register requires an indirect memory access and at least one math operation, either addition or subtraction. For a register machine you're adding the register number to the register file base address to get an effective memory address, and for the stack machine you're loading indirectly from the stack pointer address, then incrementing or decrementing that pointer and storing the difference. The register case saves in memory activity a bit, as there's no need to store the result, while the stack machine has to store the result, but on the other hand the register machine needs more bandwidth to get the register numbers into it, something that's implicit for the stack machine. Whether the additional bandwidth offsets the write is up in the air and system dependent, as it depends on cache access size (the registers and register pointer for the RM, and the stack pointer and top few stack entries for the SM, will be in L1 cache) and sync time--memory writes, in an SMP system, require a bit of cache coherence maintenance, which will sometimes burn a nanosecond or two.

The interpreted case isn't one that .NET really cares about, which is fine, as .NET bytecode is very obviously designed to be JITted. It's the only way to get anything close to adequate performance from it. This isn't a dig at the .NET folks--I actually think it's a very sensible design decision, given the very limited range of machines (x86 and maybe Itanium) they personally were going to care about. Parrot does need to consider it, which arguably hampers us some, but on the other hand it means we don't need a team of Highly Paid Professionals to get us up and running at reasonable speed on some random processor. (And on the third hand, there just aren't that many different processors any more. Unfortunately)

In the JITted case, things are a little different. And better for parrot, oddly enough. (Surprised me) Now, when we JIT (and I should do a WTHI entry on JITting) we turn the interpreter's bytecode into native machine code. With JITting, we need to consider both the naive case, where we're doing a mechanical translation, and the clever case, where we're doing some analysis of the incoming bytecode.

There is, first and foremost, a general win with JITting. Just inlining the code for the different machine ops buys a fair amount, as you cut out the function pre- and post-ambles on the called op, and the parameter setup and teardown on the function call. If you have small opcode bodies this can be a considerable win. (Factor of 10 type win, which is a lot)

In the naive case, the stack machine still twiddles its stack pointer and accesses all its data indirectly. It has no real option, as we're being naive. That pointer's probably moved from the interpreter structure to a register, which is good, though it might've been in one already. (Probably was) It does get the general JIT win, which is a Good Thing, and quite the speedup. Still, indirect memory access takes a bit of time, though far less than it used to. (These days it might put a stall entry in the pipeline, which may be used by other ops in the stream, where in the old days an indirect memory access might cost you an extra cycle or two)

The register machine, though... The register machine has suddenly gone from indirect access to direct access to the memory, and the instruction stream has gotten denser. Remember, the register number is a constant. Also remember that the registers themselves don't move. And consider that in the common case you have only one interpreter, so there's no reason to not JIT the code just for it. Which adds up to turning register numbers into absolute memory addresses. Not even indirect ones--no "base pointer plus 4" or anything, but direct "load or store from absolute address X". The register file doesn't move for an interpreter, so we know when JITting where each register is, so we just encode it in the instruction stream. Yeah, it does mean that we need to re-JIT for each interpreter, but as I said the non-threaded case is the common one for us. Still, it's all memory access.

In the clever case the overhead for a stack machine is lessened some. The incoming instruction stream can be analyzed, so a sequence like:

push a
push b
add

can be turned into "put a in a native register, b in a native register, add 'em, and put the result back on the stack". Which is what a basic optimizing compiler does, and what a JIT in the clever case is--an optimizing compiler. There's less stack pointer twiddlidge in this case, as you can skip twiddling it for everything but the final push, and if that intermediate result is used somewhere else and not stored you might be able to dodge that stack access too.

There's a good win for the clever case for register machines, though. (Unfortunately we lack the personnel to do it for Parrot--we have the people who can, just not the cash to actually let them do it and still eat.) To get that win for the stack machine you need to analyze the instruction stream and pick out what the temporaries are. In the register machine case, those temporaries are explicit--that's what the registers are, a set of temporary values. More to the point, it's easier to track register usage in the common cases than it is intermediate stack results. (Not, mind, that either is tough, but since JITting is done at runtime, every millisecond counts)

Finally, some of the compile-time optimizations can be done at bytecode generation time for the register machine, where they have to be done at JIT time for the stack machine. Consider this:

c = a + b;
d = a + 4;
e = b * 3;

Five variables, three statements. We'll assume, for the moment, that all the variables are 'objects' of some sort, and not just plain ints. What's the stack machine instruction stream?

push 'a'
getvar
push 'b'
getvar
add
push 'c'
store
push 'a'
getvar
push 4
add
push 'd'
store
push 'b'
getvar
push 3
mul
push 'e'
store

19 ops, 9 parameters. And for the register stream?

getvar R1, "a"
getvar R2, "b"
getvar R3, "c"
getvar R4, "d"
getvar R5, "e"
add R3, R1, R2
add R4, R1, 4
mul R5, R2, 3

8 ops, 19 parameters. (oddly enough, one fewer total thing in the instruction stream than in the stack machine case, though not a win as such for parrot as each thing for us is 32 bits, where each op for .NET is 8 bits, so we're using 108 bytes to .NET's 55 (assuming parameters are 32 bits, which may be inaccurate for .NET--the small ints might be smaller, but we can make them all 2 million to make sure they're 32 bits if you really want)) And note that we only fetch the pointers for a, b, and c once, and don't have to do an explicit "put the stack top into variable" op that a stack machine does. (Though you could certainly put the address to store the result on the stack as part of the add, rather than putting the result of the add on the stack. .NET doesn't do that, though)

In the naive case, we JIT and execute less than half the ops that the stack machine does, and have a denser instruction stream for it. In the clever case, the JIT for the stack machine can figure out that a, b, and c are used multiple times and skip the multiple fetches, certainly, but that figuring needs to be done at runtime, every time the program is run, where in the register machine it's done at compile time, and every time the program is compiled. Generally we don't compile more often than we run, and usually less. (Often much less)

Now, there are some vagaries of Parrot and .NET specifically to contend with--with enough effort the fairly obvious inadequacies of either system can be dealt with, and I'm sure someone (Hi, Miguel!) will chime in with specifics. Parrot's got a good win over .NET for interpretation, as that's what we're designed for, while .NET's got a simplicity win of sorts in the JIT case, and, on Windows at least, has a huge personnel win over parrot. (Rag on Microsoft all you want, their compiler guys are first rate) Mono's got a smaller win, though still a win there.

Still, it ought to be fairly obvious that register machines aren't the vile and horrible things people make them out to be.

Posted by Dan at May 14, 2003 09:01 AM | TrackBack (4)

A paper from the Inferno folks about the design of their virtual machine touches on the stack vs. registers topic.

Posted by: Daniel Dunbar at May 14, 2003 10:18 AM

Some systems do both :-)

Poplog has a high-level language independent stack based VM (the Poplog Virtual Machine - PVM). This is then compiled down to a low-level platform independent VM (Poplog Implementation Machine - PIM) that's more register minded (kind of VAXish CISC machine). The PIM is then compiled down to machine code.

Poplog uses this scheme to implement CLOS, ML, Prolog and Pop-11 (with incremental compilation to boot).

Posted by: Adrian Howard at May 14, 2003 10:35 AM

"What The Heck Is: the Stack vs. Register Holy War"

Interesting points... I'd like to read more about the state of the art in register-allocation algorithms versus the work done in stack-machine architectures (peephole optimizations, top-N-of-stack caching tactics, inlining, control-flow joining). Any suggestions?

Posted by: Gnomon at May 14, 2003 12:13 PM

Well, the Bell Labs report above is a good place. Beyond that the problem is that most compiler literature and research targets register machines, since that's what the hardware is, and any stack stuff is generally dealt with as part of the HIR or AST the parser builds. That'll probably change as time goes on, what with the emphasis on the JVM and .NET that's happening, but...

If someone else has some good references I'd love to have them as well, for the next time I have to do this.

Posted by: Dan at May 14, 2003 12:24 PM

I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about?
Well, there are hundreds of languages out there that just need a VM to run on:-) Seriously, .net allows the creation of code at runtime, so it's not just compiler writers, but anyone who wants a specialized version of a function, for example. Regular expressions can be compiled to IL code, as well as the code needed to serialize/deserialize objects.
...very limited range of machines (x86 and maybe Itanium)..
The MS people have a version of the CLR that runs on more processors than that. We at mono will have a complete jit for ppc soon and one for ARM has been started as well (and more are talked about:-).
More to the point it's easier to track register usage in the common cases than it is intermediate stack results.
Well, an intermediate stack result can be thought of as a register that can be assigned just once, while in the register machine you have to scan all the code in the method to check if the value in the register is overwritten...
Anyway, as I also wrote in the mail, if you're designing an interpreter, a register machine makes sense for execution speed reasons (code generation for it is still an issue, though, as well as the compatibility issues that may arise with time with the call conventions).
But if you decide that the primary execution engine is a JIT (with x86, ppc and sparc covering 99% of the market) the more compact and simpler stack bytecode is a big win, because you consider it just as an intermediate language. When you have the JIT, there is no evaluation stack to care about anymore:-)
The inferno paper is interesting, but it focuses mostly on interpretation. The only argument it gives against a JIT for an SM is the need to track the types (the JVM has untyped local variables). On the CLR the locals are typed, so there is no such issue. And what it doesn't address is the issue of mapping the VM registers to hardware registers: most of the time the latter are scarce (think of the ugly x86 platform that dominates 90% of the market...). Since that issue is the common case, there is little difference between an SM and an RM: both have to do the same kind of complex register allocation to perform well.

Posted by: lupus at May 14, 2003 02:28 PM

For pure, non-JIT interpretation, if your register decoding is simple, the decrease in opcodes is in my experience a huge win on any significantly-pipelined processor, because of the high cost of mispredicted branches on instruction dispatch (which are extremely common even if every instruction has its own dispatch point).

I believe I first saw this pointed out by John Carmack when he was talking about his implementation of Quake C, which was a little C-like language for "scripting" in the game Quake back in 1996 or so.

Posted by: Sean B at May 15, 2003 04:23 AM

There are very few people who ever write compilers to target any machine--to a good first approximation that number is zero. (Which, I suppose, makes both you and me imaginary, Paolo... :) Even factoring in the 'toy language' designers, the number is very, very small. The number of people who use compilers or the output of compilers, though, is very large. Folks writing serializers or regexes don't touch the underlying code. And, for those people who are actually writing bits of bytecode assembly, well, it's been my experience that a register system's much easier to handle when writing manual code. YMMV, of course.

As for .NET's targets, well.... Microsoft is, first and foremost, interested in the x86. Period. Their .NET runs under Windows, and Windows runs on the x86. (And yes, I am aware of the late, lamented NT ports to MIPS and Alpha, may they rest in peace) Maybe on the Itanium. Some day. I am aware of the interest in, and the engine for, some of the handheld devices, but the performance of most (if not all) handhelds is just too damn slow to be a viable target. That doesn't mean that they haven't ported the .NET engine to other platforms (and certainly doesn't take away from your impressive effort to get a .NET engine running in many places) just that the other platforms aren't important enough to demand any sort of attention be paid to them in the design process. And it's the design process for the core of .NET in the first place that's important here, something that's long over.

JITting for a register-rich machine with a register-rich IL (which one could consider a register-based VM, the same way that a stack-based VM is a stack IL) is relatively easy, in part because a lot of the optimization decisions have been made, and in part because less information has been lost. While it's certainly possible to reconstruct that information by code analysis, or annotate the bytecode with it (not that you can necessarily trust the annotations in a secure environment, as people both make mistakes and lie), that's a non-trivial task and one that takes a good amount of JIT time and effort. While it takes a similar amount of time for a register machine, it gets done at compile time, and thus once, rather than at run time, and thus every time the JIT runs over the bytecode. (OTOH, you may generally feel that Moore's law makes the time irrelevant, and that's fine. I'd disagree, but that's also fine)

You are right, though--when translating to the register starved x86 there's often not a huge amount of difference in speed in the post-JITted code, regardless of whether it's from a stack or register VM. It's pretty much a wash. On the other hand, on a register rich machine the register VM has a decided advantage. (And even the x86 world is, slowly and painfully, moving to a register rich, or at least not pathetically under-endowed, model)

So another way of looking at it is that there's no win either way on the x86, but a register win on the non-x86, so why not go register based? A win is, after all, a win.

Posted by: Dan at May 15, 2003 02:12 PM