March 22, 2004

Converging on a first approximation of performant

Or whatever the word for "not pathetically slow" is.

This is a post-mortem on one of the design decisions of Parrot, how it's mutated from the beginning, and the reasons why. In this case, it's about the stacks.

To know why something is the way it is, it's important to have a handle on the history. Sometimes, of course, that boils down to "Seemed like a good idea at the time," but those are the uninteresting ones. Just picking the answer out of the air, well... those you expect to change, unless you've got amazingly good instincts, or are darned lucky.

While Parrot was a register-based design from the start (on purpose and with forethought, but that's an explanation for another time) everyone needs stacks--even though most languages are completely lacking in stacks as an exposed fundamental and essential component.1 And one of the things that invariably happens with register designs is you need to save large wads of data to the stack. Either when making a call or having just been called, registers need to be preserved. They're your working set, and presumably the contents are important. (If not, well, why bother in the first place? :)

Parrot's got several register types. This isn't anything particularly new--most processors have at least a split between general-purpose and floating-point registers. Because of GC concerns we split the general purpose registers into integer, PMC, and String as well. I'd toyed with the idea of a flag on each register marking its type so the various routines could use it appropriately, but that seemed both error-prone and slow, since we'd be twiddling flag bits on every store, and checking them on every load and every run of the DOD. (I also worried about the possibility of security breaches on mis-set flags) We've a general-purpose stack and we could just push each register onto the stack as we needed to, but... our stack's typed, and that'd take up a lot of time. Besides, generally you'd just save all of one type of register, rather than one or two. (Since the major use of the stack would be to save temps across function calls) Thus the register backing stacks were born. This was part of the basic initial design.

With each register file having a backing stack, the next question was what to do with it. This is where the trouble we have now began.

Copying data is more expensive than not copying data. Pretty much a given, really, as it's a lot faster to not do something than it is to do something. For this I had an Aha! moment--why not have the top of register stack and the register set be the same place? That is, rather than having a spot where the registers lived for each interpreter, the registers would just be the current top-of-register-stack frame. If you have multiple frames in one chunk of the stack, well... so much is faster! Popping a register frame is just a matter of changing a pointer, and pushing a frame is also just a pointer change. A push that retains the current contents of the registers requires a memory copy, but that's OK--while you can pay the copying cost, you don't have to if you don't want to. The compiler can choose, for efficiency.

That made stack ops pretty darned fast, which was keen. The downside is that it required two pointer indirections and an add to get at a register, but no biggie, right? Well...

This was the point when Daniel Grunblatt (now he's doing web traffic analysis work--can't hurt to pop over and see if he does what you need) proved me wrong by implementing the first version of Parrot's JIT. Daniel asked the fateful question: "Can I make the JITted code specific to an interpreter instance? It'll be faster." I said yes, since most of the code parrot'll be running is single-threaded, one-shot, non-embedded stuff. There'll only ever be one interpreter for the life of the process. With that one assumption, Clever People (like, say, Daniel) will realize that if the registers are in the interpreter structure then the JIT can use absolute addresses to get to them. No math, no pointer indirection, just encode the actual memory address of the register. So Daniel asked if we could make 'em not move.

While I'm not always quick on the uptake, I can recognize a good thing when someone makes it blindingly obvious and smacks me in the head with it. We moved the register file into the interpreter structure, which threw out one pointer indirection on each register access, at the cost of requiring a memcopy to push and pop a register frame to the stack. Given that accessing the registers happens a lot more frequently than pushing and popping, and that most pushes would have a memcopy anyway, it was an obvious win. Sped up the JIT a lot (something like 10% IIRC) and as a bonus sped up the interpreter by about 3%. Whee!

We still had that chunked stack, though, that could have multiple register frames in each chunk. That still made sense, since we wanted to minimize the time required to save off a register frame, and with multiple slots in a chunk most saves were a counter test, memcopy, and counter twiddle. Which is pretty fast.

Enter Continuations. While we'd always planned on them, I didn't understand them, at all. It was a vague, handwavey "Yeah, we'll do those" thing. The folks on the LL1 list set me straight, and in time2 I became a convert. Continuations were everything we needed to wrap up state for sub calls. There's a lot involved in sub calls once you factor in security, loadable oplibs, lexical pads, and lexical global namespaces, and continuations let us wad that all up into one single thing, which was great--fewer ops, and future-proofing too, since we could add more to the context without breaking old code. (It's tough to save something you don't know about if you're forced to explicitly save individual things)

Continuations, though... those require some stack "fun". More specifically, you need to snapshot the stack at the point you take the continuation. Or, in our case, stacks. We could copy them all, but... yech. Besides being slow, we made the jump to a full continuation passing style and no matter how shallow the stacks we really don't want to have to copy them for each call. Slow, very slow. Instead we decided to use the COW system, and just mark the stack chunks as closed and read-only.

This unfortunately has three side effects. First, it means that each and every write to a stack chunk needs to check the COW status of the chunk. Ick. Second, it means that if the chunk isn't empty (which is likely) we'll need to copy it if we do anything to it--push or pop. Double Ick. Third, it means that most stack chunks will be at least partly empty, which is wasteful--exactly what the original design was meant to avoid. (And we really bodged up the stack chunk COW too, so we were actually busted for quite a while)

That leaves us where we are today. We have a design that was originally very efficient for what it did but that changed, because of changing requirements, into one that's slow and wasteful. Luckily we found out about it before we got to the Pie-thon, so it can be fixed before pastry flies, which is good. Doubly lucky, we can actually fix it without any bytecode knowing the difference, which is also good. I like behind-the-scenes fixes. Yay, good layers of abstraction.

FWIW, the new scheme, which is being twiddled on now, is to go to a one-frame-per-chunk stack--that is, each push to the stack allocates a new element. The nice thing here is that it totally eliminates the need for COWing the stack--pushes always allocate a new element which is guaranteed writable, while pops always fiddle with the stack pointer in the interpreter struct which is also always guaranteed writable. Skips the COWing altogether, which takes out a full code path in the push/pop, and since we can allocate frames using the PMC arena allocation code (which is also fast, something like 6-8 cycles to allocate a new PMC if the free list isn't empty) we don't have to write nearly so much code.

Stack ops still aren't free, but, well... what is?

1 Which is weird when you think about it. Phenomenally useful for implementation of a language and yet totally unused in the language itself. Go figure.
2 Though not in time for the first edition of Perl 6 Essentials. D'oh!

Posted by Dan at March 22, 2004 03:55 PM | TrackBack (0)
Comments

Thanks for the background to the decision. I had been trying to wrap my brain around many of these things recently and had come to similar conclusions, but was really curious why the stack top was not used directly for the registers (it seemed like a good idea to me). The JIT part makes a great deal of sense.

Posted by: Matt Fowles at March 23, 2004 12:27 AM

"not in time for the first edition" - indeed not. And the first edition seemed so sure of itself on this matter.

Posted by: Nicholas Clark at May 10, 2004 05:33 PM