March 30, 2004

Adventures in Garbage Collecting

One of the nice things about being within driving distance of people who're phenomenally smarter than you are is that you sometimes get the benefits of their research. (Though Citeseer's darned nice too)

For the interested, there's a seminar on garbage collection at MIT on Monday April 5th 2004. The announcement follows:

"A (Grand?) Unified Theory of Storage Reclamation"
Speaker: David F. Bacon
Host: Professor Martin Rinard
Host Affiliation: Computer Architecture Group

Date: 4-5-2004
Time: 3:30 PM - 5:00 PM
Refreshments: 3:15 PM
Location: Gates 7th Floor Lounge

The two basic forms of automatic storage reclamation, tracing and reference counting, were invented at the dawn of the high-level language era over 40 years ago. Since then there have been many improvements and optimizations, but all systems are based on one or the other of these methods, which are uniformly viewed as being fundamentally different and possessing very distinct performance properties. We have implemented high-performance collectors of both types, and in the process observed that the more we optimized them, the more similarly they behaved -- that they seem to share some deep structure.

We present a formulation of the two algorithms that shows that they are in fact duals of each other. Intuitively, the difference is that tracing operates on live objects, or "matter", while reference counting operates on dead objects, or "anti-matter". For every operation by the tracing collector, there is a precisely corresponding anti-operation by the reference counting collector.

Viewed in this light, we show that all high-performance collectors (for example, deferred reference counting and generational collection) are in fact hybrids of tracing and reference counting. We are developing a uniform cost-model for the collectors to quantify the tradeoffs that result from choosing different hybridizations of tracing and reference counting. This will allow the correct scheme to be selected based on system performance requirements and the expected properties of the target application.

Dunno if I'll make it, but I'm sure going to try...

Posted by Dan at 03:10 PM | Comments (6) | TrackBack

March 23, 2004

Things to not do in code #18

I oughta collect these all up, but...

Never, ever do in two allocations what you can do in one. For example:

  struct foo {
      struct bar *thing;
      int this;
      int that;
  };

  struct bar {
      int data[128];
  };

See the problem there? What happens is that you allocate a foo, then allocate a bar and put a pointer to it in the foo. Which makes sense if bar is of variable length, but... in this case it isn't. Even worse is the case where bar has a pointer to a glob of memory which is of fixed size.
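In allocation terms, that pattern works out to something like this (a sketch; the constructor is made up, but the shape is the point):

  #include <stdlib.h>

  struct bar { int data[128]; };
  struct foo { struct bar *thing; int this; int that; };

  /* made-up constructor, just to show the two allocations and the extra
     error path they drag along */
  struct foo *new_foo(void) {
      struct foo *f = malloc(sizeof *f);        /* allocation number one */
      if (!f)
          return NULL;
      f->thing = malloc(sizeof *f->thing);      /* allocation number two... */
      if (!f->thing) {                          /* ...and the extra failure case */
          free(f);
          return NULL;
      }
      f->this = 0;
      f->that = 0;
      return f;
  }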

Why does this suck?

First, it means you've got multiple allocations. Two or three, instead of one. That messes up your free list more.

Next, more allocations means more chance of failure. If you've got one big damn structure then allocate it as one big structure. Breaking it into pieces that are all required just means you've more chances to screw up the error handling.

Third, you waste memory. That thing pointer in the foo struct's completely irrelevant. Sure, it's only 4 or 8 bytes, but that's about a third of the structure in our case. (Which isn't even that degenerate as these things go) Adds up.

Fourth, you waste time. Getting to bar requires an extra pointer indirection. A lot of time? No. But some, and it's difficult for the compiler and CPU to predict. You've a good chance of adding a half dozen cycles to access elements of bar because of it.

Fifth, you waste more time. Different allocations have a good chance of being in different areas of memory, especially if they're of significantly different sizes. (Many memory allocators have separate free lists for different sizes of memory, so a 32 byte allocation comes from a different pool than a 1024 byte one) That means you lose any locality of reference and odds are you'll have to fault at least two pages from main memory into L2 cache, and guaranteed two separate cache lines into L1. The L1 fault may not be any different, though, depending on how much of each structure you access, so you may pay for the same number of L1 faults, but... You're also using more L1 cache because of the extra pointer.

Yeah, yeah, there's always that 'Well, it's conceptually clean' argument, or the "Separate things should be separate!" one. If that's actually true, great. Separate them. If they aren't, then don't, dammit! If it's a big single thing, make it a big one. Want to make it easy for yourself by breaking it up? Fine. In that case, the declaration above should be:

  struct bar {
      int data[128];
  };

  struct foo {
      struct bar thing;
      int this;
      int that;
  };

See the difference? (Note that bar now has to be declared before foo, since foo embeds one rather than just pointing at one; otherwise the compiler will squawk about an incomplete type.) foo contains bar, rather than points to bar. If that's how the structure's modelled, then that's how it should be declared. You still get that encapsulation you want, without the costs.

If you're tempted to launch the "Well, what if I need to refactor it?" retort, don't. If you were worried about that, then allocation and access would already be factored out into separate routines so it wouldn't matter. And if you haven't done that, trust me, the compiler will pitch a major fit anywhere you miss changing dereference punctuation to structure-access punctuation.
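For the curious, that factoring looks something like this (again a sketch, with made-up names):

  #include <stdlib.h>

  struct bar { int data[128]; };
  struct foo { struct bar thing; int this; int that; };   /* bar embedded */

  /* one allocation, one failure check, one free */
  struct foo *foo_new(void)              { return calloc(1, sizeof(struct foo)); }
  void        foo_destroy(struct foo *f) { free(f); }

  /* callers go through this, so they never know (or care) whether thing
     is embedded or pointed to */
  struct bar *foo_bar(struct foo *f)     { return &f->thing; }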

Posted by Dan at 11:58 AM | Comments (4) | TrackBack

March 22, 2004

Converging on a first approximation of performant

Or whatever the word for "not pathetically slow" is.

This is a post-mortem on one of the design decisions of parrot, how it's mutated from the beginning, and the reasons why. In this case, it's about the stacks.

To know why something is the way it is, it's important to have a handle on the history. Sometimes, of course, that boils down to "Seemed like a good idea at the time" but those are the uninteresting ones. Just picking the answer out of the air, well... those you expect to change, unless you've got amazingly good instincts, or are darned lucky.

While Parrot was a register-based design from the start (on purpose and with forethought, but that's an explanation for another time) everyone needs stacks--even though most languages are completely lacking in stacks as an exposed fundamental and essential component.1 And one of the things that invariably happens with register designs is you need to save large wads of data to the stack. Either when making a call or having just been called, registers need to be preserved. They're your working set, and presumably the contents are important. (If not, well, why bother in the first place? :)

Parrot's got several register types. This isn't anything particularly new--most processors have at least a split between general-purpose and floating-point registers. Because of GC concerns we split the general purpose registers into integer, PMC, and String as well. I'd toyed with the idea of a flag on each register marking its type so the various routines could use it appropriately, but that seemed both error-prone and slow, since we'd be twiddling flag bits on every store, and checking them on every load and every run of the DOD. (I also worried about the possibility of security breaches on mis-set flags) We've a general-purpose stack and we could just push each register onto the stack as we needed to, but... our stack's typed, and that'd take up a lot of time. Besides, generally you'd just save all of one type of register, rather than one or two. (Since the major use of the stack would be to save temps across function calls) Thus the register backing stacks were born. This was part of the basic initial design.

With each register file having a backing stack, the next question was what to do with it. This is where the trouble we have now began.

Copying data is more expensive than not copying data. Pretty much a given, really, as it's a lot faster to not do something than it is to do something. For this I had an Aha! moment--why not have the top of register stack and the register set be the same place? That is, rather than having a spot where the registers lived for each interpreter, the registers would just be the current top-of-register-stack frame. If you have multiple frames in one chunk of the stack, well... so much is faster! Popping a register frame is just a matter of changing a pointer, and pushing a frame is also just a pointer change. A push that retains the current contents of the registers requires a memory copy, but that's OK--while you can pay the copying cost, you don't have to if you don't want to. The compiler can choose, for efficiency.
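Roughly, the scheme looked like this (a sketch with made-up names and sizes rather than the real declarations, and with chunk overflow handling left out):

  #include <string.h>

  #define FRAMES_PER_CHUNK 16

  typedef struct RegFrame { long i[32]; /* plus float, string, PMC regs */ } RegFrame;

  typedef struct StackChunk {
      struct StackChunk *prev;
      int       used;                    /* frames occupied in this chunk */
      RegFrame  frame[FRAMES_PER_CHUNK];
  } StackChunk;

  typedef struct Interp {
      RegFrame   *regs;                  /* the registers ARE the top frame */
      StackChunk *top_chunk;
  } Interp;

  /* Plain push: don't care about the old contents, so just bump the counter
     and move the pointer. */
  static void push_frame(Interp *in) {
      in->regs = &in->top_chunk->frame[in->top_chunk->used++];
  }

  /* Push that preserves the current registers: same thing plus one memcpy. */
  static void push_frame_copy(Interp *in) {
      RegFrame *old = in->regs;
      push_frame(in);
      memcpy(in->regs, old, sizeof *old);
  }

  /* Pop: move the pointer back down.  (Underflow handling omitted.) */
  static void pop_frame(Interp *in) {
      in->top_chunk->used--;
      in->regs = &in->top_chunk->frame[in->top_chunk->used - 1];
  }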

That made stack ops pretty darned fast, which was keen. The downside is that it required two pointer indirections and an add to get at a register, but no biggie, right? Well...

This was the point when Daniel Grunblatt (Now he's doing web traffic analysis work--can't hurt to pop over and see if he does what you need) proved me wrong by implementing the first version of Parrot's JIT. Daniel asked the fateful question "Can I make the JITted code specific to an interpreter instance? It'll be faster." I said yes, since most of the code parrot'll be running is single-threaded, one-shot, non-embedded stuff. There'll only ever be one interpreter for the life of the process. With that one assumption, Clever People (like, say, Daniel) will realize that if the registers are in the interpreter structure then the JIT can use absolute addresses to get to them. No math, no pointer indirection, just encode the actual memory address of the register. So Daniel asked if we could make 'em not move.

While I'm not always quick on the uptake, I can recognize a good thing when someone makes it blindingly obvious and smacks me in the head with it. We moved the register file into the interpreter structure, which threw out one pointer indirection on each register access, at the cost of requiring a memcopy to push and pop a register frame to the stack. Given that accessing the registers happens a lot more frequently than pushing and popping, and that most pushes would have a memcopy anyway, it was an obvious win. Sped up the JIT a lot (something like 10% IIRC) and as a bonus sped up the interpreter by about 3%. Whee!

We still had that chunked stack, though, that could have multiple register frames in each chunk. That still made sense, since we wanted to minimize the time required to save off a register frame, and with multiple slots in a chunk most saves were a counter test, memcopy, and counter twiddle. Which is pretty fast.
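After the move, the shape is more like this (still a sketch with made-up names; the real thing has all four register types and more bookkeeping). Since the registers now sit at a fixed offset inside the interpreter structure, a JIT that knows the interpreter's address can bake the register addresses straight into the generated code:

  #include <string.h>

  #define FRAMES_PER_CHUNK 16

  typedef struct RegFrame { long i[32]; /* plus float, string, PMC regs */ } RegFrame;

  typedef struct StackChunk {
      struct StackChunk *prev;
      int       used;
      RegFrame  frame[FRAMES_PER_CHUNK];
  } StackChunk;

  typedef struct Interp {
      RegFrame    regs;                  /* registers live *in* the interpreter now */
      StackChunk *top_chunk;             /* backing stack only holds saved copies   */
  } Interp;

  /* Saving the registers is now always a copy, but the common case is just
     a counter test, a memcpy, and a counter twiddle.  (New-chunk allocation
     omitted.) */
  static void save_frame(Interp *in) {
      StackChunk *c = in->top_chunk;
      if (c->used < FRAMES_PER_CHUNK)
          memcpy(&c->frame[c->used++], &in->regs, sizeof in->regs);
      /* else: chain a fresh chunk onto the stack, then do the same */
  }

  /* Restoring is the copy going the other way.  (Empty-chunk handling omitted.) */
  static void restore_frame(Interp *in) {
      StackChunk *c = in->top_chunk;
      memcpy(&in->regs, &c->frame[--c->used], sizeof in->regs);
  }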

Enter Continuations. While we'd always planned on them, I didn't understand them, at all. It was a vague, handwavey "Yeah, we'll do those" thing. The folks on the LL1 list set me straight, and in time2 I became a convert. Continuations were everything we needed to wrap up state for sub calls. There's a lot involved in sub calls once you factor in security, loadable oplibs, lexical pads, and lexical global namespaces, and continuations let us wad that all up into one single thing, which was great--fewer ops, and future-proofing too, since we could add more to the context without breaking old code. (It's tough to save something you don't know about if you're forced to explicitly save individual things)

Continuations, though... those require some stack "fun". More specifically, you need to snapshot the stack at the point you take the continuation. Or, in our case, stacks. We could copy them all, but... yech. Besides being slow, we made the jump to a full continuation passing style and no matter how shallow the stacks we really don't want to have to copy them for each call. Slow, very slow. Instead we decided to use the COW system, and just mark the stack chunks as closed and read-only.
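Sketched out (made-up names again), closing the stacks for a continuation is cheap, but it pushes a check onto every later stack write:

  #include <stdlib.h>
  #include <string.h>

  #define FRAMES_PER_CHUNK 16

  typedef struct RegFrame { long i[32]; } RegFrame;

  typedef struct StackChunk {
      struct StackChunk *prev;
      int       used;
      int       cow;                     /* set once a continuation captures us */
      RegFrame  frame[FRAMES_PER_CHUNK];
  } StackChunk;

  /* Taking the continuation is the cheap part: walk the chunks and mark
     them closed and read-only. */
  static void capture_stack(StackChunk *top) {
      for (; top; top = top->prev)
          top->cow = 1;
  }

  /* ...but now every push or pop has to go through something like this
     first, copying the chunk if it's been captured.  (Allocation failure
     handling omitted; the caller re-links the copy as the new top.) */
  static StackChunk *writable_chunk(StackChunk *c) {
      StackChunk *copy;
      if (!c->cow)
          return c;
      copy = malloc(sizeof *copy);
      memcpy(copy, c, sizeof *copy);
      copy->cow = 0;
      return copy;
  }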

This unfortunately has three side effects. First, it means that each and every write to a stack chunk needs to check the COW status of the chunk. Ick. Second, it means that if the chunk isn't empty (which is likely) we'll need to copy it if we do anything to it--push or pop. Double Ick. Third, it means that most stack chunks will be at least partly empty, which is wasteful--exactly what the original design was meant to avoid. (And we really bodged up the stack chunk COW too, so we were actually busted for quite a while)

That leaves us where we are today. We have a design that was originally very efficient for what it did but that changed, because of changing requirements, into one that's slow and wasteful. Luckily we know about it before we get to the Pie-thon, so it can be fixed before pastry flies, which is good. Doubly lucky, we can actually fix it without any bytecode knowing the difference, which is also good. I like behind-the-scenes fixes. Yay, good layers of abstraction.

FWIW, the new scheme, which is being twiddled on now, is to go to a one-frame-per-chunk stack--that is, each push to the stack allocates a new element. The nice thing here is that it totally eliminates the need for COWing the stack--pushes always allocate a new element which is guaranteed writable, while pops always fiddle with the stack pointer in the interpreter struct which is also always guaranteed writable. Skips the COWing altogether, which takes out a full code path in the push/pop, and since we can allocate frames using the PMC arena allocation code (which is also fast, something like 6-8 cycles to allocate a new PMC if the free list isn't empty) we don't have to write nearly so much code.
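A sketch of where that's headed, with a trivial free list standing in for the PMC arena allocator and made-up names for everything:

  #include <stdlib.h>
  #include <string.h>

  typedef struct RegFrame { long i[32]; } RegFrame;

  typedef struct StackElem {               /* exactly one frame per element */
      struct StackElem *prev;
      RegFrame          saved;
  } StackElem;

  typedef struct Interp {
      RegFrame   regs;                     /* registers still live in here */
      StackElem *stack_top;
      StackElem *free_list;                /* stand-in for the PMC arena   */
  } Interp;

  static StackElem *new_elem(Interp *in) {
      StackElem *e = in->free_list;
      if (e)
          in->free_list = e->prev;         /* fast path, like the arena free list */
      else
          e = malloc(sizeof *e);           /* (failure handling omitted) */
      return e;
  }

  /* A push always gets a fresh, guaranteed-writable element, so there's no
     COW check on this path at all. */
  static void save_frame(Interp *in) {
      StackElem *e = new_elem(in);
      memcpy(&e->saved, &in->regs, sizeof in->regs);
      e->prev = in->stack_top;
      in->stack_top = e;
  }

  /* A pop just moves the interpreter's stack pointer.  Reclaiming dead
     elements is the collector's job, which is also what lets a continuation
     keep its captured frames alive. */
  static void restore_frame(Interp *in) {
      StackElem *e = in->stack_top;
      memcpy(&in->regs, &e->saved, sizeof in->regs);
      in->stack_top = e->prev;
  }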

Stack ops still aren't free, but, well... what is?

1 Which is weird when you think about it. Phenomenally useful for implementation of a language and yet totally unused in the language itself. Go figure.
2 Though not in time for the first edition of Perl 6 Essentials. D'oh!

Posted by Dan at 03:55 PM | Comments (2) | TrackBack

March 11, 2004

Object support moves along

We may have to cut a 0.1.1 release of Parrot soon, at the rate things are going. We have runtime loadable bytecode (though we need to get docs written) and more IMCC support for objects--specifically method tags for subs (which sets up the C name), namespaces in loadable bytecode (which makes loading classes at runtime so much easier), and syntax for method calls.

Freaky. Really. I'm going to have to get cracking and get the method cache and object internals sped up, at which point maybe we will cut a release...

Posted by Dan at 10:32 AM | Comments (1) | TrackBack

Too clever by half

On the one hand, I do appreciate that my ISP does security scans, presumably checking for vulnerable and compromised machines on its part of the network. Yay them. On the other, I really wish they wouldn't do this from their nameserver machines. Having all the machines in your resolv.conf black-holed for security violations makes life somewhat less fun than one might otherwise like...

Posted by Dan at 09:26 AM | Comments (1) | TrackBack

March 09, 2004

The joy of real work

One of the nice, if unexpected, side-benefits of having a Real Job (ObPlug: Yarde Metals, a quite nice place to work) that's using Parrot is that I'm writing real application code to use Parrot. That means I'm dealing with it on all levels, from the guts of the interpreter through data to compiling to it and hand-writing runtime library code for the thing, with an existing target application to test with. (The day job is to write a compiler for a language called DecisionPlus, which the whole application suite at work is written in; the compiler targets something besides the limited back end that's currently used. Parrot fits the bill really nicely, which is cool)

You may think that this isn't that big a deal, and that there's really no advantage or difference in perspective between designing a back-end system with a compiler in mind and designing a back-end system with the whole chain to the end application in mind, but that turns out not to be the case. In part it's a matter of needs (end application code has different needs than general library code) and in part a matter of emphasis (I'm working on code that might otherwise not get looked at so much, or so soon)

Objects are a big example there. It's probably no secret, if you've read back a ways, that I don't like objects. Yes, I know, they're rainbow-colored, fruit flavored, come in a variety of pleasing sounds with a nice mild vanilla scent, but... Bleah. Can't stand the things. It's a deep, unreasoning despisal, completely lacking any rational basis, I'm sure it speaks poorly of me as a programmer, and reflects badly on my underlying character. I can live with that. Objects suck.

Still... I need them. Not for the pie-thon--my dislike of objects outweighs the threat of pie. Work, though... for work, through a quirk of where things stand in Parrot, objects will make my life significantly easier. Significantly, meaning possibly several months overall, and a few thousand lines of C and/or PIR code, though a lot of that time savings comes down the road. And so... we have objects. And it's all Yarde's fault, dammit.

Yay us, I guess. Think of Parrot every time you walk on tread plate. :)

Posted by Dan at 05:07 PM | Comments (6) | TrackBack

March 08, 2004

Checkoffs for the Pie-thon

Just in case anyone's keeping track, here are a few things that are working:

  • Bytecode loading
  • Objects
  • Method calls
  • Operator Overloading

Oh, yeah, and all that pesky math stuff, subroutines, and strings. They work too, though Unicode needs some help. And variables. We can do variables.

Still need some work on nested namespaces, actual bytecode translation, and object performance, but...

Posted by Dan at 02:04 PM | Comments (8) | TrackBack

March 05, 2004

More destruction!

So, I wasn't too clear in the last post about the problems with finalization. So here's some more detail.

If you remember, there are three ways to handle finalization. You can call every finalizer method in a dying object's hierarchy, you can call the finalizer method like any other method and count on code to properly redispatch, or you can disallow finalization methods altogether and instead pass in relevant bits of the object to an external cleanup routine.

Now, let's first assume that we can do redispatch properly. That means with a class hierarchy like:

    A   B
     \ /
      C

If A, B and C all have a method foo, doing SUPER.foo in C calls A's foo, and if that method then calls SUPER.foo then we'll invoke B's foo. (Note that perl 5's SUPER doesn't do that, you need to use NEXT instead) This is an important assumption for systems with multiple inheritance. (With a single-inheritance system it isn't an issue)

And, so we have a nice diagram to refer to, let's assume our class hierarchy looks like:

A   B  A   E
 \ /    \ /
  C      D
  \     /
   \   /
    \ /
     F

Note that A is in the hierarchy twice. We'll assume, for the sake of argument, that all the classes have DESTROY methods.

So, how do we finalize an object of class F?

Well... we could call the DESTROY method in F, whatever that might be called. That means we call F's DESTROY. If all the methods SUPER properly, that means we call them in the order F, C, A, B, D, E, or F, C, A, B, D, A, E, depending on whether we prune the tree or not. For the sake of argument we'll assume that we do prune the tree, as that's the common thing to do.

There is, right there, a problem. If we call finalizers in normal method order, then A gets cleaned up before D has a chance to clean itself up--if D depends on its parent classes being still in good working order (not an unreasonable assumption, even in a finalizer) then you're hosed because it isn't--it already cleaned up after itself. The sensible thing to do in this case is to have a separate traversal scheme for finalization. Yeah, special cases are sub-optimal, but it beats screwing things up. Of course, if the DESTROY method in E reparents the object into the root set, well... you're in some serious trouble anyway.

The big problem with this scheme is one of trust--what happens if some DESTROY method, say C's, decides not to call SUPER.DESTROY? Well... then you get an object that's only partially finalized, and it's possible it'll leak some resources. (This, interestingly, is not common in practice since usually each resource that needs active finalization has a wrapper object, so the worst that happens is that those objects get cleaned up by their own destroy rather than the finalize of the wrapping object. This is a good reason to not have one object wrap multiple bare things)

The big advantage to this scheme is that you can actually call your parent finalizers before you're done with your own finalization. At least, I assume this is an advantage. I dunno, I don't do objects.

The alternative means of doing finalization is to make sure we call all the finalizers for the object. No redispatch or anything, the object cleanup code just scans the object's class hierarchy and calls all the DESTROY methods. For our example class, we'd likely call the finalizers in F, C, D, A, B, E order, as we'd prune the tree and call from shallowest to deepest. If a class is in the hierarchy at multiple depths we use the deepest version of it.

This alternate method actually works pretty well, since you're insulated from potential problems in the tree--you don't have to worry that your child classes might forget to redispatch or something. It can also be done reasonably quickly, as you can actually cache the traversal order and methods in the class somewhere and not have to do any dynamic lookups, but that's not likely a huge issue. (If you have issues with dispatch speed for finalization you've probably got bigger issues) There's still the issue of reparenting--it's possible once again for that pesky E class to reparent the object that's now mostly dead.
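If you're curious what caching that traversal might look like, here's a sketch (illustrative C, nothing like Parrot's actual class guts, and with bounds checks omitted). For the example hierarchy it comes out to exactly the F, C, D, A, B, E order above:

  #define MAX_PARENTS 4
  #define MAX_CLASSES 16

  typedef struct Class {
      const char   *name;
      void        (*destroy)(void *obj);          /* this class's finalizer, if any */
      struct Class *parents[MAX_PARENTS];
      int           n_parents;
      struct Class *finalize_order[MAX_CLASSES];  /* cached, computed once */
      int           n_finalize;
  } Class;

  /* Walk the inheritance DAG, remembering the deepest depth at which each
     class shows up. */
  static void find_depths(Class *c, int depth,
                          Class **seen, int *depths, int *n) {
      int i, found = 0;
      for (i = 0; i < *n; i++) {
          if (seen[i] == c) {
              if (depth > depths[i])
                  depths[i] = depth;
              found = 1;
              break;
          }
      }
      if (!found) {
          seen[*n] = c;
          depths[*n] = depth;
          (*n)++;
      }
      for (i = 0; i < c->n_parents; i++)
          find_depths(c->parents[i], depth + 1, seen, depths, n);
  }

  /* Order the classes shallowest-to-deepest and stash the result in the
     class, so object cleanup is just a walk down a precomputed list. */
  static void cache_finalize_order(Class *c) {
      Class *seen[MAX_CLASSES];
      int    depths[MAX_CLASSES], n = 0, d, i;

      find_depths(c, 0, seen, depths, &n);
      c->n_finalize = 0;
      for (d = 0; c->n_finalize < n; d++)
          for (i = 0; i < n; i++)
              if (depths[i] == d)
                  c->finalize_order[c->n_finalize++] = seen[i];
  }

  static void finalize_object(void *obj, Class *c) {
      int i;
      for (i = 0; i < c->n_finalize; i++)
          if (c->finalize_order[i]->destroy)
              c->finalize_order[i]->destroy(obj);
  }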

The third way is to attach a closure of some sort to the object, or to each class, that takes some of the data out of the object. When an object is slated for destruction you yank out the bits and call the closure. The advantage here is that, unless your object is self-referential, there's no way to reparent the thing. (And if it is, the assumption is that you're really out of luck and reparenting will just fail, or die horribly, or something of the sort)

Each of these methods is fine. Personally I prefer the automatic calling of the finalize methods, so I don't have to worry about forgetting or screwing things up. (There's that whole 'encapsulation' thing--how can a child class have any clue as to whether my parent class should or shouldn't clean up? It shouldn't know, as it's just none of its business) The nasty bit for Parrot is the mix'n'match issue.

I'd love to choose just one, but I can't--we've committed to making perl 5, perl 6, python, and ruby work. Perl 5 and python use the "you better delegate right" scheme, Perl 6 uses the "call 'em all!" scheme, and ruby uses the closure scheme. So... what, then, do we do if we mix it up?

For example, in our diagram, let's think that F and B are Python classes, A is a Ruby class, and the rest are perl 6 or C++ classes. And, just to make it more difficult, B doesn't redispatch its DESTROY method. (After all, why should it? It's a top-level class) Contrived? Maybe. We are pushing Parrot's interoperability, though, so someone'll do it. And even if it doesn't get so bad, there'll be plenty of perl 5/perl 6 mixing, and you'll likely see perl 6 on the top and bottom with perl 5 in the middle in a lot of cases. (As perl 6 is used for new code, and people start refactoring old code from the bottom up)

The table looks like:

    Class   Type      Redispatch?
    F       Python    Yes
    C       Perl 6    Auto
    D       Perl 6    Auto
    A       Ruby      N/A
    B       Python    No
    E       Perl 6    Auto

The question, then, is... what gets called? And in what order?

The first thing that springs to mind is taking them by group--call the Python-style finalizer, then call the automatic finalizers, then call the ruby finalizers. That, though, will destroy the object out of order--there'll be classes where the parent class attributes are gone before the child class gets to do its thing. That's A Bad Thing. No joy there.

The next thing to do is assume that the C++/Perl 6 finalizers just automatically redispatch, as does the ruby finalizer. But... in that case, B's failure to redispatch means we never call E's finalizer. That's bad too.

The third thing to do is redispatch anyway if there are automatic finalizers left to run, even if a manual finalization doesn't redispatch. But... what if the finalization didn't redispatch on purpose? There may be a reason. (I'd not bet on a good one, but I don't get to not implement things based on judgement calls on other people's code, even if it really sucks. Still gotta make it work)

You could continue the redispatch if a manual method didn't redispatch, but only go for the automatic methods, which wouldn't be too bad, but it still sits poorly with me. Ick. I don't, I'm afraid, have a good solution. There may not be one, in which case it's a matter of choosing the least bad. Or punting this to Larry, Guido, and Matz and letting them hash it all out.

Finalization. Who knew death would be so darned complex and annoying?

Posted by Dan at 04:55 PM | Comments (6) | TrackBack

March 01, 2004

So how do you kill these darned things anyway?

One of the problems with objects is that, like cats, they occasionally leave dead things under the sofa that you need to explicitly clean up. Generally called finalization (or incorrectly called destruction, but that's a separate issue) this cleanup is a handy thing, as otherwise you'd have all sorts of crud building up as your program ran. Finalization's usually used for things that aren't memory--filehandles that need closing, database connections that need closing, handles on external library resources that need some sort of cleaning up, or weak references you want breaking so other dead things can be found.

Like so many other object things, there's the inevitable fight over what the "right" way to do things is. So far as I can tell, there are three ways, to wit:


  • The C++ way
  • The python/perl 5 way
  • The Ruby way

The C++ way, which is also the way that Perl 6 is going, says that "If a class has a finalize method, you call it". If your object is part of a class that inherits from 5 or 6 different other classes all with finalization methods, well, you call 'em all. Generally you'll call them from most to least derived in the hierarchy, so you don't destroy any underlying structure before the top bits of the object are tossed. This scheme apparently has some issues, mainly I expect if you're looking to completely take over the process of destruction from a parent class. Whether or not you should do this... let's not go there.

The Python/Perl 5 way (in that order, since Larry did swipe it from Guido) is to have the finalizer be a method just like any other method: you look for it and you call it if you find it. If you want your parent methods called, well... you'd darned well better redispatch the call or you're in a lot of trouble. Needless to say, people are sometimes in a lot of trouble. Compounded in perl 5 by a sub-optimal redispatch method, though the NEXT module's finally made it possible to do things right, if only in new code. (Perl's SUPER only looks at the parent classes of the current class, so if you're on the left-hand leg of a multiple-inheritance tree when you SUPER you'll never see the right-hand leg. I don't know if Python is similarly broken)

Both schemes, as they're actual object methods, have the interesting property of potentially ending up reparenting the dying object, thus making it not dead any more. This can be something of an issue if the object is half-dead when some finalization code decides that the object isn't really dead, and you have some half-deconstructed object lurching across the system like an extra from one of the Dawn of the Dead movies. (Or a bad freshman english essay) The Frankensteinian possibilities tend to keep people from doing this, though, something I approve of.

Ruby brings a third scheme to the table--rather than a method on the object, you have a finalization closure that gets called when the object dies, and bits of the object are passed in. This way you have a means of cleaning up without actual access to the object, so there's no chance of bringing it back to life--if you want it not-dead you'll have to clone the thing rather than resurrect it.
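In C-ish terms, purely as an analogy (this isn't Ruby's actual mechanism), the shape of it is something like:

  #include <stdio.h>

  typedef struct Finalizer {
      void (*cleanup)(void *bits);   /* the closure...                 */
      void  *bits;                   /* ...and the bits it was handed  */
  } Finalizer;

  /* What the collector would run when it finds the object dead.  The object
     itself never appears here, only the bits squirreled away at creation
     time, so there's nothing to resurrect. */
  static void run_finalizer(Finalizer *f) {
      f->cleanup(f->bits);
  }

  /* e.g. an object wrapping a stdio handle would register this, plus its
     FILE pointer, at construction time */
  static void close_handle(void *bits) {
      fclose((FILE *)bits);
  }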

Anyway, three ways to pick up the trash. Each can be dealt with, and each has its drawbacks when taken individually. The real fun comes in when you mix and match, because how then do you satisfy the constraints and expectations of all the classes in an inheritance hierarchy? More importantly, how the heck do you order them?

If someone's got a good answer (besides "don't do that!") I'd love to hear it...

Posted by Dan at 05:17 PM | Comments (7) | TrackBack

More and more on objects

With the first release to support objects out, the question is now "What the heck can you do with the things?" (And no, I don't know why people ask me these sorts of things, since I don't really like objects. Go figure)

That's a good question. Mostly a better question than "What will I ultimately be able to do with the things" since who knows, maybe I'll give up half-way through and say good enough. It's been known to happen before. (With objects in general, not with me and objects specifically, but that's a separate issue for another day. Tomorrow, maybe)

So, we've got a system where we can call methods, though only specific methods with no fallbacks. We can have classes with one or more parents. Each class in a hierarchy can have as many private slots in an object as it wants. We have a namespace system that works OK as long as you only have a single level of namespace. (And it's possible that, by the time you get around to using namespaces, it won't even be an issue, as getting this fixed up right is high on the list o' things that need doing)

It is, oddly, amazing what you can do with this sort of thing. It's enough to support many statically typed object systems, though you do need to disallow operator overloading. The lack of implicit construction's a bit nasty, but you can actually get around that with the compiler if you so choose. No implicit destruction's more problematic. On the other hand I have at least three separate, potentially conflicting, destruction schemes. (Don't ask, you don't want to know. I'll tell you anyway, but later) Can't make do without destruction, though, not really. Sorta.

Which means that parrot objects are at a state where it's worth digging in in preparation for 0.1.1, which will have construction, destruction, and fallback methods, but probably not an end unto itself. Yet. ;)

Posted by Dan at 02:29 PM | Comments (0) | TrackBack