October 21, 2004

What the heck is: Finalization

Chris Brumme made a blog posting (ages ago--this has been sitting in my pending queue for a while) that reminded me about this, so I thought I'd go into it before I forgot. (I'd recommend reading that link, too--while it deals only with finalization in a .NET environment, and Microsoft's .NET environment specifically (Mono and dotGNU may well have different details), it gives a good overview of some of the more... interesting issues that can be brought up by finalization)

Anyway, finalization is the process of letting an object that's now dead have one last shot at cleaning up after itself. Finalization is not the same thing as destruction, though the two terms are often used interchangeably, and in many systems they occur together. For the record, while finalization is letting an object clean up after itself, destruction is the system managing the object reclaiming the resources it uses. If you want a concrete example, consider the humble filehandle object. This is an object that represents a file. Moreover, it automatically flushes the buffers and closes the file when the filehandle is no longer referenced. Not unusual behaviour for a filehandle. (Well, at least not in perl. Other languages may vary) The finalization for that object is the closing of the underlying OS file. The destruction of the object is the system deallocating the memory for the buffers and putting the now-dead object on the object free list for later reallocation. Almost all object systems allow you to have one or more finalization methods for an object. These finalizers are optional.
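To make the filehandle example concrete, here's a minimal sketch in Python (the class name and file path are made up for the example): the __del__ method is the finalizer, flushing and closing the underlying OS file, while destruction--reclaiming the object's memory--is the runtime's job.

class LogFile:
    def __init__(self, path):
        self.fh = open(path, "w")

    def write(self, text):
        self.fh.write(text)

    def __del__(self):
        # Finalization: one last shot at cleaning up the external resource
        if not self.fh.closed:
            self.fh.flush()
            self.fh.close()

log = LogFile("out.txt")
log.write("hello\n")
del log   # no references left; the finalizer runs, then the memory is reclaimed

(In CPython, where refcounts are used, dropping the last reference runs the finalizer right away; other implementations may defer it until a collection happens.)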

So, when the system decides the object needs to die, the finalizer is the routine that gets called to do any last-gasp cleanup.

Simple, right? Well... for you maybe. Finalizers are one of those things that give folks doing VM and runtime design absolute fits, at least when they're coupled with automatic memory management.

In a language like C++, where objects only die when the code tells them to die, things aren't too bad. (Though there are still issues, or so I'm told) With a system that does more active garbage collection, though, things get nasty. You have issues of reanchoring, finalization time, finalizer object usage, resource allocation, and environment availability. Sort of. Especially when combined with a need for speed and efficiency.

But automatic memory management is so useful that the problems are generally worth it, especially in a multithreaded system where the nondeterminism gets so bad there's no sane way to do your own memory management. (Not that writing a GC for a threaded system's at all easy, but that's a separate problem) Different languages solve the problems in different ways, based on system requirements, the amount of work someone was willing to put into the system, or how much of the problem the designer ultimately understood (or was willing to allow that app programmers would understand). Still, you've got problems.

The Problems, in no particular order

Before going further, it's worth noting that not all these problems affect all systems. Some of them (like reanchoring) are inherent in finalizers, while others, such as resource constraints, are issues because of choices made when designing the GC system. Depending on how the system you're using is implemented you may only have to deal with some of these problems.

Reanchoring

Reanchoring is when an object's finalizer brings it back to life. For example:

FINALIZE {
    a = global 'Foo'
    a[12] = self
}

That is, the finalizer for the object goes and looks up a global array and sticks itself into that array. That makes our dead object... not dead. It's now anchored, and if the code that handles calling the finalization doesn't notice, the object'll get deallocated and the memory thrown into the free pool, and now a[12] holds garbage. Not good, as you might imagine, and detecting it can be tricky in a system that doesn't use refcounts to track object usage. Or expensive. Sometimes both. (The 'easy' way is to have a "mostly dead" flag you set on objects with finalizers: if, after the finalizers have run, the object is still unreachable, you reclaim it. That, or use reclaim queues)
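Here's a minimal sketch of reanchoring in Python (the names are invented for the example): the finalizer stuffs the dying object into a global list, so it's reachable again afterwards.

graveyard = []

class Zombie:
    def __del__(self):
        # Reanchor: stash a reference to ourselves somewhere global
        graveyard.append(self)

z = Zombie()
del z                 # the finalizer runs as the last reference goes away...
print(graveyard)      # ...and the object is alive again, anchored in 'graveyard'

(Python 3.4 and later takes roughly the "mostly dead flag" approach: a resurrected object sticks around, but its finalizer won't be run a second time when it eventually dies for real.)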

And, of course, you have issues of safety -- can you actually reanchor an object in the middle of finalization? Often you can't, since the object may well be partially destroyed. This'll happen in those cases where several of an object's finalizer methods have fired and then one of them decides to reanchor. (Since you're firing off all the finalizers in a class' hierarchy -- OO encapsulation makes it such that you really do need to execute all the finalizers the same way you need to execute all the initializers)

Of course, actually catching a reanchor's a tricky thing too, potentially fairly expensive. You almost want to wrap all visible global objects in virtual saran wrap, so they can be looked at but not touched. Which isn't easy.

Finalization Time

Finalization time is an issue for realtime systems, but can still crop up other places. This is where the finalizer takes a large, potentially unbounded, amount of time to do its finalization. (Or, worse yet, just up and blocks) Not too big a deal for realtime systems, since if you're working real time you're taking all sorts of precautions anyway, but still... a pain.

The other issue with long finalizers is that generally all other threads in your process trying to allocate resources will be blocked until the finalizer finishes. Finalizers run when objects are being cleaned up, and that generally happens because an allocation has failed. If you have a single finalization thread and multiple 'real' threads (that is, threads actually running the program, as opposed to housekeeping threads like one running the GC) you can stall a good portion of your program unpredictably, which isn't a good thing.

Finalizer object usage and resource allocation

One 'fun' issue with finalizers is that they're generally resource-constrained. That is, they have only a limited amount of memory or free objects to access, with that limit often approaching zero. Not too surprising -- the garbage collection run that found the dead objects needing finalization was likely triggered by resource exhaustion. (Not always, of course, since at this point pretty much everyone that doesn't use refcounts does some sort of continuous collection) This makes for fun coding, since it's tough to do much in languages with finalizers that doesn't involve allocating something. (Don't forget, in OO languages your temporaries are all objects, likely dynamically allocated)

Environment Availability

When finalizers are run they're often run with some restrictions. In systems with GC running in a separate thread there are sometimes a lot of restrictions. If you've a language that guarantees thread-local data, well... you don't have access to it in the finalizer, since you're in a different thread. Some languages place restrictions on what each thread can see or touch across threads. And even in languages that don't mind, you've got the issue that what was a single-threaded program is now potentially a multi-threaded program, and you either need to have fully synchronized access to your data or get ready to core dump.

Even without a separate GC thread, often the GC system is running in what's essentially a fragile state, akin to the environment that a Unix signal handler or VMS AST handler has. There are many system and library calls you just can't make, because you can't be sure of the state of the internals of the library or system. (If a GC sweep was triggered because you ran out of something, you might find you ran out in the middle of doing something that's not re-entrant, with data structures half-modified)

All of this is why people writing VMs and object systems in general have a hate/hate relationship with finalizers. If only the darned things weren't so useful...

Posted by Dan at 01:57 PM | Comments (9) | TrackBack

November 29, 2003

What the heck is: A type

Update: I forgot to finish the type inferencing section. D'oh! Fixed.
It's a good excuse for a fight, at least in some circles. And papers. Lots and lots of papers.

But, more usefully, a type is a notation on some piece of data that marks a promise as to how that data behaves. (If you're looking for sheep, try elsewhere) They may be an indication of the storage that a piece of data needs, the class an object belongs to or inherits from, or the values that data may take.1

Types, when used well, allow you to avoid, or at least reduce the incidence of, a whole class of errors in your programs, so called "type errors". These occur, as you've probably figured, when you're using the wrong type--trying to call a method on an integer, or do addition with two filehandles or something like that.

With a properly designed type system it's possible to prove at compile-time that type errors are impossible in your program. How useful this is depends both on how rich your type system is (if the only type you have is "int", it's of limited use) and how much use your compiler can make of the type information.

Static and Dynamic Typing

These two are the poles (or perhaps lightning rods) of type checking. Simply put, static typing assigns and checks types at compile time, while dynamic typing assigns and perhaps checks types at runtime.

Before going any further, it's important to realize that Pure Static and Pure Dynamic typechecking are two ends of a continuum. With a very few, generally pedagogical, exceptions no language is on one end or the other. Pretty much all languages fall somewhere in between. Whether this is a good or bad thing is a matter of opinion, but we'll dodge that question for now.

Note that a language can be statically typed yet still have some level of runtime dynamism, such as function or method redefinition or instantiation, reference or pointer based function or method calling, and runtime compilation of code. The language just needs to ensure that any assertions made at compile time remain correct in the face of runtime changes--function signatures may not change with a redefined function, for example, and in the face of compile time uncertainty (such as dispatching through a function pointer of unknown origin) that the constraints of the type system are enforced at runtime. More on this later.

C is an adequate, or at least nearly ubiquitous, example of a statically typed language. The types of all variables must be declared, the return types of all functions must be declared, and the parameters to all functions must be declared. (That it's possible to lie in many profound ways to the C compiler is a separate matter we'll deal with in a bit) The compiler knows how everything should be treated, and can act accordingly. For better or worse.

Python (and Ruby, and Smalltalk), on the other hand, is a good example of a dynamically typed language. In python the types of variables, values, parameters, functions, and methods are not known to the compiler, and in fact may change as the program runs. (Indeed, with these languages there arguably aren't any "variables" as such--just names which are bound at runtime to different things as the program runs.2) In languages such as this, if the type of something is important generally the programmer has to explicitly check.
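As a small illustration of what dynamic typing looks like in practice, here's a Python sketch: the names carry no declared types, they're simply rebound to values of whatever type comes along, and anything that cares has to check (or just try it) at runtime.

x = 42            # x is bound to an integer
x = "forty-two"   # now a string
x = [4, 2]        # now a list

def double(v):
    return v * 2  # happily works for ints, strings, lists...

print(double(21), double("ab"), double([1]))   # 42 abab [1, 1]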

Before going off on the "But the behaviour of an object is more important than its type!" rant, assuming you conflate classes and types (which is OK by me), I should note here that with a sufficiently rich type system the behavior of a thing is reflected in its type. Which is to say that methods and interfaces that a class implements are as much a part of its type as the class' parents or the class itself. That's partial types, and we'll go there later. Very few languages support them as part of the static type information, unfortunately.

Strong and Weak Typing

Another quality of a type system is its strength--basically an indicator of how easily and to what extent your program can lie to the type checking and validation code and get away with it. A type system is considered strong if you can't lie, and weak if you can lie. (And broken (or at least laughed at by its peers) if it starts out strong and then gets officially hacked as time goes on to allow it to be weakened. A lot of languages fall into this category, as practicality asserts itself)

C is a good example of a weakly typed language. Yeah, sure, C has data types, but they're more suggestions, or indications of current usage, than anything else. As an easy example, look at this:

int foo;
double bar = 12.5;
foo = (int)bar;

It's a simple enough little chunk of code. foo is an integer, and bar is a double-precision floating point value. A straight assignment from one to the other is arguably illegal, as they're different types. (C will allow it, but it generally yells) That (int) bit in there, though, tells C to treat what follows as an int, rather than whatever type it is.

This is generally called type casting and there are two ways to do it. In this case, typecasting performs a conversion--C knows how to get the integer value from a floating point number and does so. With a conversion bar never loses its inherent float-ness--the bits that the name refers to are still considered part of a floating point number, we just then translate it to an integer. Casts can also change the way that the bits themselves are interpreted, for example:

int foo;
double bar = 12.5;
foo = *(int *)&bar;

will, rather than treating bar as a float and converting it to an integer, assume that bar is, in fact, really holding an integer and interpret the first four bytes of it as if it were an int. (More or less 4 bytes at least, depending on the size of an int) We have to play pointer deref and cast games here because C won't let you directly change the interpretation of the underlying data storage for a variable, which is probably for the best.

Perl, oddly enough, is a good example of a strongly typed language. (The rich--and arguably excessive--behaviour of its base types tends to confuse folks on that account) A scalar is a scalar is a scalar--no matter what you do you'll never convince perl to treat it as an array, and when you have a reference to an array you can't, unlike in C, treat it as a reference to a hash or scalar or filehandle. There's no casting possible, either directly or through references.

Most languages that do the C-style "interpret the bits differently" cast are low-level languages operating on either primitive types or explicit relatively primitive structures. Once you get abstract enough (and most OO languages are abstract enough) this low-level direct bit hitting stuff just isn't allowed as there's no way for it to be useful without an understanding of a number of very low-level implementation dependent things (like class and object layouts) that's outside the range of the normal. (Not to mention the fact that the folks writing those sorts of languages tend to be opposed, on general principles, to that sort of low-level access)

Higher-level type casting as conversion is more commonly supported, as it's often very useful--if for no other reason than to be able to turn strings (like, say, from user input) into structured data (like real integers, floats, or widgets). Opinions seem mixed as to whether this is a good thing or not, but practicality usually wins and it is supported, though generally it needs to be explicit. Very few languages will implicitly autoconvert or call autoconversion routines for you, since it's assumed that if you're assigning a value of one type to a variable of another that you've made a mistake, and it's up to you to tell the compiler otherwise.
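Python is a handy example of the explicit-conversion style: turning user input into a real integer takes an explicit call, and mixing the unconverted string with a number is an error rather than a silent conversion.

raw = "12"            # what you'd get from user input
n = int(raw)          # explicit conversion: the string "12" becomes the integer 12
print(n + 5)          # 17
# print(raw + 5)      # TypeError--no implicit autoconversion happens for you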

Partial Types

As we alluded to earlier, a type system can be far richer than just a plain "what class does this object belong to" or "Which structure are we allocating for you" thing. Types can also express behavior--what methods you have or interfaces you export--and ranges of permissible values. Since this means a full type may have more than one "thing" that defines it, you have cases where you don't really care about the full type--just part of the type.

For example, let's consider GUIs for a moment. It's common when doing GUI programming in a language that supports objects to have a class for each thing that can appear in a window, classes for windows, classes for events, classes for sounds, classes for actions--classes all over the place. As we're in an OO mood, each class is a type.

Now, consider the case where we're doing at least some level of static typing in our OO GUI language. The common thing to do, in a static type system, is to require each parameter to be a class or child of a class, which is fine, but... wait until you find yourself building objects that need to inherit from several parent classes at once. This isn't too tough to do--just throw serialization into the mix, and you'll immediately feel the pain. (The drawing methods require the parameter to inherit from some base Drawable class, and the serialization methods require their things to inherit from the base Serializable class)

The common way to deal with this is to ignore the heck out of all but one type constraint (Generally Serializable things take generic objects and probe methods at runtime) but, well.. ewww. The second-most common way to deal with this is multiple inheritance, but folks tend to get their rant on when dealing with MI. Partial types solve the problem. (And MI is a particular case of partial typing but Shh! don't tell)

With a type system that has partial types you declare that your drawing method must have a type of Drawable, but that doesn't preclude the object from having a type of Storable, Readable (if you've got a text-to-speech system), and maybe PhenomenallyUgly. (Though, with GUI systems, the last type is often a freebie)
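Python's typing.Protocol is one way to get something in this neighborhood (it's structural, interface-style typing checked by tools like mypy rather than enforced at runtime, and the class names here are invented for the example): the drawing routine only demands the Drawable part of the type, the saving routine only the Serializable part, and one class can satisfy both without inheriting from either.

from typing import Protocol

class Drawable(Protocol):
    def draw(self) -> None: ...

class Serializable(Protocol):
    def serialize(self) -> bytes: ...

class Button:                     # inherits from neither
    def draw(self) -> None:
        print("drawing a button")
    def serialize(self) -> bytes:
        return b"button"

def render(widget: Drawable) -> None:
    widget.draw()

def save(thing: Serializable) -> bytes:
    return thing.serialize()

b = Button()
render(b)    # fine: Button has the Drawable part of its type
save(b)      # fine: it has the Serializable part too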

With a type inferencing system (we'll talk about that in a bit) often you'll get partial types whether you know it or not, as they're handy things for optimizers--any information, even incomplete information, is better than nothing.

Runtime type checking

With some languages, compile-time typechecking is insufficient--there just isn't enough information at compile time to be sure that the rules of the type system are being followed. (This is especially true of type systems that include the value as part of the type) Languages like this generally provide runtime type checking, either automatically or via calls you can make. Runtime type checks are exactly what they sound like--as your program runs it checks the types of variables and values that get passed around.

This does, as you might expect, have a cost associated with it, since checks like this aren't free. (They're often quite cheap but even cheap adds up when you do it a million times or more) Runtime type checking also isn't restricted only to languages with some measure of dynamic typing associated with them--even the terribly strictly typed languages can have it. Rather it tends to be associated with languages with some measure of strong type checking. It's difficult to write a broad range of Interesting Programs without having at least some interface to the real world, and the real world tends to be a messy, typeless place. Runtime checks are used at the borders to make sure the data coming in conforms to the type information it's declared to have.

The more dynamic languages do tend to have more runtime type checking, though, since the more dynamically typed a language is the harder it is to check at compile time that the type invariants are enforced. If you can't tell, and you want to be sure, you have to resort to runtime checks, throwing type errors or exceptions if the invariants vary.

You'll often see, in dynamically typed languages (as well as statically typed languages with insufficiently expressive type systems), hand-rolled type checking at the start of functions and methods to make sure that the invariants the programmer wants expressed are expressed. Personally I've always disliked those, as they're both messy and error-prone, but often there's no good way to do anything else in the language at hand.
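Those hand-rolled checks tend to look something like this Python sketch (the function and its constraints are made up for the example):

def schedule(name, priority):
    # Enforce the invariants by hand, since the language won't do it for us
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    if not isinstance(priority, int) or not (0 <= priority <= 9):
        raise TypeError("priority must be an integer from 0 to 9")
    return (priority, name)

schedule("backup", 3)        # fine
schedule("backup", "high")   # raises TypeError at runtime, not compile time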

Type Inferencing

Type inferencing is when the compiler figures out what type most variables are and what functions return without the programmer necessarily having to explicitly type variables or functions. It's one of those neat tricks that a compiler can do because there's so much redundant information in most typed source code.

What the type inferencing engine does is trace through the source of your program, make type annotations based on the code, and use those annotations to find the places where you've made type errors, without you having to actually put type notations everywhere.

The first thought that might spring to mind is "what about that pesky halting problem thing?" and yes, it's generally impossible for a type inferencing engine to be 100% correct for 100% of the programs you'll feed it, but that's fine--it doesn't have to be perfect to be useful. For example, this:

sub foo(a, b) {
  declare c;
  c = a.some_method(b);
  return c**4;
}

q = foo(1, 4);
print substring(q, 1, 4);

That code has two type errors in it. First, foo returns a number of some sort, but we're doing a substring on it--unless we've got a language with lots of autoconversion that's not allowed. Second, the first parameter to foo must be an object, because we're calling a method on it, but when we invoke foo we're passing in an integer, and unless we've a language that assumes everything's an object that's not allowed.

It's clear, by inspection, that those errors exist, but in a dynamic language they won't be caught until runtime. Or... will they? A type inferencing pass on the source will pick them up, and they can be reported at compile time. That's a good thing, since generally the earlier an error is caught the better.

Of course, you can throw some uncertainty into the mix:

sub foo {
  if (rand(4) > 2) {
    return "a";
  } else {
    return 12;
  }
}

Still, that's OK. A simple type inferencer can just decide that the return type of foo is unknown and let that chunk of uncertainty flow through, in which case you're no worse off than if you'd not had the inferencing engine. Or... a Very Clever inferencing engine can see that it returns either an integer or a string and, while it can't be sure which, it can certainly know that if you use the return value as, say, a matrix, your code's broken.

Types and objects

Static typing gets somewhat interesting when objects are involved. Or, rather, things get somewhat interesting when subclassing is involved, if the type notation on a value might indicate just a parent type rather than the actual type.

If your type system allows for subtypes and your object system allows child methods to have different method signatures from the parent methods they obscure, it makes it very difficult to do full static type checking. In this case you have to resort to runtime type checking if you want to maintain those type invariants, which has a performance penalty, and tends to make it more difficult to make sure your programs are right to begin with.3 This is one of the reasons the more dynamic OO languages tend to not bother with type checking, as without a good type inferencing system you very quickly get to the point where there's so much uncertainty that types are more hindrance than help. (Even with a good type inferencing system in some cases)

What's the point?

Ultimately, what the heck's the point of types? (Besides the excuse for an argument, of course) There are three big ones.

First, from a theoretical standpoint, they may, in the right system, allow you to prove your code is correct4 at compile time.

Second, they help catch some typos, provide a bit of help catching API changes (the ones that involve changing the types of the parameters, though that's arguably a Horribly Bad Thing), and provide something of a memory crutch for using libraries. (I personally find that last one useful, since I find details like argument ordering occasionally dribble out of my brain for less-used functions)

Third, they help the optimizer, and this is the one I like the best. Having types available at optimization time can mean the difference between knowing that foo is a generic thing and foo is a platform-native integer, something that can mean an order of magnitude or more difference in speed of use and memory usage. Not, perhaps, a huge difference for one variable, but it's something that can add up quickly.

In object-oriented languages typing also helps tremendously with method lookup. Code like this:
b = a.foo
can have the foo method pre-looked-up with a typed system. Without sufficient information at compile time, method lookups all have to be deferred to runtime. That means that every time the code snippet above is executed the runtime has to take the string "foo" (or perhaps the magic method token that corresponds to "foo", if your language can do that) and search the class hierarchy looking for that method. A good method cache can let you skip all but the first hierarchy search for a method, but you can't skip them all, and a lookup, even with a precomputed hash of the method name, is somewhat expensive.

On the other hand if, at compile time, you know the structure of a class hierarchy and the types (even just the base type) of a variable, you can build a vtable for each class, with one slot per method, and hang it off each object (or the class, at the cost of an extra dereference) and the runtime cost of the method dispatch is just an indirect array access, which is definitely a lot cheaper than any sort of string lookup. All you need to do to make that work is make sure that methods of the same name are always in the same slot for each branch of the tree, which isn't tough. (Just a bit of compiletime bookkeeping, something computers are reasonably good at)
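Here's a toy sketch of the vtable idea in Python (slot numbers, classes, and methods are all invented for the example): methods of the same name live in the same slot throughout the hierarchy, so a dispatch is just an index into an array of function pointers, with no name lookup at call time.

FOO_SLOT, BAR_SLOT = 0, 1

def animal_foo(self): return "generic foo"
def animal_bar(self): return "generic bar"
def dog_foo(self):    return "dog foo"

ANIMAL_VTABLE = [animal_foo, animal_bar]
DOG_VTABLE    = [dog_foo, animal_bar]   # same slot layout as the parent

class Obj:
    def __init__(self, vtable):
        self.vtable = vtable            # hung off each object

a, d = Obj(ANIMAL_VTABLE), Obj(DOG_VTABLE)

# "b = a.foo" compiled against a known class layout becomes an indexed call:
print(a.vtable[FOO_SLOT](a))   # generic foo
print(d.vtable[FOO_SLOT](d))   # dog foo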

Good optimization, however, doesn't require the source to be statically typed, it just requires the parsed representation of the source to be typed. (Type information reduces the uncertainty in a program, and uncertainty is the enemy of optimization) It's possible to have a dynamically or minimally typed source language that still has some type information made available to the optimizer--this is one of the places where type inferencing is really, really handy.

Why the fights?

Well, bluntly, because some people like fighting, and this is a good excuse. That's not a helpful answer, though.5 The battle is generally between the tinkerers and the mathematicians. Tinkerers (and hackers in the classic sense fall into this group) just like to get going, fiddle around, do their thing, and have the compiler get out of their way, while the mathematicians want the rigor and compile-time guarantees that a strict type system provides.

Which is the right answer? Well... it depends, of course. Depends on the people involved, as the differing styles of typing appeal to different people. It also depends on the problem, since some problems require more stricture than others. (I, for one, want the code running a plane I'm in to have as many checks run on it as possible) And it depends on the performance requirements you have, since the more statically typed a language is the faster the code can be. (Not necessarily will be, of course, but the less uncertainty the optimizer has the more it can optimize) There really isn't One Right Answer. That'd be too easy. :)

1 Yes, a type could be "even numbers between 1 and 6" or "words that don't have the letter Q in them", though value based type systems are reasonably uncommon.
2 That's picking nits, and those "things" are really variables. But this isn't the place to get into the differences between names, variables, and values.
3 You have to have a really good set of unit tests to catch this stuff in development or testing and, while unit tests are handy, they're sometimes difficult to get to 100% coverage, and can't always check things right in really complex systems. Not that I dislike unit tests--I think they're great. They just solve a different, albeit partially overlapping, problem domain than types do.
4 For a very restricted definition of correct. Arguably anything's better than nothing.
5 Well, OK, it actually is a helpful answer, but for a more general problem.

Posted by Dan at 03:25 PM | Comments (16) | TrackBack

October 11, 2003

What the heck is: A string

And no, it's not nearly as stupid a question as it may seem.

This time out, we're going to talk about string data, what it means, how it's interpreted, and odds are how much stuff you've never had to think about with it. Strings are remarkably complex things. Remarkably confusing, too, if you're mono-lingual in a language with a simple base writing system. (Like, say, me)

A "string", for our purposes, is a series of bits that represents some piece of text in a human language. What you're reading now is text, as is the stuff in your phone book, the heiroglyphics on Egyptian tombs, and the chinese characters for "Ignorant American" embroidered on that faux-Tao t-shirt you like to wear. (Important safety tip--never get a tattoo in a language you don't understand. "Baka" is not Japanese for "good fortune") Strings are how this text is represented inside a computer.

What makes strings potentially complex is the wide range of ways that human writing systems have developed over the millennia. Writing has generally been governed by the capabilities of pen or brush, ink, and paper, which are very different than the capabilities of computers. Because of this, there are writing systems with optional or semi-optional notational marks on characters (as in most western European languages), writing systems with thousands of unique characters (such as Chinese), and writing systems where the shape of characters varies depending on preceding and succeeding characters (like Arabic). Even if we skip how the strings are actually rendered on screen or paper (which takes away the issues of having enough dots to legibly render those thousands of characters, or figure out which direction the text should go, or what shapes the letters should be based on their context) there are plenty of issues to deal with.

Computers, of course, don't handle stuff like this at all well. They're best suited for linear sequences of bits and bytes, the simpler the better. Because of this, many different schemes for representing text have been developed over the years. These schemes have been shaped both by the complexity (or simplicity) of the writing systems they represent and by the limitations of the machines when the schemes were invented. All the schemes have common characteristics, though, so we can talk about them all in the abstract, which is nice.

Computers, ultimately, represent text as a series of abstract integers, and those abstract integers are represented on disk or in memory as a series of bits. The integers are generally called characters, with all the characters for a language taken together called a character set. Because people tend to think that individual characters represent single printed things, Unicode prefers to call them code points, and so do I, so we will. Just be aware that in some circumstances a single code point isn't enough to figure out what glyph should be printed.

Glyphs, on the other hand, are the printed representation of the smallest unit of text in a language. You normally need one or more code points to figure out what a glyph is, though it's not always quite that easy. (Arabic seems particularly odd in this respect, where multiple code points translate to multiple glyphs, but it's not an N code point-> 1 glyph transformation, more an N code point -> M glyphs)

Finally, how those bits and bytes get turned into abstract integer code points is called the encoding. You might think that each individual character set has its own encoding, or that each of those abstract integers is represented by the same number of bits on disk, but you'd be wrong there--life's not nearly that simple.

So, we've got an encoding, which tells us how to turn bits to code points, we've code points, which are abstract integers, and we've glyphs, which are the smallest unit of text in a language, more or less. With those things we have enough information to turn bits into text on the screen.1

Before we go further and really confuse things, let's take a look at a simple example you're probably very familiar with--ASCII. ASCII has a very simple encoding. Each code point is exactly 8 bits, of which 7 are meaningful and one is padding. (Yes, technically bit 8 may be used for parity, but it's been a long time since anyone's cared even on serial lines where it was mildly useful) All code points fit nicely into an 8-bit byte, or octet.2 We'd call this an 8-bit, fixed-length encoding. (I like those, personally--they're simple)

Translating the code points to glyphs is straightforward enough, and you're probably familiar with the ASCII character table. You know the one, it's a big grid with the characters and their numeric values on it. A is 65, B is 66, and so on. Now, there's actually a lot of hidden semantics in there, but we'll deal with those later. It's important to note that, when people talk about ASCII, they're talking about both the ASCII character set and its encoding.
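In Python terms (just as a quick illustration), the round trip for ASCII looks like this:

data = b"ABC"                 # three octets on disk: 65, 66, 67
print(list(data))             # [65, 66, 67] -- the code points
print(data.decode("ascii"))   # ABC -- the characters those code points map to
print(bytes([72, 105]))       # b'Hi' -- and back the other way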

EBCDIC is another encoding/character set combo you're probably familiar with, or at least heard of. Like ASCII, it uses an 8-bit fixed-length encoding, but the letters and numbers map to different integers. A is 193, B is 194, and so on. (More or less--there are a number of EBCDIC character sets) Once again, reasonably simple and straightforward.

A third encoding/character set combo is RAD-50. (A blast from the past for you) This is a set that has 50 (octal) characters in it, and it packs three characters into a 16-bit word. It's a restricted character set, having only the 26 English upper-case letters, the digits 0-9, $, _, and two other characters that I forget at the moment. Each character takes 5 1/3 bits, more or less, and there's no way to tease out the characters without doing some math on the word.3 The character set, then, is the integers from 0 to 39 (50 octal is 40 decimal), while the encoding requires doing some multiplication, division, and/or modulus depending on which character in the stream you were dealing with, and decoding the characters is position-dependent. RAD-50, honestly, is a pain.

Variable-width encodings are ones where there isn't a fixed number of bits. (RAD-50, oddly, is a fixed-width encoding. It just isn't encoded with an integer number of bits...) There are a lot of different ways to handle variable-width encodings, but the two common ways are to pack the data into a series of bytes and to use escape-bytes or marker bits to indicate that you should keep on going.

Escape bytes are bytes which, when they occur, indicate that the byte that follows is part of the character. That means some code points are one byte, some are two. (In really extreme cases some are three) The escape bytes note that the byte following is part of the code point. There may be one escape byte in the set, in which case you get 511 code points, or N escape bytes in which case you get (256-N) + (256*N) code points. And a fair amount of inconvenience, but that's secondary. Most encoding/charset combos that have escape characters start out with a base character set (usually, though not always, ASCII or an 8-bit extended ASCII) and make all the unused code points escape code points.4 For example, with Shift-JIS (one of the ways to encode Japanese characters) bytes 0x80-0x9F and 0xE0-0xEF are escape bytes, and indicate that the following byte is part of the code point.

Marker bits are similar, but rather than saying "Codes x-y indicate an escape byte", you say "if bit(s) x (and maybe y) are in some pattern, the next byte is part of the code point", and you build up the final character value from parts of the bytes. Unicode's UTF-8 encoding does this--with it you can encode integers of up to 32 bits in a series of bytes, from 1 to 6 bytes, depending on the integer you're encoding. (The bit encoding's a little odd--if you're interested, this is documented in chapter 3 of the 3.x Unicode standard)
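You can watch the variable-width behaviour from Python easily enough--the same abstract code point costs a different number of bytes depending on how big its integer value is:

for ch in ("A", "é", "中"):
    encoded = ch.encode("utf-8")
    print(ch, hex(ord(ch)), len(encoded), "byte(s):", list(encoded))

# A  0x41    1 byte(s): [65]
# é  0xe9    2 byte(s): [195, 169]
# 中 0x4e2d  3 byte(s): [228, 184, 173]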

So, anyway, fixed-length encoding and either escape-byte or escape-bit variable length encoding. That's how you turn integers to bytes in a bytestream, or vice versa. Personally I prefer fixed-length encodings, though there are good reasons to use variable-width encodings in data files. (Like, for example, the fact that pure 7-bit ASCII text is also valid Shift-JIS or UTF-8 encoded data. And JIS-X-20x or Unicode characters. But that's in a minute)

Once you run the bytestream through the encoding you get a stream of code points--these abstract integers that more or less represent individual letters/symbols/iconographs/whatever. Is that enough to get a glyph you can display on screen? No! Of course not, it's never that easy. Or, rather, it's not that easy for some sets, and is that easy for others. For many character sets, usually the single-language sets such as ASCII or the JIS sets, there's a one-to-one mapping of code points to glyphs, and each glyph has a unique code point assigned to it. For the multi-language sets, especially Unicode, things are more complex. Since Unicode's the common multi-language set, we'll take that one specifically.

Unicode took on the task of building a single character set that encodes all the world's5 written languages. This is a massive task, made more massive by the fact that some written languages don't really have the concept of single, individual glyphs, and others build a single glyph out of multiple parts. (Hangul and some of the Indic scripts apparently do this) Actually having all the legal combinations in the character set is infeasible6 so Unicode introduces the concept of combining characters. These are modifier characters that alter the code point that precedes them in the stream of code points, making that character different.

A nice, simple example is the accent character. This is a mark that indicates where the stress goes in a word, like the one over the i in naíve. Unicode has a combining accent character that puts an accent over the previous non-combining character. So for an accented i, the sequence (in Unicode character names) is LATIN SMALL LETTER I, COMBINING ACUTE ACCENT. Just for fun, Unicode also has a single character, LATIN SMALL LETTER I WITH ACUTE, that also represents í, which can make for some fun. The two sequences are, according to the Unicode standard, identical, which leads to some interesting code, but Unicode deserves its own WTHI entry so we won't go into any more detail here. Just be aware that some glyphs on screen may be composed of multiple code points, and sometimes there are multiple ways to represent the same glyph that should be treated identically.
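Here's the í example from Python's point of view: the two spellings compare unequal as raw code points, and normalization (via the standard unicodedata module) is what makes them match.

import unicodedata

single   = "\u00ed"     # LATIN SMALL LETTER I WITH ACUTE
combined = "i\u0301"    # i followed by COMBINING ACUTE ACCENT

print(single, combined)                                   # both display as í
print(single == combined)                                 # False -- different code points
print(unicodedata.normalize("NFC", combined) == single)   # True -- same after normalization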

Note that you can also mix and match encodings and character sets. As long as the abstract integers that an encoding encodes are large enough to hold the code points for whatever character set you have, you're fine. That means that you could encode JIS-X-20x characters using UTF-8, or ASCII characters in UTF-32. People generally don't, mainly for historical reasons, but there's no reason not to. (This can actually simplify things if you choose to use UTF-32, which is just a 32 bit fixed length integer encoding, for all your data regardless of what character set it comes from, but that's up to you)

So, anyway, we can go from bits on disk to glyphs on screen. Good enough for strings?

Hah! Not by a long shot. Would that it were so easy. And no, Unicode doesn't solve all the problems.

Just being able to display a string, or at least puzzle out the glyphs in the string, isn't enough to properly work with a string, at least not programmatically. For example, how do you uppercase the first letter in élan? Is it Élan, or Elan? And how does it sort? Does é sort before or after a plain e? And if you have a string of Chinese characters, where are the word breaks?7 The answer is... it depends. It depends on the language the string comes from, because different languages have different rules. Chinese, Japanese, and Korean all use Chinese characters, but how they use them differs, and where the word breaks fall varies. What happens to accented characters depends on which European language you're working with.

Sometimes you can ignore language issues--for example any sort of binary string operation will likely just choose one rule. (Otherwise how would you decide which ordering rule to use if the two strings you're comparing have different and conflicting rules?) Other times you can, but really shouldn't, ignore the rules. When uppercasing a string, for example, in those languages where there's even a concept of case, you should respect the rules of the language the string came from.

So, to truly handle string data, your program needs to attach an encoding, character set, and language to each string, or at least have a set of defaults. (Often in multilingual programs everything is transformed to the Unicode character set with either a UTF-8 or UTF-16 encoding, though transforming to Unicode's not always lossless, depending on the unicode version)

What does this all mean to you, the programmer? Well, that depends. If you're working with text in a single language, not much, as you've got a single set of rules that you likely never even thought about. Even if you're using Unicode (because, say, you're doing XML work) it still doesn't necessarily mean much, because even though you're using Unicode you're probably only encoding a single language, or at worst encoding text from multiple languages but not manipulating it so it doesn't matter. If you're doing real multilingual text manipulation, though, it means there's one more attribute to text than you probably thought and, while you can sometimes ignore it, you can't always ignore it.

After all, this is text in human languages, and the computer ought to do what we need, rather than us doing what the computer needs.

1 Note that this isn't sufficient to manipulate the text. We'll get to that later.
2 Yes, technically a byte might be more or less than 8 bits, but when was the last time you worked on anything else?
3 This, by the way, is why DEC operating systems traditionally did text in multiples of 3 characters with a very restricted character set--everything's RAD-50 encoded for space reasons. Seems silly now, but when your machine has only 128K of memory total for a multiuser system, every bit counts.
4 And if they run out of escape characters and still need more extra characters, they start yanking things out of the base set and throwing them into an extension plane.
5 And some other worlds' written languages. cf. Klingon and Tengwar. Yes, I know Klingon's been rejected and Tengwar probably will be but, like so many other things rejected by the sensible, de facto wins...
6 Combinatorial explosion soon gets you a character set with a few hundred billion characters. And while it's possible to have a 64-bit range for characters, well.. that's just a little nuts. If nothing else, imagine the size of the printed Unicode standard with full character sets!
7 Word breaks are a fuzzy enough thing in languages that use chinese characters anyway--you've got to use a dictionary, some heuristics, a bit of luck, and some random numbers some days.

Posted by Dan at 02:32 PM | Comments (8) | TrackBack

June 26, 2003

What the heck is: A tail call

Tail calls, and their more restricted brethren tail recursion, are one of those clever optimizer tricks that were invented many years ago, then promptly forgotten by most of the mainstream language implementors. (Naughty them!)

Consider, for a moment, this simple function:

sub foo (int a) {
  a += 15;
  return bar(a);
}

Not a big deal at all--takes a parameter, adds 15 to it, then returns the result of calling bar on the new value of a. The important thing to notice here is that the very last statement calls a function then immediately returns the result of that function.

Now, think of how code's generally generated. When you enter a function, some stack space is allocated. When you exit from that function, the stack space is deallocated and a value is generally returned. So the sequence of events looks like:

STACK ALLOCATE
(do stuff)
call function
    STACK ALLOCATE
    STACK DEALLOCATE
    RETURN WITH VALUE
STACK DEALLOCATE
RETURN WITH VALUE

There are several interesting things here. If you look at it, the stack space that foo has isn't used after the bar function is called. While this isn't a huge amount of space, it is space, and space that isn't needed. If it was freed before bar was called, your program would be a little more space-efficient. For one function this isn't a big deal, but imagine what happens if you're 200 levels deep in a call (perhaps because there's some indirect recursion) and 100 of those levels look like return somefunc();. That's 100 chunks of stack that are being used for no reason. If your stack only had space for 199 chunks, not using those 100 chunks of stack would mean the difference between your program running and your program crashing.

There's also that pesky "return what I just got" action. In our example all foo does at the end is collect the return value from bar and then immediately return it. The act of collecting and returning the value is essentially wasted time. If somehow foo could make itself go away so bar returns its value to whoever called foo, we'd save the time it took to do that extra value collection and return. Is this a whole lot of time? No, not really, unless there's something really bizarre about the language you're using. But even if it's only a few microseconds, over thousands of calls it can add up.

So, what can you, or rather your language compiler, do? Well, what it can do is turn code of the form return somefunc(); into the low-level sequence pop stack frame; goto somefunc(); In our example, that means before we call bar, foo cleans itself up and then, rather than calling bar as a subroutine, we do a low-level goto operation to the start of bar. Foo's already cleaned itself out of the stack, so when bar starts it looks like whoever called foo has really called bar, and when bar returns its value it returns it directly to whoever called foo, rather than returning it to foo which then returns it to its caller.

This, simply, is a tail call. Tail because it has to happen as the very last operation in a function, or as the function's tail operation. Call because that last operation is calling a function. Making tail calls can sometimes (though not by any means always) require support from your platform's calling conventions, linker, and runtime library. It also needs a non-trivial amount of effort in the compilers that support it.

There's also a more restricted version, called tail recursion, which only happens if a function, as its last operation, returns the result of calling itself. Tail recursion is easier to deal with because rather than having to jump to the beginning of some random function somewhere, you just do a goto back to the beginning of yourself, which is a darned simple thing to do. That means code that looks like:

sub foo (int a, int b) {
  if (b == 1) {
    return a;
  } else {
    return foo(a*a + a, b - 1);
  }
}

gets quietly turned into:

sub foo (int a, int b) {
  label:
    if (b == 1) {
      return a;
    } else {
      a = a*a + a;
      b = b - 1;
      goto label;
    }
}

What was a recursive function, chewing up lots of stack gets turned into a loop that chews up just one stack frame. And the compiler can do it as a simple under-the-hood transformation that needs no special knowledge of any other function's internals, or special function entry points or anything. This is a much easier piece of source transformation than full-blown tail calls for many languages, so it's more likely (though still not common, alas) to be implemented in a compiler.

Now, if you were taught to program using one of the more common procedural or OO languages, you were probably warned to avoid recursion because it was very easy to blow your stack out and die, which it is. If your compiler supports at least tail recursion (if not full tail calls) though, it means some of the recursive algorithms that used to be unsafe are now perfectly safe, since you're not allocating a chunk of stack for each level of recursion, which is a Good Thing. This can make some problem solutions easier, since you don't have to do the goto-style transformation yourself. (Especially since everyone these days seems to have a pavlovian-conditioned aversion to gotos--a shame since they're quite useful in a limited range of places--so you'll almost inevitably do bizarre contortions with infinite loops or something equally awkward)
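CPython, for example, doesn't do tail call elimination, so there this is a transformation you still make by hand--here's the earlier function sketched both ways in Python; the recursive version will blow the stack for a large enough b, the loop version won't:

def foo_recursive(a, b):
    if b == 1:
        return a
    return foo_recursive(a * a + a, b - 1)   # a tail call, but still a new stack frame

def foo_loop(a, b):
    while b != 1:                # the same logic, hand-converted to a loop
        a, b = a * a + a, b - 1
    return a

print(foo_recursive(2, 5) == foo_loop(2, 5))   # True, but only one of them scales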

How big a deal are tail calls and tail recursion? In a language like Lisp where just about everything requires a function call, and it's really easy to have a dozen or more functions in progress just in your code alone (not to mention the library functions your code calls, and the library functions they call) it makes a huge difference. Doing tail call optimizations are also really easy for Lisp, just because of the way the language works and is structured. Once you know about tail calls, it just falls out of the compiler, which is cool.

In procedural and OO languages the benefits are a bit smaller, since there are generally fewer function calls and more code executed between function calls, and many of the returns don't return just the result of a function call, but even then it does make a difference. Stack frames for procedural languages are usually larger than Lisp frames, since you have more stack-based variables--a C or Java stack frame may well be 10 or 20 times larger than a Lisp stack frame (since C and Java have stack-allocated variables, something Lisp generally doesn't have), so even if you can do a tail call optimization 5% or 10% of the time you'd save as much as you would for an entire Lisp run, give or take a lot of statistical handwaving. They probably won't happen that often (I'd bet the number's more like 1% or .1%) but still, a win's a win.

Is there a downside to tail call optimization? Well... no, not really. The only place it causes a problem is if you have a language that supports introspection sufficiently to allow you to inspect the call stack, as you'd not see any of the functions that tail-called out of themselves, but this isn't generally a big deal--the space and runtime wins are considered worth the hit to introspection capabilities. They also put a bit more complexity into a compiler, which is likely already a pretty damn complex piece of code, so there's a burden on the language implementor, but if you think about it at the beginning it's not too big a deal to implement. (And if you have a continuation passing style of calling functions, it turns out to be essentially free, which is really cool, though the topic of another WTHI entry)

So, tail calls. Your friend if you're a programmer, and a darned useful thing to implement as an optimization if you're a compiler writer.

Posted by Dan at 01:48 PM | Comments (3) | TrackBack

June 06, 2003

What the heck is: Garbage Collection

And, more importantly, why should you really care?

Nearly all programs create garbage as they run1, and nearly all programs clean up that garbage in one way or another. Garbage Collection is what we call it when the cleanup is done automatically. The automatically part is what makes garbage collection interesting and useful.

In a language like C, you can explicitly allocate memory from the system with the malloc function. It takes a size, and returns you back a pointer to the newly allocated memory. When your program is done with the memory, it calls the free function, passing in a pointer to the memory that was allocated, and the memory is given back to the system. Nice, simple, straightforward. And entirely manual.

The big drawback to the manual system is that if you make a mistake, Bad Things Happen. Freeing memory too soon means your program has a pointer to memory it no longer owns. Using that pointer will cause your program to crash as it tries to access memory outside its region. Or, worse yet, access memory that's been recycled and used for other things. Freeing memory too late, or not at all, means that your program leaks memory, leaving chunks of unused and unusable memory lying around. This is less of a problem, as at least you won't crash, but it will mean the system eventually runs out of memory to hand out, and your program will fail.

How do you then track allocated memory to make sure that you hold onto it as long as necessary, but no longer? The way a lot of programs do it is with something called reference counting. Each chunk of memory has an integer counter attached to it. Each time your program makes reference to the memory (generally by storing a pointer to it somewhere) the count is incremented. Each time a reference is deleted the reference count is decremented. When the reference count goes to zero, the memory is freed.
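CPython happens to manage memory this way, so you can watch the counter from inside the program via sys.getrefcount (which reports one more than you might expect, since the argument passed to it is itself a temporary reference):

import sys

data = [1, 2, 3]
print(sys.getrefcount(data))   # the name 'data' plus the temporary argument

alias = data                   # storing another reference increments the count
print(sys.getrefcount(data))   # one higher now

del alias                      # deleting the reference decrements it
print(sys.getrefcount(data))   # back where it started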

Unfortunately reference counting has a number of problems associated with it. First, of course, your code needs to fiddle with the reference count properly, and everywhere. This is, in a large program, potentially error-prone--memory may live forever, or get deleted too soon. Both are bad, leaving you with the leak or segfault problems. It also clutters your code with refcount fiddling calls and, even if you reduce them to macros or function calls, it's more code to wade through when you're writing and debugging the code.

The other big problem is the runtime cost. All your memory allocations need to be bigger, to hold the reference count. Your code is doing a lot of math and, while it's simple addition and subtraction, it takes time. And it means your deallocation system may need to be fancy, since if a chunk of memory that gets deallocated has pointers to other pieces of memory, those pieces need their refcounts manipulated. (While this is a cost you have to pay in the manual case, a refcount-based system is usually a general-purpose thing, so the deallocator's a generic deallocator rather than a special-purpose one with deep knowledge of the data being deallocated, so it needs to think more when deallocating things)

All this is definitely a pain. (Been there, done that, got the T-shirt) It's far better than leaking memory, though, and it does have the twin advantages of being conceptually simple and very portable. It's the system that many of the interpreted languages such as perl and python use, as well as some of the compiled languages like Objective C. (Though it's as much of a function of the Objective C object library as the language itself, but that's what's in widespread use)

The alternative is real, fully automatic, garbage collection. What a garbage collector does is track all your program's allocations, then look through all your data and figure out which chunks of memory are in use and which aren't. When it finds pieces of memory that aren't used, it releases them to the system for later reallocation.

Garbage collection always starts by assuming everything is dead. It then begins with what's called the root set--a set of things that are always considered alive. Anything the root set references is marked as living. Then everything that this newly declared "living" memory references is considered alive, so then we walk through that memory, and so on. By the time we're done, anything that's still dead is unreferenced and can be safely cleaned up. In object systems that have destructors, finalizers, last-gasp methods, or other things that get called right before something gets killed, this is the time it happens.
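Here's a toy sketch of that mark phase in Python, using plain dictionaries as stand-in objects (everything here is invented for the example): start from the root set, mark everything reachable, and whatever is left unmarked is garbage.

def mark(roots):
    alive = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in alive:
            continue                         # already marked as living
        alive.add(id(obj))
        stack.extend(obj.get("refs", []))    # follow this object's references
    return alive

a = {"name": "a", "refs": []}
b = {"name": "b", "refs": [a]}               # b references a
orphan = {"name": "orphan", "refs": []}      # nothing references this

live = mark(roots=[b])                       # b is in the root set
for obj in (a, b, orphan):
    print(obj["name"], "alive" if id(obj) in live else "garbage")
# a alive, b alive, orphan garbage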

Garbage collection is a nice thing. Your main program code doesn't have to worry about tracking memory, since the garbage collection system handles it for you. That means less bookkeeping code to write, and almost as importantly less bookkeeping code to execute. Your code allocates memory when it needs to, and puts references to that memory wherever it wants. There's no tracking, counting, or reference work to do.

A garbage collector is not a simple thing to write, not by any means. They're very system specific in places, and efficient collectors are generally complex, low-level, nasty pieces of code. That's the big downside. The garbage collection code is usually small and isolated, however, which offsets this to some extent. While you still need to understand it, only one or two people need to understand it to work on it, while the rest of the people working on the project don't have to worry about it at all. Memory allocation bugs are also isolated to a small section of your code--the memory allocator and the garbage collector--rather than scattered across the entire code base.

Luckily there are a number of pre-made garbage collection systems available free for use, such as the Boehm collector, that already handle all the nasty bits. You plug them into your program and they just work. Pre-made collectors tend to be more conservative and a bit slower than a hand-rolled custom collector, as they can't have as much knowledge about your program as you do, but the advantages involved in using a mature code base and just not having to bother are significant. While personally I like writing GC code, if you don't I'd recommend grabbing the Boehm collector.
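
For a sense of how little work "just plugging it in" is, this is roughly what using the Boehm collector looks like from C: swap your allocation calls for the collector's and stop calling free. (Check the collector's own documentation for the exact incantations on your platform; this is the common case, not a guarantee.)

  #include <gc.h>      /* the Boehm collector's header; link with -lgc */
  #include <stdio.h>

  int main(void) {
    GC_INIT();                     /* set the collector up              */
    for (int i = 0; i < 1000000; i++) {
      char *p = GC_MALLOC(64);     /* allocate from the collected heap  */
      p[0] = 'x';                  /* use it...                         */
    }                              /* ...and never free it--the garbage */
    printf("done\n");              /* gets collected behind your back   */
    return 0;
  }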

The big downside to garbage collection is indeterminacy. Your program holds on to memory longer than it otherwise might, can't be sure of how long an allocation might take (since allocating memory may require firing off a garbage collection run if your program runs out), and can no longer predict exactly when some resource that requires active destruction may be cleaned up. While a number of the garbage collection schemes have bounded worst-case memory usage and collection times, their worst case performance (either in memory usage or pause) is generally worse than a refcounting scheme.

Interestingly, depending on your program's allocation patterns, a garbage collection system can have better average and best case performance than a refcounting system. (Both of which are beaten by a manual tracking system's best case performance, though the complexity of a fully manual system generally makes it unsuitable for large systems just because it's such an enormous pain to get it right) A GC system doesn't need to allocate a counter word per allocation and doesn't need to twiddle that counter word, and a GC's code is normally more cache friendly.

Refcount schemes can, of course, have lengthy pauses of their own--freeing up the root node of a structure with a million or more elements will make your program stall. A refcounting system also spends significantly more overall time doing memory management tasks, usually twice as much (except in the most pathological cases) so a program with a real GC will generally run a bit faster than one with a refcount system. Not hugely faster, since your program ought not be spending all that much time in the memory allocation system, but every percent or two counts.

[1] I do actually know of a few non-trivial programs that, when running, do not allocate any dynamic memory at all outside a very limited, and tightly bounded, bit of stack memory. (And I could probably, with work, put my hands on a few that don't use stack, either) They're quite rare, and somewhat special-purpose, but they do exist.

Posted by Dan at 06:37 PM | Comments (13) | TrackBack

June 03, 2003

What the heck is: multimethod dispatch (MMD) good for?

In the last entry, modulo a rant or two, I explained what multimethod dispatch is, but I didn't really make a good case for what it's useful for. The comments on the entry gave some examples, but I think it warrants a bit more explanation.

I should start by noting that MMD is one of those features that you will use in a few very specific circumstances. This isn't the equivalent of addition we're talking about here (though MMD does get used under the hood as part of addition). You'll either need it in a few well-defined places and use it, or you won't. If you don't, well, that's fine.

MMD is used in those places where you have a method that can take multiple types of things, and want to do something based on information in the inputs. Generally this means we check the types of the data, and do things differently depending on what types are passed in, but MMD isn't required to be limited to this. More on that in a bit.

Let's, for a moment, ponder a well-designed OO GUI system. (Yes, I know, but this is a thought experiment. Suspend your disbelief for a moment and follow along :) This system has an object for each GUI element, as well as a fully OO based event system. Even better, this system has typed events, so each event that you can get is a subclass of the generic Event class. Key presses, mouse clicks, window moves, resolution changes--they all have their own object type.

Your program, like so many other GUI systems, just spins its wheels, forever grabbing events and dispatching them off to do... something. Whatever's appropriate. In our case here, let's assume that each event comes with the GUI element the event applies to, just to make things easier. (The system tracks which GUI element each thing happened to, since it's a well-designed GUI system) The main loop of your program may look like:

  while (1) {
    ($widget, $event) = GetEvent();
    $widget->handle_event($event)
  }

How do you process the event? Well, your widget classes all have code that looks like:

  class Window {
    method handle_event(Event $event) {
      switch($event->type) {
        case KeyPress {
        }
        case MouseDown {
        }
        case SingleClick {
        }
        case DoubleClick {
        }
      }
    }
  }

(Keeping in mind that I'm mostly making up syntax here) The handle_event method for each GUI element type needs to check the type of the event and do something based on that event type. This is what happens now, though you generally aren't lucky enough to have the event actually come in as a typed object, instead getting an event structure or something. Still, we're pondering an OO GUI system, and the type of an object is as good a place as any (or better) to put the event type info.

Anyway, big switch statement. Ick. Maintenance pain, definitely, and one big glob of code. How could MMD make this better? Well, the class would then look like:

  class Window {
    multimethod handle_event(KeyPress $event) {
    }
    multimethod handle_event(MouseDown $event) {
    }
    multimethod handle_event(SingleClick $event) {
    }
    multimethod handle_event(DoubleClick $event) {
    }
  }

Rather than one big method, we have one method per type. This gets us a number of wins.

First, of course, the system handles dispatching, rather than our code. That's less code for us to have to write (albeit only a little less) and, more importantly, the system is likely far better optimized for fast dispatch than we can be, because the system is in a position to cheat unmercifully. We, as application programmers, have to play by stricter rules.

Second, we've isolated the code for each event type into separate methods, which makes maintenance a bit safer. Twiddling code for one event doesn't risk the code for the rest. (Not a common problem, but a problem nonetheless) There is the potential for code duplication, but that's not usually an issue, since often the code to handle each event type will be in separate subroutines anyway.

Third, we've probably eliminated a sub dispatch. While simple event handling code can be put, with some risk, in the body of the switch, odds are if there's more than a few lines needed to handle it the code will be factored out to a separate subroutine. That means two dispatches for each event, while the MMD form has a single dispatch. (Not a huge savings per call, granted, but if we're in an event-heavy system efficiency translates to responsiveness)

Finally, it's easy to add support for new event types after the fact, and in classes we don't have access to. The window class may well be provided by the system, with a default event handler that takes plain Events and quietly discards them. The system instructions could well be "to handle an event of type X, add in a MMD handle_event that takes an event of the type you want to handle", or "To override default behaviours, add a handle_event that takes an event of type X in your Window child class". This provides a nice system--you get a fallback handler automatically, along with default behaviors the system provides (likely with full and evil access to the internals of the system) without having to explicitly redispatch. You want to handle an event of type X? Great, handle just that one. You don't need to have a generic handle_event that dispatches to the parent handle_event for everything you don't understand. (You can, but that's error prone. We don't like errors)

Another place that MMD is used, as I talked about before, is in operator overloading. This is a place where being able to add in options to the dispatch table is extraordinarily useful. Without MMD, dispatch for overloaded operators is traditionally done based on the type of the left-hand side. That's fine if your object is on the left and it knows about the type of the object on the right, but less fine if your fancy object is on the right, or doesn't have a clue as to what to do with the object on the right. (Not that MMD will guaranteed save you, but at least there's an option)

Type based dispatching also has a nice property of allowing for closest-match lookups. If you have any sort of inheritance hierarchy, at some point you'll come across the case where no method exactly matches the types in the signature. In that case, what do you do?

You could just bail, of course. Fall back to the default method, or throw an exception, or otherwise pitch a fit. But that's not tremendously helpful. What most MMD systems do, then, is to find the closest match. This is done reasonably simply if you consider each level of the inheritance tree a unit of distance. If there's an exact match, it's 0 units away. If you have a Foo, and Foo inherits from Bar, then Foo is one unit from Bar. What you do, then, is look at each of the possible versions of a method and find out how 'close' you are to it. If your types all match, you're 0 units away, and you've got the method. If not, then you figure out how far away you are. This can be done simply, by figuring out how far off each parameter you have is and adding the distances up, or treating the parameters in the declared methods as defining a point in N-dimensional space and figuring out how far you are from each possibility. The former's easier, the latter is more correct, so which you choose to implement depends on how comfortable you (or the people implementing the engine) are with distances in N-dimensional spaces. Luckily, for speed reasons, the distance table can be completely precalculated, so the engine doesn't have to throw in the speed hit of transcendental functions in each method dispatch. When multimethods are defined, yes, but that's pretty rare relative to the number of times they're used.
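
Here's a hedged C sketch of the naive "add up the distances" version, computed at dispatch time for clarity (a real engine would precalculate the table, as noted above). type_distance() is an assumed helper that reports how many parent-class hops separate an argument's actual type from a parameter's declared type:

  /* Assumed helper: hops from 'have' up the inheritance tree to 'want',
     or -1 if 'want' isn't an ancestor of 'have'. */
  int type_distance(int have, int want);

  /* How far is one candidate signature from the actual argument types? */
  int signature_distance(int *arg_types, int *sig_types, int n_params) {
    int total = 0;
    for (int i = 0; i < n_params; i++) {
      int d = type_distance(arg_types[i], sig_types[i]);
      if (d < 0)
        return -1;       /* this candidate can't apply at all */
      total += d;        /* an exact match contributes 0      */
    }
    return total;        /* 0 means a perfect match           */
  }

  /* The dispatcher runs this over every candidate method and picks the
     one with the smallest non-negative distance. */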

Earlier I said that we generally dispatch on type, but we don't have to, and that's true. Type-based MMD is done mainly in the interest of speed--it's usually very quick to check the types of your parameters and look up in a big table to see which 'real' method you want to dispatch to, but you certainly aren't limited to that. Haskell, for example, allows you to dispatch based on the values that are passed, as well as their types. This is a tremendously powerful mechanism, though one that's not often implemented. (I have a nagging suspicion that's as much because it becomes close to impossible to mathematically prove a program's correctness with this scheme as the cost, and the folks doing language research tend to be of a theoretical bent. I may be wrong, though)

So you could certainly have something that looks like:

  multi foo(int i < 5) {
    print "Smaller than 5!";
  }
  multi foo(int i >= 5) {
    print "5 or bigger!";
  }

in your program, if your language supports it.

The implementation issues are a bit of a pain, and while implementation issues shouldn't always get in the way, an explanation is generally warranted.

In a pure type-based MMD system, all we really need for each sub/method that can do MMD is an N-dimensional table, one dimension per parameter that participates in dispatch. Each type has a number assigned to it, so at dispatch time we grab the numbers, index into the table, and dispatch to the resulting method. It's quick, though in its naive implementation it can be pretty memory-hungry. (There are optimizations for that, of course) We can even take advantage of inheritance to find the best match if we don't have an exact match.
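
A hedged sketch of what that table lookup might look like in C, with invented names and a two-parameter method: each type gets a small integer id, and dispatch is nothing more than a double index and an indirect call.

  #define N_TYPES 4                   /* however many types participate */
  typedef void (*method_t)(void *left, void *right);

  /* Filled in when the multimethods are defined. */
  method_t dispatch_table[N_TYPES][N_TYPES];

  void dispatch(int left_type, int right_type, void *l, void *r) {
    method_t m = dispatch_table[left_type][right_type];
    /* a real system would fall back to closest-match or a default
       handler if the slot were empty, rather than assuming it's filled */
    m(l, r);
  }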

In a value or expression based MMD system, it's not that easy. We need to evaluate the parameters to get their values, then run through all the expressions in the type signatures until we find one that matches. This can potentially take quite a while, relatively speaking. It also requires evaluating the parameters to the function at least once for the dispatch, which may cause an unusual and potentially unpredictable number of fetches of active data. It is, though, a much more interesting dispatch mechanism in a loosely typed language like Perl.

Now, in more concrete news, Perl 6 will support type-based MMD. That's no big deal. Any language that uses Parrot's native dispatching will also get the benefits of MMD, though whether you can actually declare a sub or method as MMD in that language is a separate question. (One whose answer is "No", I expect) I don't know if Perl 6 will support value-based dispatch, but it could. Parrot, however, will. Or, more to the point, Parrot will support arbitrary fallback lookup mechanisms. The default will be MMD type-based lookup, but we could, and probably will, provide an MMD value-based lookup that could alternatively be swapped in. It's not any more trouble and, more importantly, has no particular overhead--it's just one more potential vector off an inflection point (fallback dispatch) that we need to make indirect and overridable anyway, and we can add as many different ways as we like without additional cost.

So... Even if Perl 6 doesn't do it, and Parrot doesn't provide it by default, you (or someone Altogether Too Clever) could add it in after the fact, try out some different syntaxes, and work something out for Perl 6.2. The nice thing about it is that, since MMD is already going to be in, it's not like value-based MMD will be adding in a new concept, it'll just be adding in a new variant of an existing concept, which is a much easier thing.

Posted by Dan at 02:29 PM | Comments (5) | TrackBack

May 27, 2003

What the heck is: Multimethod dispatch (MMD)

AKA Signature-based dispatch, and a number of other names.

This is a way for your program to have multiple subroutines, functions, or methods with the same name that differ only in the types of their signatures. That means you can have a subroutine foo that takes an integer, one that takes a string, and a third that takes a pointer to a structure of type bar. Same name, different parameter types. The compiler or your runtime then figures out which one to use. With statically typed languages it can be done at compile time, with dynamically typed languages it's done at runtime.

In fact, your code might look like:

  sub foo(int a) {
    print "An integer!";
  }

  sub foo(float a) {
    print "A float!";
  }

  sub foo(string a) {
    print "A string!";
  }

What happens is that, when your code calls foo(), something (either the compiler or the runtime) takes a look at its parameter and decides which of your foo subs should be called. If the parameter's an int, the first foo is called. If it's a float, the second function is called, and if it's a string, the third function is called. This can be somewhat disconcerting if you're not used to it.

Now, before we go any further it's important to note that almost all languages that let you do this either require some sort of notation on the sub or method declaration, or have multimethod dispatch as a very fundamental construct, so either you've asked for it or you're expecting it. It's not, generally, something that springs itself on you. (If you were worried about accidentally doing this)

Why would you do this? Well...

If you've worked with a dynamically typed language, you've probably written code that looks something like:

  method foo (object bar) {
    if (bar.isa("integer")) {}
    elsif (bar.isa("string")) {}
    elsif (bar.isa("Thing")) {}
    elsif (bar.isa("Widget")) {}
  }

where your method takes an object as a parameter, and at runtime you figure out what type of object it is. If you've done that, you've undoubtedly messed it up once or twice, and you've probably also used other people's code that does this and you've wanted to add a path for your object type, but couldn't.

If you've done operator overloading, you've probably also done MMD. Operator overloading (really a subject of another WTHI) is more or less when you can define how some basic operator such as addition or multiplication works for your class. Whenever you have this sort of overloading you're pretty much forced to do some sort of MMD if you want to Do The Right Thing, and that means checking the type of the other operand. Multiplication with an integer on the left and a matrix on the right does something very different than multiplication with a matrix on both sides. Or it does if you implement any sort of standard multiplication. Languages that do overloading without doing MMD make writing proper overloaded operators a pain.

Even in a language that has no concept of multimethod dispatch, it's still there, at least a little. Consider the following snippet of C:

c = a + b;

No biggie, right? Nothing multimethod there. Well... Try it. The C compiler will emit different code depending on the types of a and b. If they're both integers you'll get one set, if one of them is a float you'll get another, and if they're both floats you'll get a third. Multimethod dispatch, based on type. You just can't get to it in C.

One big benefit to multimethods is that they are generally (though not always) considered to live outside any namespace or object package, so you can add in your own foo method to any class, and it Just Works. Or, more to the point, you can add your "multiply(int, matrix)" function to the mix without having to thump the int class. (if you even can) One other benefit, in an OO language, is that generally the compiler or runtime will do the best it can. If you have a foo method that takes two strings, and instead call it with two objects that are subclasses of strings, the compiler (or runtime) will walk the inheritance tree of the arguments looking for the closest match, for some version of closest, and call that. If one can't be found, you get an error.

Now that you know what they are, there's the why question. Why do this?

Well, efficiency is one reason. You can have different versions of a routine that are optimized for the different types of the parameters. As in the C compiler case, you may want to emit different code depending on the operand types, to go as fast as possible.

Speed of dispatch is another--odds are the MMD code in the compiler or runtime is very well optimized for speed, and likely will outstrip anything you can do by hand. Even if your own code is faster, the next rev of the compiler or runtime may beat it, which is how these things often go.

Bug reduction is another. It's a lot less error prone to have one chunk of code, provided in the compiler or runtime, to figure out what routine to call based on the arguments than it is to have each of a thousand subs and methods hand-roll the code at their start. It's dull, boring code and very easy to get wrong.

Correctness is yet another reason. If you want your language to behave properly in the face of operator overloading (and you might not, depending on how you feel about overloading) you have to allow for very different behaviour based on the argument types. Matrix * Matrix does something very different than Matrix * Integer. (And different still from Integer * Matrix, but that's another story) This isn't even any sort of "I'm doing something wacky with operators!" it's just correctness.

Since multimethods generally live outside any class it makes doing proper operator overload extensions easier as well. While I know it "breaks encapsulation to inject a method into another class" you really do have to in many cases to get it right. Consider the matrix example we've been batting around. Integer * Matrix is an important case, but most operator overloading systems check the left-hand side to see what they should do. Since the Int class probably came standard, it undoubtedly has no idea what a matrix is, or what to do with it. Your Matrix class, on the other hand, does know what to do, so having the int*matrix version in the matrix class makes sense. It also means that when you create your own Foo class, which also knows about matrices (though matrix has no idea about Foo) you can have it provide code to handle the matrix case. While you can't handle all eventualities, with MMD you can at least handle more.

MMD. Interesting, useful, and in many cases what you're doing already, only in a less sucky way.

Posted by Dan at 06:13 PM | Comments (12) | TrackBack

May 07, 2003

What the heck is: Continuation Passing Style (CPS)

Before you go any further, make sure that you understand, at least a bit, about continuations. I wrote about them earlier, and there's plenty of material around the internet to confound and confuse you on the subject. (And bemuse and horrify, but that's the 'net for you, I suppose)

So, we'll assume you understand continuations. As a refresher, they're a sort of super-closure, one that remembers both variables and the call chain. When you take a continuation you get a thingie that remembers all the lexicals that were in scope when it was created, as well as remembering the call sequence you took to get there. (And if the places you were had lexicals in scope, the continuation captures them too, since they'll need to be put back in place again if you use the continuation)

Now, consider the humble function call:

foo(1, 2, 3);

What happens?

Well, in a standard system, when your code gets to the function call it does a few things. First, it pushes an address onto the stack. Then it puts the parameters either on the stack or into registers, depending on how many registers your CPU architecture had handy when the calling conventions were established. Finally it calls the function.

When the function does its thing it yanks the parameters out of wherever they were (on the stack, or in registers) and then it does what it's supposed to. When the function is done and has to return, it puts its return value somewhere (if it has one) then pops the address off the top of the stack and jumps there. No biggie.

Got that? Caller pushes the return address, the address of the instruction to execute when the function call is done, onto the stack. Caller then calls the function. Function does its thing and, when its done, removes that return address from the stack and goes there.

It's actually pretty simple, and a lot of hardware actually has built-in support for this function calling mechanism--there's the jsr (and sometimes bsr) instruction that pushes a return address onto the stack and jumps into a subroutine, and the ret instruction that pops an address off the top of the stack and jumps there. Different CPUs may have different names for these operations (and some don't have bsr, just jsr) but they do the same thing, and they make subroutines easy for the assembly language programmer. Unless you need to pass parameters in and out on the stack, in which case they get really annoying (since the parameters are beneath the address on the top of the stack) but that's computers for you.
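
If you want the mechanism laid bare, here's a toy C "machine" that fakes jsr and ret with an explicit array for the return stack. Addresses here are just indexes into the code array, which is more or less all they are on real hardware too. (This is an illustration of the calling discipline, not any particular CPU.)

  #include <stdio.h>

  enum { OP_CALL, OP_RET, OP_PRINT, OP_HALT };

  int code[] = {
    OP_CALL, 5,    /* 0: call the "subroutine" at index 5     */
    OP_PRINT, 42,  /* 2: runs after the subroutine returns    */
    OP_HALT,       /* 4: done                                 */
    OP_PRINT, 7,   /* 5: the subroutine body                  */
    OP_RET         /* 7: pop the return address and jump back */
  };

  int main(void) {
    int stack[16], sp = 0, pc = 0;
    for (;;) {
      switch (code[pc]) {
        case OP_CALL:  stack[sp++] = pc + 2;    /* push return address  */
                       pc = code[pc + 1];       /* jump to the sub      */
                       break;
        case OP_RET:   pc = stack[--sp];        /* pop it and jump back */
                       break;
        case OP_PRINT: printf("%d\n", code[pc + 1]); pc += 2; break;
        case OP_HALT:  return 0;                /* prints 7, then 42    */
      }
    }
  }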

Now, you can probably see a potential issue here. "What," you might say, "happens if I make a massively nested and/or recursive set of sub calls?" Well, bucko, you blow your stack and crash. Or scribble over heap memory, which is arguably worse. Not fun, as you might expect, though in these days of 32 and 64 bit address spaces it's not as common an occurrence in non-threaded programs as it was back in the 8-bit days, when you had maybe 48K of RAM, total. (And, of course, we had to walk barefoot five miles up hill in the snow to pop the tape with our program into the 300 baud cassette loader. Ahem. Sorry, old fogey moment) Still, it can happen, and in circumstances where your stack is much smaller, such as when running with threads (you often get only 20-30K of stack space per thread) it can be darned common. It has the twin advantages of hardware support and conceptual simplicity, though.

The stack method of calling is a bit of a pain for some uses--when you have to use the stack to pass parameters you end up twiddling what entries are where so you don't prematurely pop off the return address, or cover it with the return data.

You may be thinking "But what about lexicals? How does this interact with lexicals?" The answer is.... it doesn't. This is, remember, an old method of doing things, and it predates lexicals. (Yes, I know, it doesn't really, but it predates lexicals in common use. And no, Lisp at MIT doesn't count as common anything) Generally if you want to save your current lexical state, you push a pointer to your lexical pad onto the stack before you make a function call, and restore it after the call is done. If you're working with a language that doesn't do lexicals, as most don't, it's not a problem since they're just not around to worry about. (The problem with lexicals is that they can't be allocated on the stack if there are closures or potential closures being taken)

Continuation Passing Style, or CPS for short, is completely different. With CPS, no address is pushed onto the stack before a function call is made. Strictly speaking, the whole jsr/ret scheme isn't used at all. What happens with a simple function call like:

foo(1, 2, 3);

is a little different than what happens in the stack style. First, a continuation is taken, one that resumes just after the function call. (Remember that, in addition to the lexicals and call stack, continuations have to hold the place you are going to go if they are invoked) Then the parameters are put somewhere, often on the stack. One of the parameters, generally unspecified, is the continuation that was just taken. (This is the continuation passing part) Then we just jump to the start of the function.

When the function executes, first it takes the parameters and puts them wherever it needs to. Then it stores the continuation someplace. The function does its thing, whatever that is. When it's finished, it generally puts its return parameters someplace, and invokes the continuation that was passed in. That continuation, conveniently, puts us in the calling function at the spot right after the function call was made, with its environment intact--just like in the stack style.
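
Here's a tiny, heavily hedged illustration of that shape in C. These aren't real continuations--C can't capture its stack or lexicals, so a bare function pointer stands in for "the place to deliver the result"--but it shows the calling convention: nothing ever returns, it just invokes what it was handed.

  #include <stdio.h>

  /* Stand-in for a continuation: where the result goes instead of being
     returned.  A real one also carries lexicals and the control chain. */
  typedef void (*cont_t)(int result);

  /* A CPS-style function: it never returns, it invokes its
     continuation with the answer. */
  void add(int a, int b, cont_t k) {
    k(a + b);
  }

  void print_result(int result) {
    printf("sum is %d\n", result);
  }

  int main(void) {
    add(1, 2, print_result);   /* "call" add, passing the continuation */
    return 0;
  }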

The cool thing about using CPS for function calls is that it makes taking a continuation much, much easier. When taking a continuation, you only have to take a snapshot of the current function/sub's lexicals and control stack--you don't have to care at all about the function's caller, or the whole stack, or anything. Why not? Because when the function you were in was called, its caller conveniently took a continuation and passed it in to you. You don't have to look outside your own routine when taking the continuation because the caller already noted that stuff for you. That makes taking a continuation much faster and simpler. It can even be optimized some--you're taking note of just the info in your own sub/function/method, so the compiler knows exactly what is and isn't important, and it can ignore anything that you're not using.

CPS also makes what're called "tail calls" (and tail recursion) much easier and faster. I'm going to talk about them in a later post, but consider this:

  sub foo {
    # Do some stuff
    return bar();
  }

That is, we have a sub that does stuff and, as its last two actions, calls a function and then exits with its return value. And also remember that foo got its caller's continuation passed in. And that we pass in the continuation of where the called function should go when it finishes. Which all adds up to, in this case, foo not bothering to take a continuation at all, just passing the continuation it got on to bar. Which is a nice optimization. (And in cases of recursion, avoids allocating a continuation structure per call, which if you're looking for the millionth prime number can be a not inconsiderable savings)

Unfortunately, CPS does have a down side. Two, actually. (Ignoring the whole "Aaaah! Continuations! Make the pain stop!" issue, since they're really not that bad)

The first thing that springs to mind is return values from functions. Since invoking a continuation is supposed to put everything back the way it was--stack and lexicals--you'd wonder where the called function is supposed to put them. Can't go on the stack, as in the stack system, since we put the stack back. Generally in a CPS system there's a single return value, and it goes into a register. (Registers, generally, aren't restored) This makes returning multiple values a bit tricky, though most languages that do CPS don't do multiple return values. But, then, neither do most languages that do a stack system. CPS is a bit of a hindrance for languages that do have multiple return values, since then you need to either have a special area that isn't restored on continuation invocation, or build a little stack of return values and pass back a pointer to it as your return value. (If you're an old hardware wonk you might wonder what you'd do in the case of a system, such as the 6502, where the registers are both too small to hold the largest data element the processor can handle and are too small to hold a pointer (all the registers are 8 bits, a pointer is 16). The answer is "You can't use CPS at all easily there. But you can play M.U.L.E., so it's a net win")

The second problem with CPS is that it involves continuations. Not in a terror-inducing way, but rather in a speed hit way. The one downside to supporting continuations is that it means you can't use the stack to hold both control information and data. So things like C's auto variables can't be put on the stack, which slows things down as you need to allocate a frame for them somewhere out of heap memory. While not slow, it's definitely much slower than just incrementing/decrementing (depending on which way the stack goes) the stack pointer by a few dozen words. If you're shooting for speed, well, you just missed a bit. If that's not a problem, or the semantics of the language you're using requires the heavier-weight variable allocation system (or, like Lisp, has so few variables in use that it doesn't matter much how long it takes to allocate them) it can be a useful win.

Still, in a register-rich system CPS is definitely pretty darned cool. I'll admit, after this, I'm almost tempted to shift Parrot over to a full CPS system. Almost....

Posted by Dan at 09:30 AM | Comments (4) | TrackBack

May 04, 2003

What the heck is: The pending topic list

Figured I'd add this as a list of the topics I've got on the list to talk about at some point. If there's something you'd like that isn't here, add it in as a comment, please.


  • Tied variables
  • Tail calls
  • Garbage collection and Dead Object Detection (Yes, they are separate things. Really)
  • ASTs and Interrupt routines
  • Async I/O
  • Threads
  • Registers
  • Exceptions
  • JIT
  • CPS (Continuation Passing Style)

Not, mind you, necessarily in this order. The discussions will be both an explanation of what the thing is as well as why you'd use it. (Which is why threads are on the list, for example, though folks probably know what they are)

Standard "I'm a grumpy engineer and hardware guy" perspective applies, of course. :)

Posted by Dan at 04:28 PM | Comments (10) | TrackBack

May 01, 2003

What the heck is: a coroutine

Well, OK, grammatically it ought to be "what the heck are: coroutines" but we'll make do for the sake of easy searching. (I probably ought to see about putting categories in, but I don't know that it's that important)

Anyway, what is a coroutine? Well, a coroutine is a subroutine or function that can pause in the middle and return a value. When you call it again it picks up right where it was when you paused, environment intact, and keeps on going. Coroutines are almost always closures, though strictly speaking they don't have to be--if you had a language that didn't have closures, or lexical variables, you could still have coroutines. (Though generally speaking nobody does that. You could, though, if you were so inclined) Often coroutines are called iterators or generators, but that's not strictly correct--while you can make iterators and generators with coroutines, you don't have to. Nor are coroutines limited to being iterators or generators, as some systems use coroutines to implement threading, and coroutines can make really nifty callbacks.

From the outside you can't tell if a function is a coroutine any more than you can tell that a function is a closure. Your code calls it and gets back a value (or values) and that's all that's important. (Yay, encapsulation)

Like normal subroutines or closures, the first time you call a coroutine it starts at the beginning. If a coroutine hits a return, or falls off the end of the code (assuming you're using a language that supports just falling off the end of a function) the next time you call into the coroutine it also starts back at the beginning. In this regard it's no different from any other subroutine or function. It's when you call yield (or your language's equivalent), though, that things get interesting.

yield, like return, returns both a value and control to the code that called the coroutine. What it does, though, is remember where in the coroutine's code it was, so that the next time the coroutine is called it picks back up right where it was the last time you were in the coroutine.

For example, if you have this code:

  coroutine foo {
    yield 1;
    yield 2;
    yield 3;
  }
  print foo(), "\n";
  print foo(), "\n";
  print foo(), "\n";
  print foo(), "\n";
  print foo(), "\n";
  print foo(), "\n";

prints

1
2
3
1
2
3
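
If it helps to see a bit of machinery, here's the poor man's version of that exact foo, hand-faked in C with a static variable remembering which yield we stopped at. It only handles the simplest possible case--no captured locals, no nesting, one instance--which is precisely why real coroutines need the heavier implementation discussed further down:

  #include <stdio.h>

  int foo(void) {
    static int state = 0;            /* where we "yielded" last time */
    switch (state) {
      case 0:  state = 1; return 1;  /* yield 1                      */
      case 1:  state = 2; return 2;  /* yield 2                      */
      default: state = 0; return 3;  /* yield 3, then start over     */
    }
  }

  int main(void) {
    for (int i = 0; i < 6; i++)
      printf("%d\n", foo());         /* prints 1 2 3 1 2 3           */
    return 0;
  }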

Now, you may ask "What about all those pesky parameters? What happens with them?" Well, the answer is "It depends".

Coroutines can certainly take parameters. While they wouldn't be completely useless without them, they certainly would be less useful than they might otherwise be. How languages handle parameters and coroutines can differ, though. The issue isn't passing parameters in the first time you call a coroutine, as that's easy, but rather what happens when you call a suspended coroutine with potentially different parameters than you called it the first time. Languages generally have three different things they do in this case. Let's alter our coroutine a bit:

  coroutine foo (int a) {
    yield a;
    yield a + 1;
    yield a + 2;
  }
  print foo(1), "\n";
  print foo(2), "\n";
  print foo(1), "\n";
  print foo(1), "\n";

The first way is to just ignore passed-in parameters on the second and subsequent invocations. Yeah, it's nice that you called the coroutine with 1 the first time and 2 the second time, but the coroutine was happy with 1 and it'll stick with that. If you happen to call it when the coroutine's reset then the parameters will be respected. So in our example above, it would print "1 2 3 1".

The second way is to remember what the coroutine was invoked with and create a new instance of the coroutine if you pass in different parameters. Whether the language retains multiple versions of a coroutine depends on the language--some do, some don't. Our example would print "1 2 2 3" if the language does retain multiple versions of a coroutine, or "1 2 1 2" if it doesn't. (The change from 1 to 2 as a parameter would discard the old coroutine, as would the change from 2 back to 1, if the language discards on parameter changes)

The third way is to just pass in the parameter and alter the variable that it's tied to on each coroutine entry. (This only works if you're actually using the passed in parameters directly--in the case of perl, where everyone copies them out of @_, you'd probably not see the changes because you copy out of @_ at the beginning of a sub) So our example would print "1 3 3 1".

Which way is right? Well, they all are, really. Each way has its advantages and disadvantages, both from a language comprehension standpoint and an implementation standpoint. Whoever's designing the language decides which way they like best, or which way is doable with their current implementation, or which way grinds the right ideological axe, and goes with that.

One thing to remember is that you can yield out of a coroutine no matter how deeply you're into it. Our examples are really simple, but you could be nested ten or twenty levels of scope deep in a coroutine and still yield out--when you re-invoke the coroutine you'll be dropped back where you were, ten or twenty levels deep, with all your lexicals put back in place. (If you're using a language without lexicals it makes things easier)

Coroutines are definitely very powerful and often are used as iterators, tied to some data structure where re-invoking the coroutine walks through the structure a step, or generators, where each invocation of the coroutine returns the next value in a series. It can be much easier to write both when you can retain both your variables and your position in the code, depending on the type of iterator or generator. Closures are often sufficient for simple iterators and generators, but sometimes just retaining data isn't sufficient, or if it is sufficient it's quite inconvenient to make sure you retain enough data to pick up where you left off.

The big downside to coroutines is in their implementation. To make coroutines work properly requires relatively complex calling conventions and state keeping, though not really any worse than what you might need if an engine supports continuations, so if you have those it's not much more work. It really wants chained call frames rather than a call stack and doesn't want parameters passed on the stack, amongst other things.

Implementation, though, is a mildly complex thing, and discussion is definitely a task for another day.

Posted by Dan at 04:47 PM | Comments (10) | TrackBack

April 26, 2003

What the heck is: Walking the stack

Since I seem to end up going on about a variety of less well-known computer tricks, I figure I might as well make it a semi-regular feature. So welcome to the first official entry in the "What the heck is" series. :)

These are, or will be, questions that crop up as part of Parrot development. They're the sorts of things that you might run across when writing interpreters, OS kernel code, language compilers, and other relative esoterica. These are things that nobody (well, to a first approximation, at least) does, so the concepts are at best really fuzzy for most folks, and in many cases (such as with the continuation) completely unknown. So, rather than just grumbling about the sad state of people's knowledge about low level hardware and software concepts, I should try and do something about it.

Today, we're going to talk about walking the system stack, something that comes up with garbage collection. (I should talk about garbage collection. Later. Hopefully I'll remember to make a link here)

The system stack, for folks that aren't familiar with it, is just a chunk of memory that your machine's CPU uses to hold temporary values. There's usually a CPU register dedicated to it, either by convention or as part of the hardware itself, so access to the stack is fast, something that's generally considered a good thing.

Walking the system stack is a process where you figure out where the stack starts and ends and look at all the data on it. Not necessarily changing the data, mind (as it's often write-protected, at least in part) but just looking at it. This is generally considered mildly evil, in part because there's no use you can make of the data on the stack that doesn't break encapsulation, any data hiding, or code modularity. Still, there are often good reasons to do it.

Garbage collectors (or GCs), for example, are one of those reasons. The whole purpose of a GC is to see what data is in use and what isn't. That means it has to look at the stack if there could be a reference to data there, perhaps because the language accessing the data uses the system stack for data storage. (C does this, for example, as do nearly all the compiled languages that don't support closures) You'd really hate to clean up after a variable you thought was dead because you didn't look at all the places that the variable could be referred to from.

How does one walk the stack? Well, you need to get the base of the stack, the spot where the stack is as empty as possible. Depending on the OS and CPU architecture, there may be an easy way to do this. If not, what you can do is get the address of a variable that's been allocated on the stack at the very beginning of your program, at a point where there can't possibly be anything of interest on the stack. Then, when you want to walk the stack, you just get the current stack pointer. Everything between the two is your stack data, and you just run through it like any other array of binary data. Depending on what you're looking for you may be able to cheat a bit--many systems require that stack data be aligned, so that a 4-byte pointer must be on a 4-byte boundary, which reduces the amount of data you have to look at. (And other systems don't put any requirements at all on the stack data, which makes things a bit of a pain)
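
Here's roughly what that looks like in C. It's deliberately hedged: it assumes a downward-growing stack and word-aligned data, both common but neither guaranteed, so treat it as a sketch of the technique rather than portable code.

  #include <stdio.h>
  #include <stdint.h>

  static uintptr_t *stack_base;   /* recorded while the stack is near-empty */

  /* Walk from the current top of stack back up to the recorded base,
     looking at every word along the way. */
  static void walk_stack(void) {
    uintptr_t marker;                     /* its address ~ the current top */
    for (uintptr_t *p = &marker; p < stack_base; p++) {
      /* a GC would check here whether *p looks like a pointer into its
         heap and, if so, keep the object it points to alive */
      printf("%p holds %lx\n", (void *)p, (unsigned long)*p);
    }
  }

  static void somewhere_deep_in_the_program(void) {
    int local = 42;                 /* this will show up in the walk */
    (void)local;
    walk_stack();
  }

  int main(void) {
    uintptr_t base_marker;          /* nothing interesting lives above here */
    stack_base = &base_marker;
    somewhere_deep_in_the_program();
    return 0;
  }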

And that's it. Nothing at all fancy--walking the stack is just a matter of figuring out where and how big the current stack is, and grovelling over it. A trick you should almost never do, but when you need to, well, you need to.

Posted by Dan at 05:04 PM | Comments (2) | TrackBack

March 31, 2003

Continuations and VMs

Part 3 (well, 3a, since I haven't gotten to the VM bits) in a series of "Why the JVM and .NET have issues with some classes of languages" posts. This one's about continuations.

Now, before I go any further, do note that what this bit talks about does not affect python, perl 5, or a host of other languages. They don't have continuations, which is fine. It does affect Perl 6, Ruby, Scheme, Lisp, and a few others, though I don't know that anyone's going to be porting unlambda to .NET. But, then, I didn't expect a befunge or BrainF*** interpreter for Parrot, either. Nor Threaded INTERCAL, for that matter.

Anyway, a continuation is essentially a closure that, in addition to closing over the lexical environment, also closes over the control chain. (Well, OK, control stack, but if you're doing continuations odds are it's a singly linked list rather than a real stack) CS texts generally go on about continuations being "the rest of the program" or "gotos with arguments" or suchlike stuff. If those float your boat, great--they never made sense to me.

I doubt that a single phrase is a suitable description--if it was, far more people would understand the things--so it's time for some text.

Assume, for a moment, that we have a language, like Perl (or C, for that matter) that has lexical variables. Each block has some variables attached to it, and (unlike C in practice, though not in theory) those variables are each stored in a separate data structure somewhere. A scratchpad, if you like. Those pads are linked together, so an inner block's pad has a link to the immediate containing block's pad, and so on. In this example:

{
  my $foo;
  {
    my $bar;
    {
      my $baz;
      return sub { eval $_[0] };
    }
  }
}

three blocks, three pads, and each block's pad is linked to the containing block's pad. So the pad with $baz in it has a link to the pad with $bar in it, and the pad with $bar in it links to the pad with $foo in it. Got that? Good. If we made a closure inside the inner block (like the return does--returning an evil closure but that's neither here nor there) then that returned closure object has a link to the innermost pad, and through the link chains all the pads out to the top level of pads. When we call into the closure, it has access to all the lexicals that were in scope when the closure was created.

Got that? Closures are subroutines that capture their lexical scopes. When you call them they (temporarily) restore their lexical scopes.

Now, think about how control flow works in a program. Whenever something "controlish" happens--you make a function call, an exception handler is established, a new lexical scope is put in place--a marker for that event is put on the control chain. When you leave the scope of the item on the chain you remove it. (And removing it might have some action associated with it--when the runtime removes the element for a function call it fetches the return address out of it)

So, we have this control chain. It's a singly linked list (conceptually) of control elements. We can capture it any time we want, if we've got a mechanism to make sure what we've captured doesn't get altered. (Making the chain elements copy-on-write, or just cloning them all off, both work) We also have a closure--a sub that has a chain of lexical scratchpads. And if you take that one step further, we can divorce the lexical pads for the closure from the closure itself, leaving us a chain of control and a chain of lexical variable scopes.

Now...

Imagine what would happen if we bound together that chain of lexical variables, the chain of control, and an address into one thingie, such that when we invoked that thingie we'd jump to the saved address, put in place the associated control chain (discarding the current chain), and put in place the lexicals?

Well... we'd call the thingie a continuation. And that's what it is.
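
In implementation terms you can picture the thingie as a three-field record, something like the C sketch below. The type names are invented for illustration--this isn't Parrot's (or anyone's) actual layout:

  typedef struct Pad       Pad;        /* a lexical scratchpad          */
  typedef struct CtrlFrame CtrlFrame;  /* one link in the control chain */

  typedef struct Continuation {
    Pad       *lexicals;   /* the chain of scratchpads in scope          */
    CtrlFrame *control;    /* the saved control chain (the call history) */
    void      *resume;     /* the address to jump to when invoked        */
  } Continuation;

  /* Invoking it means: throw away the current control chain, install
     cont->control and cont->lexicals, and jump to cont->resume. */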

Now, in practice continuations aren't made of a random lexical chain, control chain, and address. When you make a continuation you capture the current control chain and lexical chain, and generally the current or next instruction in the instruction stream, and they're often bound together with special function calls ("call-with-current-continuation") but they don't have to be.

One of the cool things about continuations is that, since they're supersets of closures, the variables they capture keep their values from invocation to invocation of a continuation, just like variable values persist across multiple invocations of a closure. And like multiple closures that close over the same scope, multiple continuations that share scope will see variable changes that other continuations that capture the same scope make.

And, just to top it off, someone (no, I don't know who, so this might be CS Legend) once proved that you can build any control structure you can think of with continuations. Though often you shouldn't, but that's a separate thing. Exception handlers are conceptually a form of continuation, for example. When the handler is established a continuation is taken, and when an exception is thrown you just invoke the exception handler continuation and *poof!* you're at the spot in the code that represents the exception handler, complete with its lexical scope.

There's a lot of other evil stuff you can do, too--there's no reason, strictly speaking, that the destination you jump to when invoking a continuation has to be anywhere near where you were in the program when the lexical and control chains were captured (and they don't have to be related either). Granted, doing Weird Evil Magic like this is a lot of work, but hey, if it was easy it wouldn't be fun. :)

Posted by Dan at 01:55 PM | Comments (6) | TrackBack

March 27, 2003

(closures) The reason for Parrot, part 2

Okay, the previous entry wasn't titled part 1, but...

Anyway, I promised Jon Udell more of an explanation as to why perl's not a good fit for .NET and the JVM. This applies to Python and Ruby too, in their own way, though I'm not going to speak directly for them.

This time around, let's talk about closures and their relationship to lexical variables.

A closure is reasonably simple, at least in perl terms. It's a subroutine that captures its environment. Generally they're anonymous, and generated at runtime. A sample might look like:

  sub make_closure {
    my $foo = 0;
    return sub { return ++$foo; }
  }

This sub is desperately simple, and all it does is return a reference to an anonymous subroutine that will return a monotonically increasing integer, one per invocation of the subroutine. The anonymous sub is a closure because it closes over its environment, capturing what was in scope at its creation and remembering it. Well, OK, arguably that's not desperately simple, so here's how it acts. If you had code that looks like:

  $sub_ref = make_closure();
  print $sub_ref->(), "\n";
  print $sub_ref->(), "\n";
  print $sub_ref->(), "\n";
  print $sub_ref->(), "\n";

It would print out the sequence 1, 2, 3, 4, each on its own line.

How it works is pretty simple. The first line calls make_closure, which returns a reference to an anonymous subroutine which increments and returns the value of the $foo in scope when the anonymous sub was created. The next four lines call it and print the results.

That's not the interesting part. The interesting part is code that looks like:

  $sub_ref = make_closure();
  $another_ref = make_closure();
  print $sub_ref->(), "\n";
  print $sub_ref->(), "\n";
  print $another_ref->(), "\n";
  print $another_ref->(), "\n";

because this prints out 1, 2, 1, 2. The reason this happens is that when make_closure is called, it instantiates a fresh version of $foo each time. Not too surprising, as this is what you'd expect from a sub--it instantiates the lexicals that belong inside it. The sub { return ++$foo; } part returns a reference to an anonymous subroutine, which for perl is inherently a closure, that accesses $foo. $foo is a lexical declared inside of make_closure, so since it's created anew each time make_closure is invoked, each new anonymous sub references a new version of $foo. There's no collision there.

How does this complicate, and slow down, the system?

Well, remember how variable allocation works in non-closure languages, such as C. Ponder the following C function:

  int *foo() {
    int bar = 1;
    return &bar;
  }

What this function does is allocate a variable bar, give it a value of 1, and return a pointer to it. If you've ever done this in C, or languages that allocate lexicals like C does, you know what happens--the variable whose address this returns will soon turn to mush. Why? Because of the way C handles space allocation for lexical variables.

What C does is to use a stack, generally the system stack, to store lexicals. When you enter a C function, the function preamble reserves space for all its variables by adjusting the top of stack pointer a bit. The lexical variables are all in that chunk of stack that was just reserved. When the function exits, the stack pointer is readjusted and the stack space is freed up for the next function call.

Obviously, for closures to work, it means that lexical variables, what most folks think of as stack-allocated variables, can't be allocated on the stack. Or if they are, it means that the stack has to be a series of call frames, with each call allocating a new frame. Otherwise the variables the closure captures would just keep getting stomped on.

This slows down function calls in languages that support closures. In a non-closure language, all you have to do to reserve space is twiddle the stack pointer. Often there's not even a check to see if there's enough space on the stack--if too much space is reserved you fall off the end of the stack into unmapped memory, your program segfaults, and that's that. Leaving the function means you just decrement the stack pointer and you're done. (In a language with variables that have active destructors, like C++, destructors may get called)

For a language with closures, calling a function requires actively allocating a chunk of memory from the heap. While not horribly expensive, it does cost a lot more than just incrementing the stack pointer. Freeing up the frame for the lexicals also has to use the heap freeing system, which is also more expensive than just decrementing the stack pointer. A language with closures makes garbage collection pretty much obligatory as well, though since the JVM and .NET already do this at least that's no extra expense.
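
To make the difference concrete, here's a hedged, hand-written C sketch of roughly what a compiler for a closure-capable language has to arrange for make_closure. The names and layout are invented, but the point is that the lexical frame comes off the heap so it can outlive the call:

  #include <stdlib.h>

  typedef struct Frame {               /* the heap-allocated lexicals */
    int foo;                           /* the captured $foo           */
  } Frame;

  typedef struct Closure {
    int   (*code)(struct Closure *self);
    Frame  *env;                       /* link to the captured frame  */
  } Closure;

  static int counter_body(Closure *self) {
    return ++self->env->foo;           /* touches the heap frame      */
  }

  Closure *make_closure(void) {
    Frame   *f = malloc(sizeof *f);    /* the heap allocation the stack
                                          scheme gets to skip         */
    Closure *c = malloc(sizeof *c);
    f->foo  = 0;
    c->code = counter_body;
    c->env  = f;
    return c;                          /* c->code(c) gives 1, 2, 3... */
  }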

Is this cost huge? No. But it does make function calls more expensive, which can be felt if you're using a system with a lot of small leaf functions, a common scheme in OO programming.

Closures are terribly useful, but there is that cost. If your language uses them, then of course the cost is justified, since it gets you the feature (closures) that you want. If you're working in a language that doesn't do closures the cost is useless overhead--you pay it because you must pay it (there's no good way to optimize the cost away) but you get no benefit from it.

That's why it's silly for the JVM or .NET to add support for closures. Java doesn't do closures, and none of the current .NET languages do closures. Wedge them into the engine and all of a sudden all the programs running on those VMs suddenly run slower, but get no benefit from it. Could they add that feature in? Sure. But why? To run perl or ruby? Is that really their target audience, and do they (Sun and/or Microsoft) really get any benefit from it?

Personally, I think not and not, respectively.

Posted by Dan at 11:09 AM | Comments (13) | TrackBack