October 31, 2004

spam -- it's not just for breakfast any more!

Nor just in English, not that this is any great surprise to anyone. (And it looks like nearly half the spam my filters catch isn't Latin-1/Unicode. I don't know if this is an argument for or against dumping all the non-Unicode encodings... :)

This piece actually made it through the filters, which was mildly interesting. I'm not 100% sure it actually is spam (my Japanese isn't very good) but after a half hour or so with the dictionary and grammar reference it sure looks like it. I expect I'll poke at it some more, but on the off chance I'm really missing something, anyone care to give it a quick read? I'd hate to misread an actual offer of help. (Though I'm really thinking that this... isn't)

[message chopped out -- yep, it was spam]

(Quick update -- this is, in fact, Unicode, though it's not getting tagged properly for some reason. Text encodings, all of them, suck)

Posted by Dan at 12:42 PM | Comments (5) | TrackBack

October 21, 2004

What the heck is: Finalization

Chris Brumme made a blog posting (ages ago--this has been sitting in my pending queue for a while) that reminded me about this, so I thought I'd go into it before I forgot. (I'd recommend reading that link, too--while it deals only with finalization in a .NET environment, and Microsoft's .NET environment specifically (Mono and dotGNU may well have different details) it gives a good overview of some of the more... interesting issues that can be brought up by finalization)

Anyway, finalization is the process of letting an object that's now dead have one last shot at cleaning up after itself. Finalization is not the same thing as destruction, though the two terms are often used interchangeably, and in many systems they occur together. For the record, while finalization is letting an object clean up after itself, destruction is the system that manages the object reclaiming the resources the object uses.

If you want a concrete example, consider the humble filehandle object. This is an object that represents a file. Moreover, it automatically flushes the buffers and closes the file when the filehandle is no longer referenced. Not unusual behaviour for a filehandle. (Well, at least not in perl. Other languages may vary) The finalization for that object is the closing of the underlying OS file. The destruction of the object is the system deallocating the memory for the buffers and putting the now-dead object on the object free list for later reallocation. Almost all object systems allow you to have one or more finalization methods for an object. These finalizers are optional.
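To make the filehandle example a little more concrete, here's a minimal sketch in Python. The FileHandle class and demo function are made up for illustration, and the "runs as soon as the last reference goes away" behaviour is an assumption about refcounting implementations like CPython, not something every runtime promises. The finalizer only does the flush-and-close part; reclaiming the object's memory (the destruction part) is left to the runtime.

```python
import gc
import os
import tempfile


class FileHandle:
    """Toy filehandle (hypothetical, for illustration only)."""

    def __init__(self, path):
        self.fd = None          # set first so __del__ is safe if open fails
        self.buffer = []
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT)

    def write(self, text):
        # Buffer the data; nothing touches the OS file yet.
        self.buffer.append(text)

    def __del__(self):
        # Finalization: flush the buffer and close the underlying OS file.
        # Freeing the object's memory afterwards is the runtime's job.
        if self.fd is not None:
            os.write(self.fd, "".join(self.buffer).encode())
            os.close(self.fd)
            self.fd = None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        fh = FileHandle(path)
        fh.write("hello")
        del fh        # last reference gone; CPython finalizes right here
        gc.collect()  # nudge for runtimes that don't refcount
        with open(path) as f:
            return f.read()
```

Calling demo() reads back whatever the finalizer flushed, without the program ever closing the file explicitly.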

So, when the system decides the object needs to die, the finalizer is the routine that gets called to do any last gasp cleanup.

Simple, right? Well... for you maybe. Finalizers are one of those things that give folks doing VM and runtime design absolute fits, at least when they're coupled with automatic memory management.

In a language like C++, where objects only die when the code tells them to die, things aren't too bad. (Though there are still issues, or so I'm told) With a system that does more active garbage collection, though, things get nasty. You have issues of reanchoring, finalization time, finalizer object usage, resource allocation, and environment availability. Sort of. Especially when combined with a need for speed and efficiency.

But automatic memory management is so useful that the problems are generally worth it, especially in a multithreaded system where the nondeterminism gets so bad there's no sane way to do your own memory management. (Not that writing a GC for a threaded system's at all easy, but that's a separate problem) Different languages solve the problems in different ways, based on system requirements, the amount of work someone was willing to put into the system, or how much of the problem the designer ultimately understood (or was willing to allow that app programmers would understand). Still, you've got problems.

The Problems, in no particular order

Before going further, it's worth noting that not all these problems affect all systems. Some of them (like reanchoring) are inherent in finalizers, while others, such as resource constraints, are issues because of choices made when designing the GC system. Depending on how the system you're using is implemented you may only have to deal with some of these problems.

Reanchoring

Reanchoring is when an object's finalizer brings it back to life. For example:

 FINALIZE {
   a = global 'Foo'
   a[12] = self
 }

That is, the finalizer for the object goes and looks up a global array and sticks itself into that array. That makes our dead object... not dead. It's now anchored, and if the code that handles calling the finalizer doesn't notice, the object'll get deallocated and the memory thrown into the free pool, and now a[12] holds garbage. Not good, as you might imagine, and detecting it can be tricky in a system that doesn't use refcounts to track object usage. Or expensive. Sometimes both. (The 'easy' way is to set a "mostly dead" flag on objects with finalizers, run the finalizers, and then reclaim only the objects that are still unreachable afterwards. Or use reclaim queues)
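For the curious, the same trick can be tried in a real language. This is a small Python sketch, with the registry dict standing in for the global 'Foo' and the Phoenix class invented for the example. It assumes CPython 3.4 or later, where the runtime does notice the reanchoring and keeps the object alive rather than freeing it out from under you.

```python
import gc

registry = {}  # stands in for the global 'Foo' in the pseudocode


class Phoenix:
    """Hypothetical class whose finalizer reanchors the object."""

    finalized = 0

    def __del__(self):
        Phoenix.finalized += 1
        registry[12] = self  # the dead object sticks itself in a global


def demo():
    registry.clear()
    Phoenix.finalized = 0
    obj = Phoenix()
    obj.name = "still here"
    del obj        # nothing references the object any more
    gc.collect()   # only matters on runtimes that don't refcount
    survivor = registry[12]
    return Phoenix.finalized, survivor.name
```

After demo() the finalizer has run once, and registry[12] holds a perfectly usable object rather than garbage. That's the runtime paying for the detection on every finalized object.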

And, of course, you have issues of safety -- can you actually reanchor an object in the middle of finalization? Often you can't, since the object may well be partially destroyed. This'll happen in those cases where several of an object's finalizer methods have fired and then one of them decides to reanchor. (Since you're firing off all the finalizers in a class' hierarchy -- OO encapsulation makes it such that you really do need to execute all the finalizers the same way you need to execute all the initializers)
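A rough sketch of what "firing off all the finalizers in a class' hierarchy" looks like, in Python. The run_finalizers helper and the finalize method name are assumptions made up for this example (Python itself only calls the most-derived __del__). Each class tears down only its own piece, which is exactly why an object caught halfway through is in no shape to be reanchored.

```python
class Stream:
    def __init__(self):
        self.handle = "open"

    def finalize(self):
        self.handle = None      # tear down only what Stream set up


class BufferedStream(Stream):
    def __init__(self):
        super().__init__()
        self.buffer = ["pending"]

    def finalize(self):
        self.buffer = None      # tear down only what BufferedStream set up


class LoggedStream(BufferedStream):
    pass                        # no finalizer of its own


def run_finalizers(obj):
    """Call every finalizer in the hierarchy, most-derived class first."""
    fired = []
    for cls in type(obj).__mro__:
        finalizer = vars(cls).get("finalize")  # defined on this class only
        if finalizer is not None:
            finalizer(obj)
            fired.append(cls.__name__)
    return fired
```

Between the BufferedStream finalizer and the Stream one, the object has a live handle and no buffer. Reanchor it at that point and whoever finds it later gets a nasty surprise.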

Of course, actually catching a reanchor's a tricky thing too, potentially fairly expensive. You almost want to wrap all visible global objects in virtual saran wrap, so they can be looked at but not touched. Which isn't easy.
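Here's a toy version of the "mostly dead" flag approach, sketched in Python. Everything in it (Obj, Heap, the finalizer signature) is invented for illustration; it's a simple mark pass over an explicit object graph, not how any particular collector works. The point is the second reachability pass after the finalizers have run, which is where the expense comes from.

```python
class Obj:
    def __init__(self, name, finalizer=None):
        self.name = name
        self.refs = []            # outgoing references
        self.finalizer = finalizer
        self.finalized = False    # the "mostly dead" flag


class Heap:
    def __init__(self):
        self.objects = []
        self.roots = []

    def new(self, name, finalizer=None):
        obj = Obj(name, finalizer)
        self.objects.append(obj)
        return obj

    def _reachable(self):
        seen, stack = set(), list(self.roots)
        while stack:
            obj = stack.pop()
            if id(obj) not in seen:
                seen.add(id(obj))
                stack.extend(obj.refs)
        return seen

    def collect(self):
        live = self._reachable()
        # Pass 1: run each pending finalizer once and flag the object.
        for obj in list(self.objects):
            if id(obj) not in live and obj.finalizer and not obj.finalized:
                obj.finalized = True
                obj.finalizer(self, obj)
        # Pass 2: re-check, since a finalizer may have reanchored something.
        live = self._reachable()
        reclaimed = [o.name for o in self.objects if id(o) not in live]
        self.objects = [o for o in self.objects if id(o) in live]
        return reclaimed


def demo():
    heap = Heap()
    global_array = heap.new("globals")
    heap.roots.append(global_array)
    heap.new("quiet", finalizer=lambda h, o: None)
    heap.new("phoenix", finalizer=lambda h, o: global_array.refs.append(o))
    first = heap.collect()       # phoenix reanchors itself and survives
    global_array.refs.clear()    # later, the program drops it again
    second = heap.collect()      # already finalized, so just reclaimed
    return first, second
```

The first collection reclaims only the well-behaved object; the reanchored one goes away on a later pass, without its finalizer running a second time.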

Finalization Time

Finalization time is an issue for realtime systems, but can still crop up other places. This is where the finalizer takes a large, potentially unbounded, amount of time to do its finalization. (Or, worse yet, just up and blocks) Not too big a deal for realtime systems, since if you're working real time you're taking all sorts of precautions anyway, but still... a pain.

The other issue with long finalizers is that generally all other threads in your process trying to allocate resources will be blocked until the finalizer finishes. Finalizers run when objects are being cleaned up, and that generally happens because an allocation has failed. If you have a single finalization thread and multiple 'real' threads (that is, threads actually running the program, as opposed to housekeeping threads like one running the GC) you can stall a good portion of your program unpredictably, which isn't a good thing.
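A small Python sketch of the single-finalization-thread setup shows the stall. The FinalizerThread class is hypothetical, and the sleep just stands in for a finalizer that takes its time; real runtimes have their own machinery for this.

```python
import queue
import threading
import time


class FinalizerThread:
    """One housekeeping thread that runs queued finalizers in order."""

    def __init__(self):
        self._pending = queue.Queue()
        self.done = []
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def schedule(self, name, cleanup):
        self._pending.put((name, cleanup))

    def _run(self):
        while True:
            name, cleanup = self._pending.get()
            try:
                cleanup()
            except Exception:
                pass  # a broken finalizer shouldn't kill the thread
            finally:
                self.done.append(name)
                self._pending.task_done()

    def wait(self):
        self._pending.join()  # block until every queued finalizer has run


def demo():
    finalizers = FinalizerThread()
    finalizers.schedule("slow", lambda: time.sleep(0.05))
    finalizers.schedule("fast", lambda: None)
    finalizers.wait()
    return finalizers.done
```

The fast finalizer can't start until the slow one is done, and anything waiting on the resources the fast one would free waits right along with it.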

Finalizer object usage and resource allocation

One 'fun' issue with finalizers is that they're generally resource-constrained. That is, they have only a limited amount of memory or free objects to access, with that limit often approaching zero. Not too surprising -- the garbage collection run that found the dead objects needing finalization was likely triggered by resource exhaustion. (Not always, of course, since at this point pretty much everyone that doesn't use refcounts does some sort of continuous collection) This makes for fun coding, since it's tough to do much in languages with finalizers that doesn't involve allocating something. (Don't forget, in OO languages your temporaries are all objects, likely dynamically allocated)

Environment Availability

When finalizers are run they're often run with some restrictions. In systems with GC running in a separate thread there are sometimes a lot of restrictions. If you've a language that guarantees thread-local data, well... you don't have access to it in the finalizer, since you're in a different thread. Some languages place restrictions on what each thread can see or touch across threads. And even in languages that don't mind, you've got the issue that what was a single-threaded program is now potentially a multi-threaded program, and you either need fully synchronized access to your data or get ready to core dump.
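The thread-local problem is easy to show in Python. In this sketch the context object, the "user" attribute, and the idea that the second thread is the GC's are all stand-ins made up for the example; the only real piece is threading.local.

```python
import threading

context = threading.local()  # per-thread storage


def finalizer_sees():
    # What a finalizer would find if it went looking for thread-local data.
    return getattr(context, "user", None)


def demo():
    context.user = "dan"     # set in the thread that created the object
    seen = {}

    def gc_thread():
        # Pretend this is the GC thread running the finalizer.
        seen["gc"] = finalizer_sees()

    worker = threading.Thread(target=gc_thread)
    worker.start()
    worker.join()
    seen["main"] = finalizer_sees()
    return seen
```

The value is there in the thread that set it and simply absent in the other one, so a finalizer that depends on it works or fails depending on which thread happens to run it.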

Even without a separate GC thread, often the GC system is running in what's essentially a fragile state, akin to the environment that a Unix signal handler or VMS AST handler has. There are many system and library calls you just can't make, because you can't be sure of the state of the internals of the library or system. (If a GC sweep was triggered because you ran out of something, you might find you ran out in the middle of doing something that's not re-entrant, with data structures half-modified)
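One common way to live with that fragility is the same trick people use in signal handlers: have the finalizer do the bare minimum and leave the real work for later. This is a sketch of that idea in Python, with the Connection class, pending_closes queue, and drain_pending helper all invented for illustration. It's a workaround under those assumptions, not something the runtime gives you.

```python
import collections
import gc

pending_closes = collections.deque()  # created up front, before any cleanup


class Connection:
    """Hypothetical resource whose cleanup is deferred."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def __del__(self):
        # Bare minimum in the finalizer: note what needs cleaning up.
        # No library calls, no locks, no I/O.
        pending_closes.append(self.conn_id)


def drain_pending(close):
    """Run from ordinary program code, where library calls are safe."""
    closed = []
    while pending_closes:
        conn_id = pending_closes.popleft()
        close(conn_id)
        closed.append(conn_id)
    return closed


def demo():
    pending_closes.clear()
    conn = Connection(7)
    del conn
    gc.collect()  # only matters on runtimes that don't refcount
    return drain_pending(lambda conn_id: None)
```

The cost is that the resource stays open until somebody gets around to calling drain_pending, which takes away some of what made finalizers attractive in the first place.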

All of this is why people writing VMs and object systems in general have a hate/hate relationship with finalizers. If only the darned things weren't so useful...

Posted by Dan at 01:57 PM | Comments (9) | TrackBack

October 07, 2004

Damn, I miss New York sometimes

How can you not?

Posted by Dan at 02:06 PM | Comments (0) | TrackBack

October 06, 2004

A vast expanse of nothingness

That's what I get for being too darned busy -- MT eventually ages things off to nothing. Gotta dig into CSS at some point to see if I can't get it to keep the right column of stuff over on the right hand side, rather than do that nasty slide-over thing. (The one that's especially bad if there's more stuff in the side column than in the main text)

Anyway, for those keeping track at home, last weekend's workshop went reasonably well. There were 20-25 people if I counted right, without too many people bailing by the end. It may not have been what everyone was expecting, as it was lighter on the perl & python than I think was generally figured. Still, went well. I'm now pretty sure that a lot of parrot's fundamentals are sound, and I'm equally sure that there's a lot of stuff we still need to work on. Some of it's arguably library code, like IO, but we've got a really fuzzy core/library boundary. (Which, bluntly, is the way it ought to be. The distinction is far less necessary when dealing with an all-software setup than it is with hardware)

Given time, I think I could easily work up a full week "programming parrot" class. The slides I had could've been done in six hours rather than three with no loss of anything. With more explanations, examples, and exercises a 5 day class wouldn't be at all out of line. You, too, can learn what you need to write a compiler for your own in-house 4GL-from-hell that you're looking to replace with something that doesn't suck! Which is a task not to be overlooked, BTW. A lot of places have a large mass of code in one dead language or another, and being able to keep that code base going while allowing callouts to other languages is a really useful thing. (Imagine taking all your in-house RPG or MUMPS code and being able to refactor it to perl or python, or keep it as-is and do new work in something else)

Parrot, meanwhile, rolls on. There should be a 0.1.1 release this weekend. 0.1.0 was done back in the early spring, and a lot's happened since then. We don't think about it much, since we're all used to just sync-ing up to the CVS server (which anyone can do -- there's full anon CVS access and rsync access to the repository) but a lot of folks like having a stable release. So... 0.1.1. A lot's happened since 0.1.0, so it ought to be fun to read the release notes.

I've also done a little bit of maintenance on the back end here. All posts from 2003 and before have been closed to comments. I got smacked by what amounted to a DDOS attack today, with spam posts hitting about every 20 seconds. MT-Blacklist caught them all (which was really nice) but between the load it places on the server and the load from the constant stream of inbound viruses and spam (which are also caught by ClamAV and SpamAssassin) the box got tipped over into swap and performance was shot to hell. Which I can generally cope with, but the box is the NAT gateway and name server for the machines at the house, so there wasn't much web surfing and mail pickup going on. That needed fixing. (At some point soon I'm going to throw another half-gig of memory into the server box to help cope with the next inevitable wave of internet vandalism or crapstorm)

Posted by Dan at 01:42 PM | Comments (4) | TrackBack