October 21, 2004

What the heck is: Finalization

Chris Brumme made a blog posting (ages ago--this has been sitting in my pending queue for a while) that reminded me about this, so I thought I'd go into it before I forgot. (I'd recommend reading that link, too--while it deals only with finalization in a .NET environment, and Microsoft's .NET environment specifically (Mono and dotGNU may well have different details), it gives a good overview of some of the more... interesting issues that can be brought up by finalization)

Anyway, finalization is the process of letting an object that's now dead have one last shot at cleaning up after itself. Finalization is not the same thing as destruction, though the two terms are often used interchangeably, and in many systems they occur together. For the record, while finalization is letting an object clean up after itself, destruction is the system managing the object reclaiming the resources it uses.

If you want a concrete example, consider the humble filehandle object. This is an object that represents a file. Moreover, it automatically flushes the buffers and closes the file when the filehandle is no longer referenced. Not unusual behaviour for a filehandle. (Well, at least not in perl. Other languages may vary) The finalization for that object is the closing of the underlying OS file. The destruction of the object is the system deallocating the memory for the buffers and putting the now-dead object on the object free list for later reallocation. Almost all object systems allow you to have one or more finalization methods for an object. These finalizers are optional.
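
To make the split concrete, here's a minimal Java-flavoured sketch (a hypothetical class, not any particular library's API): the body of the finalizer is the finalization, while the VM reclaiming the object's memory afterwards is the destruction, and no code of ours runs for that part.

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Hypothetical filehandle-ish class: the finalizer is the "last shot at
    // cleaning up" -- it flushes the buffers and closes the underlying OS file.
    // Destruction (reclaiming the object's and buffer's memory) is the VM's
    // job and happens afterwards, with no code of ours involved.
    public class FileHandle {
        private BufferedOutputStream out;

        public FileHandle(String path) throws IOException {
            out = new BufferedOutputStream(new FileOutputStream(path));
        }

        public void write(byte[] data) throws IOException {
            out.write(data);
        }

        @Override
        protected void finalize() throws Throwable {
            try {
                if (out != null) {
                    out.flush();   // push any buffered data out to the OS
                    out.close();   // close the OS-level file descriptor
                    out = null;
                }
            } finally {
                super.finalize();  // let Object's finalizer run too
            }
        }
    }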

So, when the system decides the object needs to die, the finalizer is the routine that gets called to do any last-gasp cleanup.

Simple, right? Well... for you maybe. Finalizers are one of those things that give folks doing VM and runtime design absolute fits, at least when they're coupled with automatic memory management.

In a language like C++, where objects only die when the code tells them to die, things aren't too bad. (Though there are still issues, or so I'm told) With a system that does more active garbage collection, though, things get nasty. You have issues of reanchoring, finalization time, finalizer object usage, resource allocation, and environment availability. Sort of. Especially when combined with a need for speed and efficiency.

But automatic memory management is so useful that the problems are generally worth it, especially in a multithreaded system where the nondeterminism gets so bad there's no sane way to do your own memory management. (Not that writing a GC for a threaded system's at all easy, but that's a separate problem) Different languages solve the problems in different ways, based on system requirements, the amount of work someone was willing to put into the system, or how much of the problem the designer ultimately understood (or was willing to allow that app programmers would understand). Still, you've got problems.

The Problems, in no particular order

Before going further, it's worth noting that not all these problems affect all systems. Some of them (like reanchoring) are inherent in finalizers, while others, such as resource constraints, are issues because of choices made when designing the GC system. Depending on how the system you're using is implemented you may only have to deal with some of these problems.

Reanchoring

Reanchoring is when an object's finalizer brings it back to life. For example:

    FINALIZE {
        a = global 'Foo'
        a[12] = self
    }

That is, the finalizer for the object goes and looks up a global array and sticks itself into that array. That makes our dead object... not dead. It's now anchored, and if the code that handles calling the finalization doesn't notice, the object'll get deallocated and the memory thrown into the free pool, and now a[12] holds garbage. Not good, as you might imagine, and detecting it can be tricky in a system that doesn't use refcounts to track object usage. Or expensive. Sometimes both. (The 'easy' way is to have a "mostly dead" flag you set on objects with finalizers; if the object is still unreachable after the finalizers have run, then you reclaim it. That, or use reclaim queues)
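
In Java terms, a reanchoring finalizer looks roughly like the sketch below (the static field stands in for the global array above; note that Java only ever finalizes a given object once, so a second death won't re-run it).

    // Sketch of reanchoring ("resurrection") in Java: the finalizer stores a
    // reference to the dying object in a global, so the object is reachable
    // again after the collector had already decided it was dead.
    public class Zombie {
        static Zombie resurrected;     // stands in for the global a[12] above

        @Override
        protected void finalize() throws Throwable {
            resurrected = this;        // reanchor: the "dead" object is live again
            super.finalize();
        }

        public static void main(String[] args) throws Exception {
            new Zombie();              // immediately unreachable
            System.gc();               // request a collection (not guaranteed)
            System.runFinalization();  // encourage the finalizer to run
            Thread.sleep(100);
            System.out.println("resurrected = " + resurrected);
        }
    }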

And, of course, you have issues of safety -- can you actually reanchor an object in the middle of finalization? Often you can't, since the object may well be partially destroyed. This'll happen in those cases where several of an object's finalizer methods have fired and then one of them decides to reanchor. (Since you're firing off all the finalizers in a class' hierarchy -- OO encapsulation makes it such that you really do need to execute all the finalizers the same way you need to execute all the initializers)

Of course, actually catching a reanchor's a tricky thing too, potentially fairly expensive. You almost want to wrap all visible global objects in virtual saran wrap, so they can be looked at but not touched. Which isn't easy.

Finalization Time

Finalization time is an issue for realtime systems, but can still crop up other places. This is where the finalizer takes a large, potentially unbounded, amount of time to do its finalization. (Or, worse yet, just up and blocks) Not too big a deal for realtime systems, since if you're working real time you're taking all sorts of precautions anyway, but still... a pain.

The other issue with long finalizers is that generally all other threads in your process trying to allocate resources will be blocked until the finalizer finishes. Finalizers run when objects are being cleaned up, and that generally happens because an allocation has failed. If you have a single finalization thread and multiple 'real' threads (that is, threads actually running the program, as opposed to housekeeping threads like one running the GC) you can stall a good portion of your program unpredictably, which isn't a good thing.

Finalizer object usage and resource allocation

One 'fun' issue with finalizers is that they're generally resource-constrained. That is, they have only a limited amount of memory or free objects to access, with that limit often approaching zero. Not too surprising -- the garbage collection run that found the dead objects needing finalization was likely triggered by resource exhaustion. (Not always, of course, since at this point pretty much everyone that doesn't use refcounts does some sort of continuous collection) This makes for fun coding, since it's tough to do much in languages with finalizers that doesn't involve allocating something. (Don't forget, in OO languages your temporaries are all objects, likely dynamically allocated)
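
As an illustration (hypothetical class, Java-flavoured), even a seemingly harmless finalizer like this one allocates several objects along the way:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: a finalizer that allocates while "cleaning up". If the collection
    // that found this object was triggered by memory exhaustion, the String it
    // builds and the list growth below are exactly the allocations that may fail.
    public class ChattyHandle {
        private static final List<String> closedLog = new ArrayList<>();
        private final int id;

        public ChattyHandle(int id) { this.id = id; }

        @Override
        protected void finalize() throws Throwable {
            try {
                String msg = "released handle #" + id;  // builds a StringBuilder and a new String
                synchronized (closedLog) {
                    closedLog.add(msg);                  // may have to grow the backing array
                }
            } finally {
                super.finalize();
            }
        }
    }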

Environment Availability

When finalizers are run they're often run with some restrictions. In systems with GC running in a separate thread there are sometimes a lot of restrictions. If you've a language that guarantees thread-local data, well... you don't have access to it in the finalizer, since you're in a different thread. Some languages place restrictions on what each thread can see or touch across threads. And even in languages that don't mind, you've issues where what was a single-threaded program is now potentially a multi-threaded program, and you either need fully synchronized access to your data or get ready to core dump.
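
Here's a quick Java-flavoured sketch of the thread-local half of that (hypothetical code): finalize() runs on the JVM's finalizer thread, so a value the creating thread stashed in a ThreadLocal simply isn't visible there.

    // Sketch: finalize() runs on the finalizer thread, not the thread that
    // created the object, so thread-local state set by the creator is invisible.
    public class TlsVictim {
        private static final ThreadLocal<String> context = new ThreadLocal<>();

        @Override
        protected void finalize() throws Throwable {
            // On the finalizer thread this prints "context = null", even though
            // the creating thread set a value in main() below.
            System.out.println("context = " + context.get());
            super.finalize();
        }

        public static void main(String[] args) throws Exception {
            context.set("per-thread state the finalizer will never see");
            new TlsVictim();           // immediately unreachable
            System.gc();               // request a collection (not guaranteed)
            System.runFinalization();
            Thread.sleep(100);         // give the finalizer thread a chance to run
        }
    }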

Even without a separate GC thread, often the GC system is running in what's essentially a fragile state, akin to the environment that a Unix signal handler or VMS AST handler has. There are many system and library calls you just can't make, because you can't be sure of the state of the internals of the library or system. (If a GC sweep was triggered because you ran out of something, you might find you ran out in the middle of doing something that's not re-entrant, with data structures half-modified)

All of this is why people writing VMs and object systems in general have a hate/hate relationship with finalizers. If only the darned things weren't so useful...

Posted by Dan at October 21, 2004 01:57 PM

Comments

Now that you have outlined the problems with finalization, perhaps a post on commonly used solutions (or even just Parrot's solution) would be enlightening... Please? and a ponie too?

Posted by: Matt Fowles at October 21, 2004 04:01 PM

Why exactly are finalizers useful? Are there any good reasons why a language should have them at all?

I only know a bad reason: because people want to deallocate resources in the finalizer. This is a bad idea because there is no guarantee when the finalizer will be executed. Thus it can cause bugs that depend on the behaviour of the GC and are extremely hard to find. You could see deallocating in the finalizer as defensive programming: you improve the probability that a user notices a bug in your program. But it also reduces the probability that anyone will catch the (heisen-)bug...

Posted by: Tim Jansen at October 25, 2004 09:47 AM

I guess you know it anyway, but here's what Matz did for Ruby: he said that finalizers are just closures "attached" to the object, and the scope in which those closures were defined "should not" contain the object itself... i.e. you can't just reference "self" in the finalizer.

And he made finalizer syntax really clunky to discourage their use in Ruby, too.

That said, I don't ever use finalizers in Ruby. Closures do fine for scope-guard stuff like filehandles, mutexes etc.

Posted by: Vladimir Slepnev at October 25, 2004 02:49 PM

Finalizers are useful for cleaning up resources in the same way that garbage collection is good for cleaning up memory. While it's not as common to leak resources as it is to leak memory it still happens, and if your resources are first-class entities (such that you can store them in globals or pass them around as any other variable) it gets really tough to tell when, for example, a filehandle is no longer used or a database connection is unreferenced.

Finalizers let you attach death actions on those resources. Rather than closing a filehandle at a fixed point in your code, you close it when the filehandle's no longer referenced. Yeah, you lose a certain amount of predictability--you're trading off knowing when a resource gets cleaned up for an assurance that it will get cleaned up safely.

And while there's no guarantee of when the finalizer gets run, any system worth anything will guarantee that they do eventually get run, short of a hard process-kill or process self-destruction. If you have finalizers on objects, the system has to run through those finalizable objects when the image exits.

Posted by: Dan at October 25, 2004 03:17 PM

If you rely on deallocating file handles in a finalizer, how can you be sure that this happens before the per-process file handle limit has been reached? If you deallocate database connections in the finalizer, how do you prevent the system from running out of TCP ports?

If you give up knowing when a resource gets cleaned up, you also risk losing control over how many resources you allocate simultaneously.

There are some cases for which a finalizer is a *possible* solution. For instance, if your whole program allocates only a single file handle. But there are better solutions that are simpler and less dangerous than finalization.

The right way to guarantee resource deallocation is to rely on some longer-running code entity that will clean up after the resource-using code entity has finished. If you use closures for resource management, the function that executes the closure should clean up. If you use C#'s 'using' statement, the compiler generates code to clean up. In Java, you need to do it manually in a 'finally' clause. And external resources like network connections and file handles are deallocated by the operating system. All these solutions allow resource deallocation in a controllable and predictable way.
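
In Java that looks roughly like this (a minimal sketch):

    import java.io.FileInputStream;
    import java.io.IOException;

    // Deterministic cleanup: the 'finally' clause closes the handle at a known
    // point, whether or not an exception was thrown -- no waiting on the GC.
    public class CopyBytes {
        static int firstByte(String path) throws IOException {
            FileInputStream in = new FileInputStream(path);
            try {
                return in.read();
            } finally {
                in.close();   // always runs, immediately, on this thread
            }
        }
    }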

Automatic garbage collection would only be a viable option if the GC were intelligent enough to know about a shortage of these other resources. Mis-using a memory garbage collector for collecting file handles is as smart as using a memory garbage collector that only runs when the system is running out of file handles.

Posted by: Tim Jansen at October 26, 2004 07:59 AM

Relying on finalizers to release resources is a recipe for disaster. As soon as the object creation rate is higher than some magic number (in relation to the available heap and the number of other live objects) you will suffer resource exhaustion.
The (in my view) only sane way to use a finalizer in relation to scarce resource deallocation is to:
0) Try to ensure it's never really necessary by releasing resources when you don't need them any more.
1) If the resource is still owned by the object, then print/log a big WARNING that the resource was released by a finalizer
2) Release the resource
3) Take heed of the warnings and make sure they disappear

With heavy and thorough load testing this will make sure that the probability of resource exhaustion bugs is quite low. Corner cases can always exist, but unless you are having very bad luck, the GC will be there to help you if you hit one of them in production...
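
A rough Java sketch of that safety-net pattern (the names are made up):

    import java.io.FileInputStream;
    import java.io.IOException;

    // Sketch of the warn-and-release pattern: callers are expected to call
    // close() themselves (step 0); the finalizer only logs a loud warning and
    // releases the handle if they forgot (steps 1 and 2).
    public class GuardedFile {
        private FileInputStream in;

        public GuardedFile(String path) throws IOException {
            in = new FileInputStream(path);
        }

        public synchronized void close() throws IOException {
            if (in != null) {
                in.close();
                in = null;
            }
        }

        @Override
        protected void finalize() throws Throwable {
            try {
                if (in != null) {
                    System.err.println("WARNING: GuardedFile leaked; released by finalizer");
                    close();
                }
            } finally {
                super.finalize();
            }
        }
    }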

Posted by: Mikael at October 26, 2004 06:23 PM

"And while there's no guarantee of when the finalizer gets run, and system worth anything will guarantee that they do eventually get run"

Java has had runFinalizersOnExit() since the beginning, but it's deprecated now. What "systems worth anything" are you talking about?

I tend to think that finalizers are seductive but mostly useless (like resumable exceptions). There are much better mechanisms for resource management: Ruby uses closures, C++ uses RAII, etc.

Posted by: Vladimir Slepnev at October 27, 2004 11:50 AM

Tim wrote: "Mis-using a memory garbage collector for collecting file handles is as smart as using a memory garbage collector that only runs when the system is running out of file handles"

Maybe I'm being a bit naive here, but most of the GC system doesn't know that it's only concerned with memory: DOD tells you which objects are dead, and post-finalised dead objects may later have their memory reclaimed/reused.

This is important because we can trigger DOD (and finalizers) when *any* resource-acquire fails: not just when memory-acquire fails (and we don't need to wait for the actual failure, of course). By separating "what objects are dead" from "why do we need to know", we can manage any resource, not just memory.
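
Something like this sketch of the idea (made-up Java, not anything the post proposes):

    import java.io.FileInputStream;
    import java.io.IOException;

    // Sketch: treat "out of file handles" the way we treat "out of memory" --
    // if an acquire fails, run dead-object detection and finalization (which
    // closes abandoned handles), then retry once.
    public class RetryOpen {
        static FileInputStream open(String path) throws IOException {
            try {
                return new FileInputStream(path);
            } catch (IOException probablyOutOfDescriptors) {
                System.gc();               // ask for a collection/DOD run
                System.runFinalization();  // encourage finalizers to close handles
                return new FileInputStream(path);   // retry after finalizers ran
            }
        }
    }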

One thing that occurs to me though: we don't use a finalizer to reclaim memory, so why do we need a finaliser to reclaim other resources? Or perhaps we should use finalisers to release memory, too.

Posted by: Dave Whipp at October 27, 2004 04:45 PM

There is a good example of this tricky issue in the standard Python library for doing XML-RPC. Underneath the hood there is a socket, on top of which is an HTTPConnection and then an XML-RPC request handler. The code is ultimately a mess because http connections can have multiple transactions (or not), and you don't know when everyone who has an interest in the connection is no longer interested since the HTTP headers and body are read by different pieces of code. Eventually someone implemented another layer of reference counting in Python code itself, and only allowed one transaction per connection. There are many many pieces of code calling close() on all the various objects.

I spent two weeks trying to make the server part work over HTTPS before giving up. It was just too hard trying to fight all the code.

If Python had reliable finalizers, then none of this would have been an issue.

Posted by: Roger Binns at October 31, 2004 01:24 AM