June 06, 2003

What the heck is: Garbage Collection

And, more importantly, why should you really care?

Nearly all programs create garbage as they run[1], and nearly all programs clean up that garbage in one way or another. Garbage Collection is what we call it when the cleanup is done automatically. The automatically part is what makes garbage collection interesting and useful.

In a language like C, you can explicitly allocate memory from the system with the malloc function. It takes a size and hands you back a pointer to the newly allocated memory. When your program is done with the memory, it calls the free function, passing in a pointer to the memory that was allocated, and the memory is given back to the system. Nice, simple, straightforward. And entirely manual.
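
For the C-inclined, the whole manual cycle fits in a few lines (a minimal sketch, nothing fancy):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Ask the system for 64 bytes. malloc returns NULL if it can't oblige. */
        char *buffer = malloc(64);
        if (buffer == NULL)
            return 1;

        strcpy(buffer, "hello, heap");
        printf("%s\n", buffer);

        /* Give the memory back. Touching buffer after this point is a bug. */
        free(buffer);
        return 0;
    }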

The big drawback to the manual system is that if you make a mistake, Bad Things Happen. Freeing memory too soon means your program has a pointer to memory it no longer owns. Using that pointer will cause your program to crash as it tries to access memory outside its region. Or, worse yet, access memory that's been recycled and used for other things. Freeing memory too late, or not at all, means that your program leaks memory, leaving chunks of unused and unusable memory lying around. This is less of a problem, as at least you won't crash, but it will mean the system eventually runs out of memory to hand out, and your program will fail.

How do you then track allocated memory to make sure that you hold onto it as long as necessary, but no longer? The way a lot of programs do it is with something called reference counting. Each chunk of memory has an integer counter attached to it. Each time your program makes reference to the memory (generally by storing a pointer to it somewhere) the count is incremented. Each time a reference is deleted the reference count is decremented. When the reference count goes to zero, the memory is freed.
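
A stripped-down sketch of that scheme in C, with invented names, looks something like this:

    #include <stdlib.h>

    typedef struct {
        int    refcount;   /* the counter attached to the chunk */
        size_t size;
        char   data[];     /* the memory the program actually uses */
    } RefCounted;

    /* Allocate a chunk with its counter starting at one reference. */
    RefCounted *rc_alloc(size_t size) {
        RefCounted *chunk = malloc(sizeof(RefCounted) + size);
        if (chunk != NULL) {
            chunk->refcount = 1;
            chunk->size = size;
        }
        return chunk;
    }

    /* Called every time a new pointer to the chunk is stored somewhere. */
    void rc_incref(RefCounted *chunk) {
        chunk->refcount++;
    }

    /* Called every time a pointer to the chunk is dropped; frees at zero. */
    void rc_decref(RefCounted *chunk) {
        if (--chunk->refcount == 0)
            free(chunk);
    }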

Unfortunately reference counting has a number of problems associated with it. First, of course, your code needs to fiddle with the reference count properly, and everywhere. This is, in a large program, potentially error-prone--memory may live forever, or get deleted too soon. Both are bad, leaving you with the leak or segfault problems. It also clutters your code with refcount fiddling calls and, even if you reduce them to macros or function calls, it's more code to wade through when you're writing and debugging the code.

The other big problem is the runtime cost. All your memory allocations need to be bigger, to hold the reference count. Your code is doing a lot of math and, while it's simple addition and subtraction, it takes time. And it means your deallocation system may need to be fancy, since if a chunk of memory that gets deallocated has pointers to other pieces of memory, those pieces need their refcounts manipulated. (While this is a cost you have to pay in the manual case too, a refcount-based system is usually a general-purpose thing, so the deallocator is a generic one rather than a special-purpose routine with deep knowledge of the data being freed, and it has to think more when deallocating things.)
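
To make the fancy-deallocator point concrete: if the dying chunk itself holds pointers, the decrement has to cascade. A rough sketch, again with invented names:

    #include <stdlib.h>

    typedef struct Node {
        int          refcount;
        struct Node *left;     /* pointers this chunk holds to other chunks */
        struct Node *right;
    } Node;

    void node_decref(Node *node) {
        if (node != NULL && --node->refcount == 0) {
            /* The children each lose a reference when their parent dies,
               which may in turn free them, and their children, and so on. */
            node_decref(node->left);
            node_decref(node->right);
            free(node);
        }
    }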

All this is definitely a pain. (Been there, done that, got the T-shirt) It's far better than leaking memory, though, and it does have the twin advantages of being conceptually simple and very portable. It's the system that many of the interpreted languages such as perl and python use, as well as some of the compiled languages like Objective C. (Though there it's as much a function of the Objective C object library as of the language itself; either way, it's what's in widespread use)

The alternative is real, fully automatic, garbage collection. What a garbage collector does is track all your program's allocations, then look through all your data and figure out which chunks of memory are in use and which aren't. When it finds pieces of memory that aren't used, it releases them to the system for later reallocation.

Garbage collection always starts by assuming everything is dead. It then begins with what's called the root set--a set of things that are always considered alive. Anything the root set references is marked as living. Then everything that this newly declared "living" memory references is considered alive, so then we walk through that memory, and so on. By the time we're done, anything that's still dead is unreferenced and can be safely cleaned up. In object systems that have destructors, finalizers, last-gasp methods, or other things that get called right before something gets killed, this is the time it happens.
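
A compressed sketch of that cycle in C, with made-up structure names, might look like:

    #include <stdbool.h>
    #include <stdlib.h>

    #define MAX_CHILDREN 8

    typedef struct Object {
        bool           alive;                   /* the mark bit */
        struct Object *children[MAX_CHILDREN];  /* references this object holds */
        struct Object *next;                    /* every allocation sits on one big list */
    } Object;

    static void mark(Object *obj) {
        if (obj == NULL || obj->alive)
            return;              /* already visited; this also keeps cycles from looping */
        obj->alive = true;
        for (int i = 0; i < MAX_CHILDREN; i++)
            mark(obj->children[i]);
    }

    static void sweep(Object **all_objects) {
        Object **link = all_objects;
        while (*link != NULL) {
            Object *obj = *link;
            if (obj->alive) {
                link = &obj->next;
            } else {
                *link = obj->next;   /* unlink the dead object and hand it back */
                free(obj);
            }
        }
    }

    void collect(Object **all_objects, Object **roots, int nroots) {
        /* Step 1: assume everything is dead. */
        for (Object *o = *all_objects; o != NULL; o = o->next)
            o->alive = false;
        /* Step 2: mark everything reachable from the root set, and from there on out. */
        for (int i = 0; i < nroots; i++)
            mark(roots[i]);
        /* Step 3: anything still dead is unreferenced and safe to clean up. */
        sweep(all_objects);
    }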

Garbage collection is a nice thing. Your main program code doesn't have to worry about tracking memory, since the garbage collection system handles it for you. That means less bookkeeping code to write and, almost as importantly, less bookkeeping code to execute. Your code allocates memory when it needs to, and puts references to that memory wherever it wants. There's no tracking, counting, or reference work.

A garbage collector is not a simple thing to write, not by any means. They're very system specific in places, and efficient collectors are generally complex, low-level, nasty pieces of code. That's the big downside. The garbage collection code is usually small and isolated, however, which offsets this to some extent. Someone still needs to understand it, but only the one or two people who work on it; the rest of the people on the project don't have to worry about it at all. Memory allocation bugs are also isolated to a small section of your code--the memory allocator and the garbage collector--rather than scattered across the entire code base.

Luckily there are a number of pre-made garbage collection systems available free for use, such as the Boehm collector, that already handle all the nasty bits. You plug them into your program and they just work. Pre-made collectors tend to be more conservative and a bit slower than a hand-rolled custom collector, as they can't have as much knowledge about your program as you do, but the advantages involved in using a mature code base and just not having to bother are significant. While personally I like writing GC code, if you don't I'd recommend grabbing the Boehm collector.
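
Plugging in the Boehm collector is about as painless as advertised. A minimal example, assuming the usual libgc entry points (the header name and link flags vary a little by platform; typically you build with -lgc):

    #include <gc.h>      /* on some installs this is <gc/gc.h> */
    #include <stdio.h>

    int main(void) {
        GC_INIT();       /* recommended before the first collected allocation */

        for (int i = 0; i < 1000000; i++) {
            /* Allocate from the collected heap. There is no matching free;
               blocks that become unreachable are reclaimed automatically. */
            int *numbers = GC_MALLOC(100 * sizeof(int));
            numbers[0] = i;
        }

        printf("heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
        return 0;
    }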

The big downside to garbage collection is indeterminacy. Your program holds on to memory longer than it otherwise might, can't be sure of how long an allocation might take (since allocating memory may require firing off a garbage collection run if your program runs out), and can no longer predict exactly when some resource that requires active destruction may be cleaned up. While a number of the garbage collection schemes have bounded worst-case memory usage and collection times, their worst case performance (either in memory usage or pause) is generally worse than a refcounting scheme.

Interestingly, depending on your program's allocation patterns, a garbage collection system can have better average and best case performance than a refcounting system. (Both of which are beaten by a manual tracking system's best case performance, though the complexity of a fully manual system generally makes it unsuitable for large systems just because it's such an enormous pain to get it right) A GC system doesn't need to allocate a counter word per allocation and doesn't need to twiddle that counter word, and a GC's code is normally more cache friendly.

Refcount schemes can, of course, have lengthy pauses of their own--freeing up the root node of a structure with a million or more elements will make your program stall. A refcounting system also spends significantly more overall time doing memory management tasks, usually twice as much (except in the most pathological cases) so a program with a real GC will generally run a bit faster than one with a refcount system. Not hugely faster, since your program ought not be spending all that much time in the memory allocation system, but every percent or two counts.

[1] I do actually know of a few non-trivial programs that, when running, do not allocate any dynamic memory at all outside a very limited, and tightly bounded, bit of stack memory. (And I could probably, with work, put my hands on a few that don't use stack, either) They're quite rare, and somewhat special-purpose, but they do exist.

Posted by Dan at June 6, 2003 06:37 PM | TrackBack (3)
Comments

This is an excellent introduction. Short, self-contained, easy to follow, with the major trade-offs enumerated. Very well done, great job!

While I prefer garbage collection, its nondeterministic finalization keeps biting me. I would like to be able to explicitly (and implicitly, via stack-based objects [in the lexical sense]) tell the GC to synchronously finalize an object.

I'm fine if this is merely a "hint", like many GCs' explicit "run now" calls, of what the GC should do. I understand it's ultimately the GC that needs to make the decision, and the programmer may not see the entire picture.

Posted by: rentzsch at June 6, 2003 09:54 PM

There are two ways to synchronously finalize an object. First, of course, is explicit finalization--you exit the scope, and the system kills the variables. Less than optimal if something could still have a handle on them, but it works fine in languages that can do sufficient tracing and have sufficient control over themselves to properly nuke things.

The alternate is to run a conditional GC sweep at block exit, which is the scheme we're working with for Parrot. Variables can mark themselves as "impatient", and we keep an impatient counter. Conditional sweeps only fire off if there are live impatient data, something the sweep recalculates each time it runs (it's a flag in each PMC) so that when there are no impatient variables the sweep is skipped.
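
Roughly the shape of it, with invented names (this is just an illustration, not the real Parrot code):

    /* All names invented for illustration. */
    static unsigned int impatient_count = 0;   /* live impatient variables seen by the last sweep */

    static void dead_object_detection(void) {
        unsigned int found = 0;
        /* ... trace from the root set, marking live data and bumping `found`
           for each live thing carrying the impatient flag ... */
        impatient_count = found;
    }

    /* Emitted (conceptually) at block exit: if the last sweep left no
       impatient data alive, the whole thing costs a single compare. */
    void maybe_sweep_at_block_exit(void) {
        if (impatient_count > 0)
            dead_object_detection();
    }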

Not great, but not bad.

Posted by: Dan at June 6, 2003 10:56 PM

A really nice overview / intro. Some details and issues that I've not seen anywhere before. Keep up with these "What the heck is" articles, they really are top class.

Posted by: Mark at June 7, 2003 08:43 PM

"important" variables trigger GC at scope end?

So, if I have something like:


{
    my $logfile = ImportantFileHandle::new("...");   # placeholder filename
    while (<$logfile>) {
        unless (/^#/) {
            chomp;
            processline($_);
        }
    }
    # ImportantFileHandles close at scope exit
}

I'm not claiming that this is good code, mind you, but it might not be uncommon to open a file at the beginning of the program and hold it open until program end.

Does this mean that a GC is going to be called for every line in the log file, because the count of "important" variables will always be at least 1?

Posted by: Buddha Buck at June 9, 2003 10:28 AM

Good article.

In the defense of refcounting (which I am not a big fan of either), there are some ways the cost associated with updating the refcount of a variable in a loop (or some similar situation) can be greatly reduced, by delaying the update work to the end of the scope, for example. Also, you say: "your code needs to fiddle with the reference count properly". I found this sentence to be somewhat misleading, giving the impression that the 'user code' might have to deal with refcounts, while it is only the compiler-generated code that will have to deal with it.

A really big problem of refcounting that you didn't mention is its inability to reclaim cycles. That may in the end be the biggest defect of refcounting. Perl programmers are aware of that constraint in perl, and some think it is inherent to GC, but the defect isn't present when using other collection techniques.

About indeterminacy in time and space of garbage collection: that problem has been solved. It surprised me too, but there really are techniques for GC bounded in time and space, executing in hard real-time concurrently with the user's program, even on SMP machines while staying scalable. An interesting article about that from Perry Cheng is available at the following URL:

citeseer.nj.nec.com/cheng01scalable.html

BTW I think that you do a really good job of presenting GC as a good thing, which everyone should agree it is.

keep up the good work!

Posted by: Guillaume at June 9, 2003 01:56 PM

Importance declarations may cause end-of-block checking if the language so chooses, which can definitely cause pathological behavior. There are ways around this--for example the language can defer marking a variable as important until the end of the block in which the variable is created or assigned to (if we get clever and use return value hints on subs, methods, and functions), or it can skip the conditional DOD sweep at the end of a loop's block, deferring it until the end of the loop as a whole, or only doing it every n iterations, or something of the sort.

It is definitely possible to get degenerate behaviour out of this mechanism, but one of the things it does have going for it is that very few objects in the normal course of processing will get marked as immediate, as it isn't that common a thing. Filehandles may not by default, for example. We may also defer it to sub exit rather than block exit, or tune it in some way, or something of the sort.

Posted by: Dan at June 9, 2003 02:20 PM

Nice article. Explanation at a level I could follow. However, it did set me thinking about "a better way".

I've got considerable C skills from gui apps to device drivers, but no compiler writing, so anything I think is probably old hat, off-base or just simplistic, but the notion won't leave my head until someone says "that'll never work", so here it is.

Reference counting is liked because of timely destruction, and hated because it adds overhead--both memory and math--to every variable access.

Mark&sweep is favoured because it removes that overhead in favour of small bursts of intense activity at times convenient to the er... compiler? interpreter? runtime? The thing that worries people about this is the "at some time convenient to some indeterminate heuristic" part.

The thought that struck me was the C concept of automatic variables. They come into existence at exactly the level of scope that they need to, and they are automatically cleaned up when they go out of scope. So simple. Garbage Collection becomes a simple SP+n operation.

The problem is that they *can't* exist beyond their scope. The old "don't return references to stack vars" problem. And that's when it struck me: why bother GCing *everything*?

Why not allocate local/lexical scoped variables from a "stack" and only promote them to a "referenced variable heap" if the code actually takes a reference to them?

The testing for which type any given variable is would be restricted to code generation time.

At runtime, the code would have been generated to deal with each individual var in the appropriate manner. Un-referenced vars being accessed via a "stack pointer" with no reference counting necessary, and automatically GC'd en-masse by a one-off SP manipulation at scope exit.

Referenced variables would be accessed from the heap, with whatever reference counting/mark&sweep mechanism was chosen.

This would benefit the majority of short-lived, locally scoped vars as they would never need to be GC'd. And the GC benefits by processing only those vars that have had references taken, reducing both the volume and frequency of GCing.

Talking about a "stack", when the underlying subject is Parrot, which has been designated a register-based VM, may cause some handwaving. Are the Parrot VM "registers" actually allocated once at startup, or does the block of memory where the registers live move as scopes are entered and left? The latter is what I envisage, but I'm still trying to wrap my brain around the code to confirm my mental picture one way or the other.

Posted by: Buk at July 23, 2003 06:10 AM

Languages without closures, coroutines, or continuations can immediately destroy variables that go out of scope if the compiler's determined that the variables are no longer referenced. Static analysis of the code can determine this with some certainty. (Variables fall into one of three categories--definitely referenced, definitely not referenced, and maybe referenced) Most do, since it makes things go faster--less stuff to trace, and it's a compile-time determination.

Why not allocate local/lexical scoped variables from a "stack" and only promote them to a "referenced variable heap" if the code actually takes a reference to them?

In the presence of continuations (and to a lesser extent coroutines and closures, especially with dynamic compilation), all local/lexical variables are potentially referenced. You can't determine at compile time what is and isn't referenced, and thus your trick won't work.

Allowing runtime stack snooping, and other forms of introspection, will also make this problematic, as will many forms of operator overloading. (Since an overloaded operator function may retain a reference to either itself or its parameter)

It's useful, but there are a number of language characteristics that will kill static analysis dead. Unfortunately.

As for Parrot's registers, they're in a fixed location, in the interpreter structure. There's a backing stack for them, so when registers are saved on that stack they get copied into a stack frame. They don't move, though, which speeds up both the interpreter and JIT, which is a nice win.

Posted by: Dan at July 23, 2003 12:56 PM

Thanks for putting my mind at rest, and sorry for the duplicate post earlier.

Posted by: buk at July 23, 2003 04:40 PM

A couple of additional points that I want to add to this excellent article. I've been doing some research recently on the speed differences between GC and manual memory management. While it always depends on the specific application and the patterns of memory use, the argument that GC is slower or more costly than manual memory management simply doesn't hold in most situations. Here are the reasons I've found:

First and most importantly, most of the time it simply DOESN'T MATTER. Usually, all other things being equal, GC vs. manual memory management makes NO perceivable difference to the user of the application. As was mentioned in the article, most programs spend only a few percent of their time in memory management. In addition, most programs leave the CPU idle 99% of the time. The logical conclusion is that a slight difference in CPU usage between memory management methods won't make any noticeable difference in performance. This has been confirmed in many real-world studies. There are exceptions, but this covers the vast majority of modern desktop computer applications. This reason alone should be enough to make 95% of the world's programmers stop reading right now and switch to a GC-based programming environment, if at all possible in their situation. (One rebuttal to this is that GC causes the system to become unresponsive during collection, but this simply isn't true anymore -- a well written, modern GC is incremental and will not cause noticeable pauses in any but the most unreasonable situations. malloc/free also have the potential to cause pauses in worst-case situations, so there isn't any significant performance difference between malloc/free and GC.)

While 95% of you should be convinced already, there are some people for whom the above argument is insufficient. Those who write OS kernel code or performance-critical systems have a reason to care. Other people are simply interested in more information. For these people, I make the argument that GC is not a clear loser even in terms of raw performance. Again, it depends on the application's memory usage patterns and the capabilities of the system, but in many cases GC will IMPROVE performance over manual memory management. To start out, here are the common arguments against GC:

1. Allocation might trigger a collection, which means it won't work for real-time needs.
2. You never know when a collection will cause the application to pause.
3. I need the performance. GC is too slow.

1. With GC, an allocation might trigger a collection, so it might be slow. However, this isn't a valid argument against GC. NO dynamic memory scheme has any kind of hard time limits on allocation/deallocation. malloc has to search through memory to find a best fit, and sometimes this can take quite a while. RefCounting is a layer over malloc, so it has the same limitations. In addition, all allocators can cause heap expansion, and all allocators can trigger one or many page faults. These costs are MUCH greater than the cost of a collection (except on poorly written GCs, or GCs that have to deal with hundreds of megabytes of heap at once). So the argument that GC's allocator is more unpredictable than malloc's is pointless. If the unpredictable timing of a GC's allocator is unacceptable, malloc is equally unacceptable, and the mechanisms for working around malloc's limitations apply equally well to GC's allocator.

2. Collections can cause the application to pause. This is essentially true, but not necessarily significant. First, GC strategy has come a LONG way in the past few years.

Posted by: dcook at January 18, 2004 04:34 AM

(previous comment truncated...) Modern GC implementations are usually incremental, so they spread their work out over several partial collections. Second, computers are much faster than they used to be, so even full collections don't cause a noticeable pause except in worst-case scenarios. Finally, applications are already pausing for all kinds of other things - virtual memory paging operations, background tasks, etc. The pauses caused by a garbage collector are usually insignificant when compared with these.

3. People argue that GC is slower than manual memory management. While in some cases this is true, it often simply isn't the case. GC can be FASTER than manual memory allocation, depending on how the application uses memory. First, malloc isn't free. In many cases, the total amount of CPU time spent in malloc/free can be equal to or greater than the CPU time spent in garbage collection. It depends on the size and pattern of the allocations: malloc/free is faster for occasional large allocations, while a GC is usually FASTER than malloc/free for frequent small allocations. A custom layer on top of malloc/free can help speed the worst-case scenarios, but even with custom allocators, GC and malloc/free are evenly matched in terms of total raw CPU cycles required. Second, for most applications, page faults and cache behavior are significantly more important to performance than raw CPU time. In these areas, there are some significant advantages to GC. Third, with manual memory management, there is a lot of code involved in tracking memory. This code isn't needed for GC, so your app can be smaller and faster without it.

The bottom line is that for most applications, GC will not cause any performance problems, and might even IMPROVE performance. The results vary from situation to situation, but in most cases GC is a clear winner. Even if performance is important, consider trying a GC package and do some profiling to see what impact it has on your performance. You might be surprised.

Posted by: dcook at January 18, 2004 04:41 AM

Dan, thank you for your hard work on Parrot and for these enlightening "What the heck is ..." posts.

Like Buk, above, I've got a "Why not ...?" question, that I was wondering if you could shed some light on. It has to do with dead-object detection. You described the basic algorithm in your article: first, mark everything dead; second, mark everything reachable from the root set alive; third, delete everything still marked dead.

If each object had a "fad" bit instead of a "live" bit, which was toggled in step two instead of set, it seems that you could skip step one. Step three would be to delete every object that hadn't kept up with the current "fad."

This seems like an obvious optimization--I'd call it the "high school algorithm"--which tells me that I must be missing something. Thanks.
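
In code, the toggling idea would be something like this (invented names, just to show the shape of it):

    #include <stdlib.h>

    typedef struct Object {
        unsigned char  fad;     /* matches current_fad if the object was reached this cycle */
        struct Object *next;
    } Object;

    static unsigned char current_fad = 0;

    /* New objects start out matching the current fad, i.e. marked. */
    Object *new_object(void) {
        Object *obj = calloc(1, sizeof(Object));
        if (obj != NULL)
            obj->fad = current_fad;
        return obj;
    }

    /* Step two: a reached object keeps up with the fad. */
    static void mark_alive(Object *obj) {
        obj->fad = current_fad;
    }

    /* Step three: anything that hasn't kept up is dead. */
    static int is_dead(const Object *obj) {
        return obj->fad != current_fad;
    }

    /* Step one disappears: flipping the bit "unmarks" every object at once. */
    static void begin_collection(void) {
        current_fad ^= 1;
    }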

Posted by: Sam at June 8, 2004 10:30 AM