Saturday, May 12, 2007

Garbage collector improvements

Factor uses a copying generational garbage collector. When a generation fills up, live objects (determined by scanning roots, and the remembered set; more about this later) are copied to a new semi-space, and the old semi-space is cleared. New objects are allocated in the youngest generation, known as the nursery -- a garbage collection of the nursery only is known as a "minor collection".

Until now, objects would always be promoted to an older generation; the only exception was the oldest generation. Clearly when the oldest generation fills up, there is no older generation to promote live objects to, so instead they stay in this generation.

There was a problem with this strategy. Suppose you have 3 generations. During a long-running calculation which allocates many objects, there will be many minor collections. During a minor collection, most objects in the nursery will probably not be live, and of the few live objects that remain, many will likely survive for a long time. But a certain percentage of the live objects are also short-lived, however they survive because they happen to referenced from the data stack, retain stack, or similar when the garbage collector is invoked. References to these objects are quickly lost by the running program, but they are now in the middle generation. Over time, the middle generation fills up, and the middle generation is then collected. Again, some objects in the nursery appear to be live but they will die very quickly; so now we have some short-lived objects in the oldest generation, the one that takes a long time to garbage collect!

The problem here is that objects were always promoted. Promoting objects from the nursery to the middle generation (which I call aging space) is fine, however what you really want is to keep objects in aging space not until aging space fills up, but until no further garbage collection can free up enough room in aging space for objects incoming from the nursery.

So now objects slowly accumulate in aging space, only making their way into tenured space if there is nowhere else they can go. This greatly reduced the number of full garbage collections. This means many extremely-long lived objects -- word definitions, online help documentation, usage cross-referencing, assorted system structures -- will rarely, if ever have to be scanned by the garbage collector during normal operation. Only a world-changing event, such as reloading core libraries, should unsettle things enough to cause a full garbage collection.


A full garbage collection takes 100 ms; a minor collection can finish in less than a millisecond. Yet, the change to promotion behavior described above actually reduced performance significantly. It took me a couple of days to fully figure out why.

Factor's garbage collector uses card marking to rememeber old to new pointers. When the nursery is being collected, any objects in tenured space which refer to objects from the nursery also have to be scanned. So any time an object is mutated, a "card" is marked; cards have a granularity of 64 bytes. When performing a minor garbage collection, all objects in all marked cards are scanned as roots. With the old garbage collector, full garbage collections (which clear all cards, since all objects end up in tenured space after) were frequent enough that card marking did not create any overhead at all. However if you rarely or never perform a full garbage collection, while running a mutation-heavy operation, more and more cards will get marked. I observed some minor garbage collections would end up scanning as much as 600kb of tenured space due to marked cards; that's a lot of objects to scan!

The problem was that card marking was not granular enough. It would remember the fact that a given card contained an object whose slot referred to an object in aging space or nusery space, but it didn't know which of the two it was. Consider the following scenario.

Object X lives in tenured space. A new object Y is allocated in the nursery, and X is mutated so that one of its slots now holds a reference to Y. When the nursery fills up, a minor collection is initiated, and all marked cards are scanned. Noticing that object X refers to Y, the garbage collector knows that Y is live, and promotes Y to aging space. The card remains marked; a future collection of aging space will again need to scan it. However, the object is no longer in the nursery, and nursery collections can from now ignore this card. But there was no way for the garbage collector to be aware of this fact, so future minor collections would scan this card again, and again, and again, and again, ... because full collections were so rare now, many minor collections were actually scanning thousands of cards.

The fix was to refine the card marking somewhat. A card stores two flags; one indicates that objects in the card refer to objects in the nursery, another flag indicates that objects in the card refer to aging space. When scanning a card during a minor collection, the garbage collector ignores cards which only refer to aging space, and a card which refers to the nursery has this flag cleared after objects in the card are scanned. So back to our example, after object Y makes it to aging space, assuming that X is not mutated again, then further minor collections will never touch object X. A collection of aging space will have to scan X, since it refers to Y; the card is still marked. But it is only marked as holding a reference to aging space, not the nursery.

The result is that now, minor collections only ever scan a handful of cards in tenured space -- or none at all. "Card unmarking" saves the day.

Before this change, the default number of generations was 2. The old promotion behavior never seemed to yield a speedup if more generations were used, so that was always the default. However, any number of generations could be specified on the command line using the -generations switch. Now, things are different. You can only ever elect to have 2 or 3 generations. 2 generations is still the default on the ARM architecture; Factor guesses that ARM machines are memory-constrained, so we trade space for speed there. On all other architectures, 3 generations is the default. When 3 generations are used, the middle generation exhibits the new aging behavior.

There is still one last performance regression I need to track down. It is a constant-factor slowdown which appears even when 2 generations are used. I rewrote some of the GC code in the process of implementing these improvements, and as usual I went for clarity over performance. Since the new GC strategy does not involve an increase in the amount of actual logic which must execute for every scanned object, I'm confident I can get rid of this slowdown.

Even though the new code isn't fully optimized yet, GC time is down on all benchmarks which I tried with the exception of a bootstrap together with a load of all contributed modules. Here the constant factor slowdown I described above bites. When it is fixed, I expect GC time will decrease further, and I will publish some benchmarks.

1 comment:

Anonymous said...

Great post, and nice gc ideas :)