Saturday, May 10, 2008

Garbage collection throughput improvements

The problem


About a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.

Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.

It turns out that every minor GC was taking at least a few milliseconds, which is too much. Most of that time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and a 64 byte card size, that's a million array entries to check.
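
To make the cost concrete, the old scan amounted to something like the following loop. This is a sketch rather than the actual runtime code, written in terms of the names (first_card, last_card, mask, unmask, collect_card) that appear in the snippets below:

/* Sketch of the per-card scan: every card in the heap is examined
   individually, even when none of them are marked. */
F_CARD *ptr;

for(ptr = first_card; ptr < last_card; ptr++)
{
    if(*ptr & mask)
    {
        collect_card(ptr,gen,here);
        *ptr &= ~unmask;
    }
}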

Faster card scanning


The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time and checking each byte against the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because it's a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);

for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
    if(*quad_ptr & quad_mask)
    {
        F_CARD *ptr = (F_CARD *)quad_ptr;

        int card;
        for(card = 0; card < 4; card++)
        {
            if(ptr[card] & mask)
            {
                collect_card(&ptr[card],gen,here);
                ptr[card] &= ~unmask;
            }
        }
    }
}

This improved card scan performance by a factor of 3, which makes sense: we do a quarter as many loop iterations, but at some point you still have to hit memory, which damps the gains a bit.

This optimization improved performance, but not by enough. I realized that card scanning hurt the most in benchmarks which didn't write into old-generation objects at all. For example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly; every minor GC then checks all the cards, which fills the cache with junk without achieving anything useful, since no cards are marked.

On the other hand, a pointer recording approach, where all stores are written into a buffer, is also unacceptable, because in code with a high rate of mutation you pay a hefty price every time the buffer overflows (see the sketch below). So I had to find a compromise.
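
For contrast, a pointer recording barrier looks something like this. It is purely illustrative, not something Factor does, and the names (record_store, drain_store_buffer, STORE_BUFFER_SIZE) are made up:

/* Illustrative sketch of a sequential store buffer: every store into the
   heap appends the written-to slot to a buffer, and when the buffer fills
   up, the mutator has to stop and process it -- the "hefty price". */
#define STORE_BUFFER_SIZE 4096

CELL *store_buffer[STORE_BUFFER_SIZE];
CELL **store_buffer_top = store_buffer;

INLINE void record_store(CELL *slot)
{
    *store_buffer_top++ = slot;

    if(store_buffer_top == store_buffer + STORE_BUFFER_SIZE)
    {
        drain_store_buffer(); /* scan the recorded slots, then empty the buffer */
        store_buffer_top = store_buffer;
    }
}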

Introducing card decks


My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.
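
In pseudo-C, the minor GC scan then looks something like the sketch below. The deck-related names (F_DECK, first_deck, last_deck, CARDS_PER_DECK) are made up for illustration, and the real loop also combines this with the quad-word card check from above:

/* Sketch of the two-level scan: decks are checked first, and the cards
   covered by a deck are only touched if the deck itself is marked. */
F_DECK *deck;
CELL deck_index = 0;

for(deck = first_deck; deck < last_deck; deck++, deck_index++)
{
    if(*deck & mask)
    {
        F_CARD *card = first_card + deck_index * CARDS_PER_DECK;
        F_CARD *card_end = card + CARDS_PER_DECK;

        for(; card < card_end; card++)
        {
            if(*card & mask)
            {
                collect_card(card,gen,here);
                *card &= ~unmask;
            }
        }

        *deck &= ~unmask;
    }
}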

This complicates the write barrier: on every store, we now have to mark a card and a deck, not just a card as before. I decided to offset this cost by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card carried an additional piece of information: the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving the object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply writes the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).
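
Expressed as a sketch, the new barrier boils down to two blind stores. The names (cards, decks, data_heap_start, mark_mask) and the exact shift amounts are illustrative, not the runtime's actual definitions:

/* Sketch of the simplified write barrier: on a store into the object at
   `address`, mark its card and its deck with plain stores -- no
   read-modify-write, since the allot marker no longer lives in the card. */
#define CARD_BITS 8   /* 256 byte cards */
#define DECK_BITS 18  /* 256 kilobyte decks */

INLINE void write_barrier(CELL address)
{
    cards[(address - data_heap_start) >> CARD_BITS] = mark_mask;
    decks[(address - data_heap_start) >> DECK_BITS] = mark_mask;
}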

These improvements really helped on benchmarks which allocate large numbers of extremely short-lived objects; on benchmarks which allocate longer-lived objects the improvements were still major, but not as drastic. It turns out that if objects survive long enough to be copied from the nursery into aging space, this copying can begin to dominate running time.

Low-level optimizations


The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
    CELL pointer = *handle;

    if(!immediate_p(pointer) && should_copy(pointer))
        *handle = copy_object(pointer);
}

CELL collect_next(CELL scan)
{
    do_slots(scan,copy_handle);

    if(collecting_gen == TENURED)
        do_code_slots(scan);

    return scan + untagged_object_size(scan);
}
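
For context, do_slots is along these lines; this is a simplified sketch rather than the actual definition:

/* Sketch: apply `iter` to every pointer-bearing slot of the object at
   `obj`. The first cell is the header, so it is skipped, and
   binary_payload_start() gives the offset at which non-pointer data begins. */
typedef void (*SLOT_VISITOR)(CELL *);

INLINE void do_slots(CELL obj, SLOT_VISITOR iter)
{
    CELL *slot = (CELL *)obj + 1;
    CELL *end = (CELL *)(obj + binary_payload_start(obj));

    for(; slot < end; slot++)
        iter(slot);
}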

This is very elegant code; do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all; after all, it is a C compiler, not an ML compiler, and it isn't Sufficiently Smart either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which don't depend on its input parameter, only on global variables whose values are invariant for the duration of the collection:
/* test if the pointer is in the generation being collected, or a younger one */
INLINE bool should_copy(CELL untagged)
{
    if(in_zone(newspace,untagged))
        return false;
    if(collecting_gen == TENURED)
        return true;
    else if(HAVE_AGING_P && collecting_gen == AGING)
        return !in_zone(&data_heap->generations[TENURED],untagged);
    else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
        return in_zone(&nursery,untagged);
    else
    {
        critical_error("Bug in should_copy",untagged);
        return false;
    }
}

So again, I did a bit of hand-optimization. If foo() does not depend on the loop variable or the side effects of the body, then the following are equivalent:
while(A)
{
    if(foo()) B else C;
}

if(foo())
{
    while(A) B;
}
else
{
    while(A) C;
}

In the optimizing compiler literature, this is known as loop unswitching.

The new collect_next() is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
    CELL *obj = (CELL *)scan;
    CELL *end = (CELL *)(scan + binary_payload_start(scan));

    obj++;

    CELL newspace_start = newspace->start;
    CELL newspace_end = newspace->end;

    if(HAVE_NURSERY_P && collecting_gen == NURSERY)
    {
        CELL nursery_start = nursery.start;
        CELL nursery_end = nursery.end;

        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && (pointer >= nursery_start && pointer < nursery_end))
                *obj = copy_object(pointer);
        }
    }
    else if(HAVE_AGING_P && collecting_gen == AGING)
    {
        F_ZONE *tenured = &data_heap->generations[TENURED];

        CELL tenured_start = tenured->start;
        CELL tenured_end = tenured->end;

        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && !(pointer >= newspace_start && pointer < newspace_end)
                && !(pointer >= tenured_start && pointer < tenured_end))
                *obj = copy_object(pointer);
        }
    }
    else if(collecting_gen == TENURED)
    {
        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && !(pointer >= newspace_start && pointer < newspace_end))
                *obj = copy_object(pointer);
        }

        do_code_slots(scan);
    }
    else
        critical_error("Bug in collect_next",0);

    return scan + untagged_object_size(scan);
}

That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.

Benchmarks


The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;

: garbage-loop 150000000 [ garbage drop ] times ;

[ garbage-loop ] time

Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
                Before   After
1mb nursery:    8908     4461
2mb nursery:    6455     4451
4mb nursery:    5336     4582
8mb nursery:    4779     4582

The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;

: garbage-loop 15000000 [ garbage drop ] times ;

[ garbage-loop ] time

The results are similar:
                Before   After
1mb nursery:    10418    2158
2mb nursery:    6364     2178
4mb nursery:    4966     3223
8mb nursery:    4249     3294

An interesting phenomenon shows up after the GC changes: a larger nursery actually begins to hurt performance. Dan and I hypothesized that this is because the larger nursery hurts locality, an effect that the older GC's poorer algorithms used to mask.

The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;

: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;

[ garbage-loop ] time

The size of the aging generation makes a difference here, so I tried the benchmark with several combinations of parameters:
                          Before   After
1mb nursery/2mb aging:    8693     5525
2mb nursery/2mb aging:    7318     5255
4mb nursery/2mb aging:    3939     2628
8mb nursery/2mb aging:    2314     1526
1mb nursery/8mb aging:    3455     1493
4mb nursery/4mb aging:    3024     1894
4mb nursery/4mb aging:    3157     1577
8mb nursery/8mb aging:    1533     1003

The above benchmarks are somewhat unrealistic; however, other benchmarks showed speedups too, some more drastic than others:


Benchmark     Before   After
SHA1          12198    7282
Random        18989    14459
Sort          15685    13476
Raytracer     27700    12966
Mandelbrot    4633     1847

Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only after the change a month ago that GC time began to dominate benchmarks like these. However, I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput under constrained-memory conditions.

Looking ahead


The next big change I will be making to the garbage collector is switching to mark/sweep for the old generation. This will reduce memory consumption by almost 50%, since a copying collector has to reserve a second semi-space to copy the old generation into, whereas mark/sweep collects it in place.

The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.

7 comments:

Anonymous said...

BTW: do you have any timings on bootstrapping? I have the feeling it all compiles a bit faster, but I could be wrong; I have no data on this.

ebb said...

The idea of using marked card decks sounds similar to a technique used in YHC (a Haskell compiler). They store node marks in an external bit-vector. This way, they only have one level of marks but it's still possible to vectorize the scan for marked nodes.

ebb said...

I should mention that you can read about the YHC run-time here:

http://www.haskell.org/haskellwiki/Yhc/RTS/Machine

_ said...

Fantastic. Keep it up.

Anonymous said...

Could it be that this solution is addressing a "should-not-exist-problem"?

The code 1 2 3 3array drop should not produce any garbage at all, as the first three values are immediate values (and not references to objects) and the 3array is not referenced by any other object at all?

Could the garbage collector only "be activated" when it sees an object which contains a reference to another object?

Petter

Slava Pestov said...

Petter: the compiler will indeed optimize away 1 2 3 3array drop, which is why I had to put it in its own non-inline word. That benchmark is not realistic, but as you can see, the other benchmarks such as Raytracer, SHA1, etc. also exhibited improvements, and they are a lot more realistic than the first three.

Anonymous said...

However, GCC doesn't try to optimize higher-order code at all, after all it is a C compiler not an ML compiler, and it isn't Sufficiently Smart Either!

Looks like it's time to change your mind on question 30 from your FAQ. The C way of optimization is always a "hell of a code explosion".