Thursday, October 15, 2009

Improved write barriers in Factor's garbage collector

The Factor compiler has been evolving very quickly lately; it has been almost completely rewritten several times in the last couple of years. The garbage collector, on the other hand, hasn't seen nearly as much action. The last time I did any work on it was May 2008, and before that, May 2007. Now more than a year later, I've devoted a week or so to working on it. The result is a cleaner implementation, and improved performance.

Code restructuring

I re-organized the garbage collector code to be more extensible and easier to maintain. I did this by splitting off a bunch of the garbage collector methods from the factor_vm class into their own set of classes. I made extensive use of template metaprogramming to help structure code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. Templates are my favorite C++ feature, and if it wasn't for templates C++ would just be a shitty object-oriented dialect of C.

First up is the collector template class, defined in collector.hpp:
template<typename TargetGeneration, typename Policy> struct collector
This class has two template parameters:
  • TargetGeneration - this is the generation that the collector will be copying objects to. A generation is any class that implements the allot() method.
  • Policy - this is a class that simulates a higher-order function. It implements a should_copy_p() method that tells the collector if a given object should be promoted to the target generation, or left alone.
On its own, the collector class can't do any garbage collection itself; it just implements methods which trace GC roots: trace_contexts() (traces active stacks), trace_roots() (traces global roots), and trace_handle() (traces one pointer).

Next up is the copying_collector template class, defined in copying_collector.hpp:
template<typename TargetGeneration, typename Policy> struct copying_collector
This class has the same two template parameters as collector; the target generation must define one additional method, next_object_after(). This is used when scanning newly copied objects. This class implements logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection.

Then, there are four subclasses implementing each type of GC pass:Each class subclasses copying_collector and specializes the two template arguments. For example, let's take a look at the declaration of the nursery collector:
struct nursery_collector : copying_collector<aging_space,nursery_policy>
This subclass specializes its superclass to copy objects to tenured space, using the following policy class:
struct nursery_policy {
factor_vm *myvm;

nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}

bool should_copy_p(object *untagged)
return myvm->nursery.contains_p(untagged);

That is, any object that is in the nursery will be copied to aging space by the nursery_collector. Other collector subclasses are similar.

This code all uses templates, rather than virtual methods, so every collector pass will have a specialized code path generated for it. This gives higher performance, with cleaner code, than was is possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.

Partial array card marking

When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set" so that on the next minor collection, so that it can be scanned, and its elements considered as GC roots.

Old way

Storing a pointer into an object marks the card containing the header of the object. On a minor GC, all marked cards are be scanned; if a marked card was bound, then every object whose header is contained in this card would be scanned.


Storing a pointer into an array would necessitate the array to be scanned in its entirety on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- is scanning the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.

New way

Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fit inside the card are scanned. This greatly reduces the burden placed on the GC from mutation of large arrays. The implementation is tricky. I need to spend some time thinking about and simplifying the code, as it stands the card scanning routine has three nested loops, and two usages of goto!


copying_collector.hpp, trace_cards()

New object promotion policy

When aging space is being collected, objects contained in marked cards in tenured space must be traced.

Old way

These cards would be scanned, but could not be unmarked, since the objects they refer to were copied to the other aging semi-space, and would need to be traced on the next aging collection.

The problem

The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection. Furthermore the objects they reference would likely live on for a long time, since they're referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.

New way

Instead, an aging collection proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all reachable objects from aging to second aging semi-space. This promotes objects likely to live for a long time all the way to tenured space, and scans less cards on an aging collection since more cards can get unmarked.



Faster code heap remembered set

If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.

Old way

When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set in them would have their literals traced by the garbage collector.

New way

The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of running time was spent scanning the code heap during minor collections. Now, the list of code blocks containing literals which refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.


code_heap.cpp, write_barrier()
copying_collector.hpp, trace_code_heap_roots()

Faster card marking and object allocation

The compiler's code generator now generates tighter code for common GC-related operations too. A write barrier looks like this in pseudo-C:
cards[(obj - data_heap_start) >> card_bits] = card_mark_mask;
Writing out the pointer arithmetic by hand, we get:
*(cards + (obj - data_heap_start) >> card_bits) = card_mark_mask;
Re-arranging some operations,
*(obj >> card_bits + (cards - data_heap_start >> card_bits) = card_mark_mask;
Now, the entire expression
(cards - data_heap_start >> card_bits)
is a constant. Factor stores this in a VM-global variable, named cards_offset. The value used to be loaded from the global variable every time a write barrier would execute. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change. However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:
0x08e3ae84: lea    (%edx,%eax,1),%ebp
0x08e3ae87: shr $0x8,%ebp
0x08e3ae8a: movb $0xc0,0x20a08000(%ebp)
0x08e3ae91: shr $0xa,%ebp
0x08e3ae94: movb $0xc0,0x8de784(%ebp)

Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does float unboxing to make your code even faster):
0x08664a33: mov    $0x802604,%ecx
0x08664a38: mov (%ecx),%eax
0x08664a3a: movl $0x18,(%eax)
0x08664a40: or $0x3,%eax
0x08664a43: addl $0x10,(%ecx)
0x08664a46: movsd %xmm0,0x5(%eax)


cpu/x86/x86.factor: %write-barrier, %allot
cpu/ppc/ppc.factor: %write-barrier, %allot

Performance comparison

I compared the performance of a Mac OS X x86-32 build from October 5th, with the latest sources as of today.

Bootstrap time saw a nice boost, going from 363 seconds, down to 332 seconds.

The time to load and compile all libraries in the source tree (load-all) was reduced from 537 seconds to 426 seconds.

Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:
: bench ( -- seq ) 10000000 [ >bignum 1 + ] map ;

The best time out of 5 iterations on the old build was 12.9 seconds. Now, it has gone down to 1.9 seconds.


Ed Swartz said...

I noticed this interesting instruction in your float unboxing code example:

0x08664a3a: movl $0x18,(%eax)
0x08664a46: movsd %xmm0,0x5(%eax)

Alarm bells went off with that 0x5 offset. From the Intel manuals, unaligned accesses are tolerated but they have twice as many memory accesses:

(4.1.1) "... The reason for this is that the processor requires two memory
accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access."

(I'm also curious what the byte at 0x4(%eax) is, which is not being updated? The tag byte?)

Is this kind of non-multiple-of-4 offsetting common in generated code? You may see a boost in performance by wasting/reorganizing some memory and not aligning memory so tightly. Just my 2¢ :)

Slava Pestov said...

Ed: I'm aware that unaligned access is slower, but there's no unaligned access going in. It has to subtract the tag from the pointer before dereferencing it.

Webite Designing said...


Overall its great effort but I am also curious about the following thing.


Probably, this was done in Itanium structure or MIPS? I am not sure about it. Would you pleas put some light on it?

- J.
Web Designing