Code restructuring
I reorganized the garbage collector code to be more extensible and easier to maintain. I did this by splitting a number of the garbage collector methods off from the factor_vm class into their own set of classes. I made extensive use of template metaprogramming to help structure the code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. Templates are my favorite C++ feature, and if it wasn't for templates, C++ would just be a shitty object-oriented dialect of C.

First up is the collector template class, defined in collector.hpp:

template<typename TargetGeneration, typename Policy> struct collector

This class has two template parameters:
TargetGeneration - the generation that the collector will be copying objects to. A generation is any class that implements the allot() method.

Policy - a class that simulates a higher-order function. It implements a should_copy_p() method that tells the collector whether a given object should be promoted to the target generation or left alone.
The collector class also implements three tracing methods: trace_contexts() (traces active stacks), trace_roots() (traces global roots), and trace_handle() (traces one pointer).
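To make this concrete, here's a toy version of the idea -- not the actual Factor code, just a self-contained illustration of the template technique, with toy_space and range_policy standing in for real generations and policies:

#include <cstddef>
#include <cstring>

struct object { std::size_t size; };

/* Stand-in for a generation: any class with an allot() method qualifies. */
struct toy_space {
	char *here;
	object *allot(std::size_t size)
	{
		object *obj = (object *)here;
		here += size;
		return obj;
	}
};

/* Stand-in policy: copy any object lying inside a given address range. */
struct range_policy {
	char *start, *end;
	bool should_copy_p(object *untagged)
	{
		return (char *)untagged >= start && (char *)untagged < end;
	}
};

template<typename TargetGeneration, typename Policy> struct collector {
	TargetGeneration *target;
	Policy policy;

	collector(TargetGeneration *target_, Policy policy_)
		: target(target_), policy(policy_) {}

	/* Trace one pointer: if the policy approves, copy the object into
	   the target generation and update the pointer in place. (A real
	   collector also installs a forwarding pointer so that each object
	   is copied only once.) */
	void trace_handle(object **handle)
	{
		object *obj = *handle;
		if(policy.should_copy_p(obj))
		{
			object *copy = target->allot(obj->size);
			std::memcpy(copy, obj, obj->size);
			*handle = copy;
		}
	}
};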
Next up is the copying_collector template class, defined in copying_collector.hpp:

template<typename TargetGeneration, typename Policy> struct copying_collector

This class has the same two template parameters as collector; the target generation must define one additional method, next_object_after(), which is used when scanning newly copied objects. This class implements the logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection (sketched at the end of this section).

Then, there are four subclasses implementing each type of GC pass:
nursery_collector - copies live objects from the nursery into aging space; defined in nursery_collector.hpp and nursery_collector.cpp

aging_collector - copies live objects from the first aging semi-space to the second aging semi-space; defined in aging_collector.hpp and aging_collector.cpp

to_tenured_collector - copies live objects from aging space into tenured space; defined in to_tenured_collector.hpp and to_tenured_collector.cpp

full_collector - copies live objects from the first tenured semi-space to the second tenured semi-space; defined in full_collector.hpp and full_collector.cpp
Each subclass derives from copying_collector and specializes the two template arguments. For example, let's take a look at the declaration of the nursery collector:

struct nursery_collector : copying_collector<aging_space,nursery_policy>

This subclass specializes its superclass to copy objects to aging space, using the following policy class:
struct nursery_policy {
	factor_vm *myvm;

	nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}

	/* Promote an object if and only if it still lives in the nursery. */
	bool should_copy_p(object *untagged)
	{
		return myvm->nursery.contains_p(untagged);
	}
};
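Hypothetically, wiring up such a subclass is then little more than forwarding the right space and policy to the superclass constructor. The constructor signature and the aging-space accessor below are illustrative guesses, not the actual code:

struct nursery_collector : copying_collector<aging_space,nursery_policy> {
	explicit nursery_collector(factor_vm *myvm_)
		: copying_collector<aging_space,nursery_policy>(
			myvm_,
			myvm_->aging_space_ptr(),  /* hypothetical accessor for the target generation */
			nursery_policy(myvm_)) {}  /* copy only objects still in the nursery */
};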
That is, any object in the nursery will be copied to aging space by the nursery_collector. The other collector subclasses are similar.

This code all uses templates rather than virtual methods, so every collector pass gets its own specialized code path generated for it. This gives higher performance, with cleaner code, than is possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.
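Since copying_collector implements Cheney's algorithm, here's a sketch of what the scan loop boils down to. next_object_after() is the method named earlier; scan_copied_objects() and trace_slots() are illustrative names, not the actual code:

template<typename TargetGeneration, typename Policy>
void copying_collector<TargetGeneration,Policy>::scan_copied_objects(object *scan)
{
	/* Objects copied by trace_handle() land past the scan pointer, so
	   scanning them can append yet more copies; the loop terminates when
	   the scan pointer catches up with the end of the copied region. */
	while(scan)
	{
		trace_slots(scan); /* calls trace_handle() on each slot */
		scan = this->target->next_object_after(scan); /* assumed NULL at the end */
	}
}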
Partial array card marking
When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set", so that on the next minor collection it can be scanned and its elements considered as GC roots.

Old way
Storing a pointer into an object marks the card containing the header of the object. On a minor GC, all marked cards would be scanned, and every object whose header was contained in a marked card would be scanned in full.

Problem
Storing a single pointer into an array means the entire array must be scanned on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- scans the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.

New way
Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fit inside the card is scanned. This greatly reduces the burden placed on the GC by mutation of large arrays. The implementation is tricky; I need to spend some time thinking about and simplifying the code. As it stands, the card scanning routine has three nested loops, and two usages of goto!
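Here's a simplified sketch of the subrange idea. The object layout and helper are illustrative, and the real trace_cards() is considerably hairier:

#include <algorithm>
#include <cstdint>

typedef std::uintptr_t cell;

/* Toy layout: a header word followed by 'count' pointer-sized slots. */
struct toy_array {
	cell count;
	cell *slots() { return (cell *)(this + 1); }
};

const cell card_bits = 8; /* 256-byte cards */
const cell card_size = (cell)1 << card_bits;

/* Visit only the slots of 'obj' whose addresses fall inside one marked
   card, instead of scanning the whole object. Assumes slots and card
   boundaries are both aligned to sizeof(cell). */
template<typename Tracer>
void scan_object_in_card(toy_array *obj, cell card_start, Tracer &trace)
{
	cell first = (cell)obj->slots();
	cell last = first + obj->count * sizeof(cell);
	cell lo = std::max(first, card_start);             /* clamp to card start */
	cell hi = std::min(last, card_start + card_size);  /* clamp to card end */
	for(cell addr = lo; addr < hi; addr += sizeof(cell))
		trace((cell *)addr);
}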
Implementation: copying_collector.hpp, trace_cards()
New object promotion policy
When aging space is being collected, objects contained in marked cards in tenured space must be traced.

Old way
These cards would be scanned, but could not be unmarked, since the objects they referred to were copied to the other aging semi-space and would need to be traced again on the next aging collection.

The problem
The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection. Furthermore, the objects they referenced would likely live on for a long time, since they were referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.

New way
Instead, an aging collection now proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all reachable objects from the first aging semi-space to the second. This promotes objects that are likely to live for a long time all the way to tenured space, and an aging collection scans fewer cards, since more cards can be unmarked.
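In outline, such a pass might look like the following sketch. trace_cards_unmarking() and scan_copied_objects() are illustrative names, while trace_roots() and trace_contexts() are the collector methods described earlier:

void factor_vm::collect_aging()
{
	/* Phase 1: promote aging-space objects referenced from marked cards
	   in tenured space directly to tenured space. Afterwards those cards
	   no longer point into aging space, so they can all be unmarked. */
	to_tenured_collector phase1(this);
	phase1.trace_cards_unmarking();
	phase1.scan_copied_objects();

	/* Phase 2: an ordinary copying pass from the first aging semi-space
	   to the second for everything still reachable. */
	aging_collector phase2(this);
	phase2.trace_roots();
	phase2.trace_contexts();
	phase2.scan_copied_objects();
}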
Implementation: aging_collector.cpp

Faster code heap remembered set
If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.

Old way
When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set would have their literals traced by the garbage collector.

New way
The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of the running time was spent scanning the code heap during minor collections. Now, the code blocks containing literals which refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.
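Here's a minimal sketch of that bookkeeping. The set names and helpers are illustrative, not the actual code_heap.cpp code:

#include <set>

struct code_block;

/* One set per young generation; a code block appears in a set only while
   one of its literals points into that generation. Both sets are empty
   most of the time. */
std::set<code_block *> code_blocks_with_nursery_refs;
std::set<code_block *> code_blocks_with_aging_refs;

/* Code heap write barrier: called when a literal is stored into a block. */
void remember_code_block(code_block *block, bool refers_to_nursery)
{
	if(refers_to_nursery)
		code_blocks_with_nursery_refs.insert(block);
	else
		code_blocks_with_aging_refs.insert(block);
}

/* On a nursery collection, trace only the remembered blocks rather than
   walking the entire code heap. (A fuller version would move still-live
   entries into the aging set instead of just clearing.) */
template<typename Collector>
void trace_code_heap_roots(Collector &c)
{
	std::set<code_block *>::iterator iter = code_blocks_with_nursery_refs.begin();
	for(; iter != code_blocks_with_nursery_refs.end(); ++iter)
		c.trace_literals(*iter);
	code_blocks_with_nursery_refs.clear();
}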
Implementation: code_heap.cpp, write_barrier(); copying_collector.hpp, trace_code_heap_roots()
Faster card marking and object allocation
The compiler's code generator now generates tighter code for common GC-related operations too. A write barrier looks like this in pseudo-C:

cards[(obj - data_heap_start) >> card_bits] = card_mark_mask;

Writing out the pointer arithmetic by hand, we get:

*(cards + ((obj - data_heap_start) >> card_bits)) = card_mark_mask;

Re-arranging some operations (valid because the data heap start is card-aligned):

*((obj >> card_bits) + (cards - (data_heap_start >> card_bits))) = card_mark_mask;

Now, the entire expression

cards - (data_heap_start >> card_bits)

is a constant. Factor stores this in a VM-global variable named cards_offset.
The value used to be loaded from this global variable every time a write barrier executed. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change. However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:

0x08e3ae84: lea (%edx,%eax,1),%ebp
0x08e3ae87: shr $0x8,%ebp
0x08e3ae8a: movb $0xc0,0x20a08000(%ebp)
0x08e3ae91: shr $0xa,%ebp
0x08e3ae94: movb $0xc0,0x8de784(%ebp)
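Reading the sequence back as pseudo-C, the two movb stores apply the same trick at two granularities. Calling the coarser table "decks" below is just a label for illustration, and addr stands for whatever address the lea computes:

cards[addr >> 8]  = 0xc0; /* mark the 256-byte card; table base folded into the displacement */
decks[addr >> 18] = 0xc0; /* the second shr $0xa narrows further: 8 + 10 = 18 bits */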
Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does float unboxing to make your code even faster):
0x08664a33: mov $0x802604,%ecx      # address of the nursery allocation pointer (VM global)
0x08664a38: mov (%ecx),%eax         # load the current allocation pointer
0x08664a3a: movl $0x18,(%eax)       # write the float object's header word
0x08664a40: or $0x3,%eax            # tag the pointer (float tag = 3)
0x08664a43: addl $0x10,(%ecx)       # bump the allocation pointer by 16 bytes
0x08664a46: movsd %xmm0,0x5(%eax)   # store the payload at offset 8, minus the tag of 3
Implementation: cpu/x86/x86.factor: %write-barrier, %allot; cpu/ppc/ppc.factor: %write-barrier, %allot
Performance comparison
I compared the performance of a Mac OS X x86-32 build from October 5th against the latest sources as of today.

Bootstrap time saw a nice boost, going from 363 seconds down to 332 seconds.
The time to load and compile all libraries in the source tree (load-all) was reduced from 537 seconds to 426 seconds.

Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:
: bench ( -- seq ) 10000000 [ >bignum 1 + ] map ;
Each iteration allocates a bignum in the nursery and stores it into a large output array -- exactly the access pattern that made the old card marking scheme quadratic. The best time out of 5 iterations on the old build was 12.9 seconds. Now, it has gone down to 1.9 seconds.
3 comments:
I noticed this interesting instruction in your float unboxing code example:
0x08664a3a: movl $0x18,(%eax)
...
0x08664a46: movsd %xmm0,0x5(%eax)
Alarm bells went off with that 0x5 offset. From the Intel manuals, unaligned accesses are tolerated, but they take twice as many memory accesses:
http://download.intel.com/design/processor/manuals/253665.pdf
(4.1.1) "... The reason for this is that the processor requires two memory
accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access."
(I'm also curious what the byte at 0x4(%eax) is, which is not being updated? The tag byte?)
Is this kind of non-multiple-of-4 offsetting common in generated code? You may see a performance boost by wasting/reorganizing some memory rather than packing it so tightly. Just my 2¢ :)
Ed: I'm aware that unaligned access is slower, but there's no unaligned access going on. The generated code has to subtract the tag from the pointer before dereferencing it; the 0x5 displacement plus the tag of 3 yields an aligned offset of 8.
Hi,

Overall it's a great effort, but I am also curious about the following thing:

0x4(%eax)

Perhaps this was done on an Itanium or MIPS architecture? I am not sure. Would you please shed some light on it?

- J.