Monday, November 16, 2009

Mark-compact garbage collection for oldest generation, and other improvements

Factor now uses a mark-sweep-compact garbage collector for the oldest generation (known as tenured space), in place of the copying collector that it used before. This reduces memory usage. The mark-sweep collector used for the code heap has also been improved. It now shares code with the data heap collector and uses the same compaction algorithm.

Mark-sweep-compact garbage collection for tenured space

Mark phase

During the mark phase, the garbage collector computes the set of reachable objects by starting from the "roots"; global variables, and objects referenced from runtime stacks. When an object is visited, the mark phase checks if the object has already been marked. If it hasn't yet been marked, it is marked, and any object that it refers to are then also visited in turn. If an object has already been marked, nothing is done. As described above, the algorithm is recursive, which is problematic. There are two approaches to turn it into an iterative algorithm; either objects yet to be visited are pushed on a mark stack, and a top-level loop drains this stack, or a more complicated scheme known as "pointer inversion" is used. I decided against pointer inversion, since the mark stack approach is simpler and I have yet to observe the mark stack grow beyond 128Kb or so anyway. I use an std::vector for the mark stack and it works well enough. The mark stack and the loop that drains it are implemented in full_collector.cpp. There are several approaches to representing the set of objects which are currently marked, also. The two most common are storing a mark bit for each object in the object's header, and storing mark bits in a bitmap off to the side. I chose the latter approach, since it speeds up the sweep and compact phases of the garbage collector, and doesn't require changing object header layout. Each bit in the mark bitmap corresponds to 16 bytes of heap space. For reasons that will become clear, when an object is marked mark bitmap, every bit corresponding to space taken up by the object is marked, not just the first bit. The mark bitmap is implemented in mark_bits.hpp.

Sweep phase

Once the mark phase is complete, the mark bitmap now has an up-to-date picture of what regions in the heap correspond to reachable objects, and which are free. The sweep phase begins by clearing the free list, and then computes a new one by traversing the mark bitmap. For every contiguous range of clear bits, a new entry is added to the free list. This is why I use a mark bitmap instead of mark bits in object headers; if I had used mark bits in headers, then the sweep phase would need to traverse the entire heap, not just the mark bitmap, which has significantly worse locality. The sweep phase is implemented by the free_list_allocator::sweep() method in free_list_allocator.hpp. The key to an efficient implementation of the sweep algorithm is the log2() function in bitwise_hacks.hpp; it is used to find the first set bit in a cell. I use the BSR instruction on x86 and cntlzw on PowerPC. The sweep phase also needs to update the object start offset map used for the card marking write barrier. When collecting a young generation, the garbage collector scans the set of marked cards. It needs to know where the first object in each card is, so that it can properly identify pointers. This information is maintained by the object_start_map class defined in object_start_map.cpp. If an object that happens to be the first object in a cardwas deallocated by the sweep phase, the object start map must be updated to point at a subsequent object in that card. This is done by the object_start_map::update_for_sweep() method.

Compact phase

The compact phase is optional; it only runs if the garbage collector determines that there is too much fragmentation, or if the user explicitly requests it. The compact phase does not require that the sweep phase has been run, only the mark phase. Like the sweep phase, the compact phase relies on the mark bitmap having been computed. Whereas the sweep phase identifies free blocks and adds them to the free list, the compact phase slides allocated space up so that all free space ends up at the end of the heap, in a single block. The compact phase has two steps. The first step computes a forwarding map; a data structure that can tell you the final destination of every heap block. It is easy to see that the final destination of every block can be determined from the number of set bits in the mark bitmap that precede it. Since the forwarding map is consulted frequently -- once for every pointer in every object that is live during compaction -- it is important that lookups are as fast as possible. The forwarding map should also be space-efficient. This completely rules out using a hashtable (with many small objects in the heap, it would grow to be almost as big as the heap itself) or simply scanning the bitmap and counting bits every time (since now compaction will become an O(n^2) algorithm). The correct solution is very simple, and well-known in language implementation circles, but I wasn't aware of it until I studied the Clozure Common Lisp garbage collector. You count the bits set in every group of 32 (or 64) bits in the mark bitmap, building an array of cumulative sums as you go. Then, to count the number of bits that are set up to a given element, you look up the pre-computed population count for the nearest 32 (or 64) bit boundary, and manually compute the population count for the rest. This gives you a forwarding map with O(1) lookup time. This algorithm relies on a fast population count algorithm; I used the standard CPU-independent technique in the popcount() function of bitwise_hacks.hpp. Nowadys, as of SSE 4.2, x86 CPUs even include a POPCNT instruction, but since compaction spends most of its time in memmove(), I didn't investigate if this would offer a speedup. It would require a CPUID check at runtime and the fallback would still need to be there for pre-Intel i7 CPUs and PowerPC, so it didn't seem worth the extra complexity to me. Once the forwarding map has been computed, objects can be moved up and pointers that they contain updated in one pass. A mark-compact cycle takes roughly twice as long as a mark-sweep, which is why I elected not to perform a mark-compact on every full collection. The latter leads to a simpler implementation (no sweep phase, and no free list; allocation in tenured space is done by bumping a pointer just as with a copying collector) however the performance penalty didn't seem worth the minor code size reduction to me.

Code heap compaction

Code heap compaction has been in Factor for a while, in the form of the save-image-and-exit word. This used an inefficient multi-pass algorithm (known as the "LISP-2 compaction algorithm") however since it only ran right before exiting Factor, it didn't matter too much. Now, code heap compaction can happen at any time as a result of heap fragmentation, and uses the same efficient algorithm as tenured space compaction. Compaction moves code around, and doing this at a time other than right before the VM exiting creates a few complications:
  • Return addresses in the callstack need to be updated.
  • If an inline cache misses, a new inline cache is compiled, and the call site for the old cache is patched. Inline cache construction allocates memory and can trigger a GC, which may in turn trigger a code heap compaction; if this occurs, the return address passed into the inline cache miss stub may have moved, and the code to patch the call site needs to be aware of this.
  • If a Factor callback is passed into C code, then moving code around in the code heap may invalidate the callback, and the garbage collector has no way to update any function pointers that C libraries might be referencing.
The solution to the first problem was straightforward and involved adding some additional code to the code heap compaction pass. The second problem is trickier. I added a new code_root smart pointer class, defined in code_roots.hpp, to complement the data_root smart pointer (see my blog post about moving the VM to C++ for details about that). The inline cache compiler wraps the return address in a code_root to ensure that if the code block that contains it is moved by any allocations, the return address can be updated properly. I solved the last problem with a layer of indirection (that's how all problems are solved in CS, right?). Callbacks are still allocated in the code heap, but the function pointer passed to C is actually stored in a separate "callback heap" where every block consists of a single jump instruction and nothing else. When a code heap compaction occurs, code blocks in the code heap might be moved around, and all jumps in the callback heap are updated. Blocks within the callback heap are never moved around (or even deallocated, since that isn't safe either).

New object alignment and tagging scheme

The last major change I wanted to discuss is that objects are now aligned on 16-byte boundaries, rather than 8-byte boundaries. This wastes more memory, but I've already offset most of the increase with some space optimizations, with more to follow. There are several benefits to this new system. First of all, the binary payload of byte array objects now begins on a 16-byte boundary, which allows SIMD intrinsics to use aligned access instructions, which are much faster. Second, it simplifies the machine code for inline caches and megamorphic method dispatch. Since the low 4 bits of every pointer are now clear, this allows all built-in VM types to fit in the pointer tag itself, so the logic to get a class descriptor for an object in a dispatch stub is very simple now; here is pseudo-code for the assembly that gets generated:
cell tag = ptr & 15; if(tag == tuple_tag) tag = ptr[cell - tuple_tag]; ... dispatch on tag ...


Ian H said...

"bitwise_hacks.hpp" links to the wrong place.

Anonymous said...

Why is the compact phase done in two steps?

Zeev said...

RE: log2. Why didn't you use gcc's intrinsic:
And since I didn't see how it's used: how do you deal with the case where input == 0? BSR (and the intrinsic) explicitly state the result is then undefined. Also: x & -x.