Sunday, December 27, 2009

Freeing Factor from gcc's embrace-and-extend C language extensions

I have completed a refactoring of the Factor VM, eliminating usage of gcc-specific language extensions, namely global register variables, and on x86-32, the regparm calling convention. My motivation for this is two-fold.

First of all, I'd like to have the option of compiling the Factor VM with compilers other than gcc, such as Visual Studio. While gcc runs on all of Factor's supported platforms and then some, setting up a build environment using the GNU toolchain on Windows takes a bit of work, especially on 64-bit Windows. Visual Studio will provide an easier alternative for people who wish to build Factor from source on that platform. In the future, I'd also like to try using Clang to build Factor.

The second reason is that the global register variable extension is poorly implemented in gcc. Anyone who has followed Factor development for a while will know that gcc bugs come up pretty frequently, and most of these seem to be related to global register variables. This is quite simply one of the less well-tested corners of gcc, and the gcc developers seem to mostly ignore optimizer bugs triggered by this feature.

The Factor VM used a pair of register variables to hold data stack and retain stack pointers. These are just ordinary fields in a struct now. Back in the days when Factor was interpreted and the interpreter was part of the VM, a lot of time was spent executing code within the VM itself, and keeping these pointers in registers was important. Nowadays the Factor implementation compiles to machine code even during interactive use, using a pair of compilers called the non-optimizing compiler and optimizing compiler. Code generated by Factor's compilers tends to dominate the performance profile, rather than code in the VM itself. Compiled code can utilize registers in any manner desired, and so it continues to access the data stack and retain stack through registers. To make it work with the modified VM, the compiler generates code for saving and restoring these registers in the VM's context structure before and after calls into the VM.
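To make the difference concrete, here is a minimal sketch of the two styles. The names (context, dpush, dpop, primitive_add) and the stack layout are assumptions made for illustration and are not copied from the VM source; the gcc-only form is shown in a comment so that the snippet compiles with any C++ compiler.

```cpp
#include <cassert>
#include <cstdint>

typedef std::intptr_t cell;  // stand-in for a machine-word-sized VM value

// Old style (gcc only), for comparison. A global register variable pins the
// data stack pointer to one machine register for the whole program:
//
//   register cell* ds asm("r14");
//
// Portable style: the stack pointers are ordinary fields in a struct.
struct context {
  cell* datastack;    // points at the top element of the data stack
  cell* retainstack;  // points at the top element of the retain stack
};

// Hypothetical helpers: every access goes through the context pointer.
inline void dpush(context* ctx, cell value) { *++ctx->datastack = value; }
inline cell dpop(context* ctx) { return *ctx->datastack--; }

// Hypothetical primitive: replace the top two stack elements with their sum.
// Compiled code would store its stack registers into *ctx before calling
// this, and reload them from *ctx after it returns.
void primitive_add(context* ctx) {
  assert(ctx != nullptr);
  cell y = dpop(ctx);
  cell x = dpop(ctx);
  dpush(ctx, x + y);
}
```

The cost of this design is the extra save and restore around each call into the VM, which only matters if such calls are frequent relative to the time spent in compiled code.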

A few functions defined in the VM used gcc's regparm calling convention. Normally, on x86-32, function parameters are always passed on the call stack, whose top is held in the esp register; regparm functions instead pass the first 1, 2 or 3 arguments in registers. Whether or not this results in a performance boost is debatable, but my motivation for using this feature was not performance but perceived simplicity. The optimizing compiler would generate calls to these functions, and the machine code generated is a little simpler if it can simply stick parameters in registers instead of storing them on the call stack. Eliminating regparm did not in reality make anything more complex, as only a few locations were affected and the issue was limited to the x86-32 backend only.
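For readers who have not seen the attribute, here is roughly what the change looks like at the source level. The macro and function names are made up for this sketch; the attribute is only applied when building with a gcc-compatible compiler for x86-32, so the snippet compiles anywhere.

```cpp
#include <cassert>

// Hypothetical macro: expands to gcc's regparm attribute on x86-32 only,
// and to nothing on every other compiler and architecture.
#if defined(__GNUC__) && defined(__i386__)
#define VM_REGPARM(n) __attribute__((regparm(n)))
#else
#define VM_REGPARM(n)
#endif

// Before: on x86-32 the caller places both arguments in registers.
VM_REGPARM(2) long helper_regparm(long a, long b) { return a * 16 + b; }

// After: the platform's default convention; on x86-32 the caller stores
// both arguments on the call stack before the call instruction.
long helper_plain(long a, long b) { return a * 16 + b; }
```

The two functions behave identically from C++; only the machine code a caller must emit differs, which is why the change was confined to the x86-32 backend.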

I'm pretty happy with how the refactoring turned out. It did not seem to affect performance at all, which was not too surprising, since code generated by Factor's optimizing compiler did not change, and the additional overhead surrounding Factor to C calls is lost in the noise.

My goal of getting Factor to build with other compilers is not quite achieved yet, however. While the gcc extensions are gone from C++ code, the VM still has some 800 lines of assembly source using GNU assembler syntax, in the files cpu-x86.32.S, cpu-x86.64.S, and cpu-ppc.S. This code includes fundamental facilities which cannot be implemented in C++, such as the main C to Factor call gate, the low-level set-callstack primitive used by the continuations implementation, and a few other things. The assembly source also has a few auxiliary CPU-dependent functions, for example for saving and restoring FPU flags, and detecting the SSE version on x86.

I plan on eliminating the assembly from the VM entirely, by having the Factor compiler generate this code instead. The non-optimizing compiler can generate things such as the C to Factor call gate. For the remaining assembly routines, such as FPU feature access and SSE version detection, I plan on adding an "inline assembly" facility to Factor itself, much like gcc's asm statement. The result will be that the VM will be a pure C++ codebase, and the machine code generation will be entirely offloaded into Factor code, where it belongs. Factor's assembler DSL is much nicer to use than GNU assembler.

Sunday, December 06, 2009

Reducing image size by eliminating literal tables from code heap entries


The compiled code heap consists of code blocks which reference other code blocks and objects in the data heap. For example, consider the following word:
: hello ( -- ) "Hello world" print ;

The machine code for this word is the following:
000000010ef55f10: 48b87b3fa40d01000000  mov rax, 0x10da43f7b
000000010ef55f1a: 4983c608              add r14, 0x8
000000010ef55f1e: 498906                mov [r14], rax
000000010ef55f21: 48bb305ff50e01000000  mov rbx, 0x10ef55f30 (hello+0x20)
000000010ef55f2b: e9e0ffabff            jmp 0x10ea15f10 (print)

The immediate operand of the first mov instruction (0x10da43f7b) is the address of the string "Hello world" in the data heap. The immediate operand of the last jmp instruction (0x10ea15f10) is the address of the machine code of the print word in the code heap.
Unlike some dynamic language JITs where all references to data and compiled code from machine code are done via indirection tables, Factor embeds the actual addresses of the data in the code. This means that the garbage collector needs to be able to find all pointers in a code heap block (for the "sweep" phase of garbage collection), and update them (for the "compact" phase).

Relocation tables

Associated with each code block is a relocation table, which tells the VM which instruction operands contain special values that it must be aware of. The relocation table is a list of triples, packed into a byte array:
  • The relocation type is an instance of the relocation_type enum in instruction_operands.hpp. This value tells the VM what kind of value to deposit in the operand -- possibilities include a data heap pointer, the machine code of a word, and so on.
  • The relocation class is an instance of the relocation_class enum in instruction_operands.hpp. This value tells the VM how the operand is represented -- the instruction format, whether or not it is a relative address, and such.
  • The relocation offset is a byte offset from the start of the code block where the value is to be stored.
Code that needs to inspect relocation table entries uses the each_instruction_operand() method defined in code_block.hpp. This is a template method which can accept any object overloading operator().
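The following sketch shows the general shape of such a table and of a visitor-style traversal. It is a simplification under stated assumptions: the bit layout (4 bits of type, 4 bits of class, 24 bits of offset in one 32-bit word), the enum members and the operand_counter visitor are all invented for illustration, and the real definitions in the VM may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative enum members only; not the VM's actual lists.
enum relocation_type { RT_DATA_POINTER = 0, RT_CODE_ENTRY = 1 };
enum relocation_class { RC_ABSOLUTE_64 = 0, RC_RELATIVE_32 = 1 };

// One (type, class, offset) triple packed into 32 bits (assumed layout).
struct relocation_entry {
  std::uint32_t value;
  relocation_entry(relocation_type t, relocation_class c, std::uint32_t offset)
      : value((std::uint32_t(t) << 28) | (std::uint32_t(c) << 24) |
              (offset & 0x00ffffffu)) {}
  relocation_type type() const { return relocation_type(value >> 28); }
  relocation_class klass() const { return relocation_class((value >> 24) & 0xf); }
  std::uint32_t offset() const { return value & 0x00ffffffu; }
};

struct code_block {
  std::vector<std::uint8_t> code;            // machine code bytes
  std::vector<relocation_entry> relocation;  // one entry per special operand

  // Calls visitor(entry, pointer to operand bytes) for every entry.
  // Any object that overloads operator() with this shape will do.
  template <typename Visitor> void each_instruction_operand(Visitor& visitor) {
    for (std::size_t i = 0; i < relocation.size(); i++) {
      assert(relocation[i].offset() <= code.size());
      visitor(relocation[i], code.data() + relocation[i].offset());
    }
  }
};

// Example visitor: counts operands by relocation type.
struct operand_counter {
  int data_pointers = 0;
  int code_entries = 0;
  void operator()(const relocation_entry& entry, std::uint8_t*) {
    if (entry.type() == RT_DATA_POINTER) data_pointers++;
    if (entry.type() == RT_CODE_ENTRY) code_entries++;
  }
};
```

For the hello word above, a table in this style would hold an entry at offset 0x02 (the string address inside the first mov) and one at offset 0x1c (the displacement inside the jmp).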

Literal tables

The next part is what I just finished refactoring, so I'll describe the old approach first. It is the simplest one, and what Factor used until now. Relocation table entries that expect a parameter, such as those that deposit addresses from the data heap and code heap, take the parameter from a literal table associated with each code block. When the compiler compiles a word, it spits out some machine code and a literal table, and hands these off to the Factor VM. In the old approach, the "sweep" phase of the garbage collector traced each code block's literal table, and the "compact" phase, after updating the literal table, stored the operands back in each instruction referenced from the relocation table.

Eliminating literal tables

The problem with the old approach is that the garbage collector doesn't really need the literal table. The address of each data heap object and code block referenced from machine code is already stored in the machine code itself. Indeed, the only thing missing until now was a way to read instruction operands out of instructions. With this in place, code blocks no longer had to hold on to the literal table after being constructed. Each code block's literal table is only used to deposit the initial values into machine code when a code block is first compiled by the compiler. Subsequently, the literal table becomes garbage and is collected by the garbage collector. When tracing code blocks, the garbage collector traverses the instruction operands themselves, using the relocation table alone. In addition to the space savings gained by not keeping these arrays of literals around, another interesting side-effect of this refactoring is that a full garbage collection no longer resets generic word call sites back to the cold call entry point, which would effectively discard all inline caches (read about inline caching in Factor).
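As an illustration of what reading an operand back out of an instruction involves, the sketch below decodes the operands of the hello listing shown earlier. The function names are hypothetical and only two operand encodings are handled (a 64-bit absolute immediate and a 32-bit displacement relative to the end of the operand); the VM has to cope with every relocation class on every supported CPU.

```cpp
#include <cassert>
#include <cstdint>

// Machine code of the hello word, copied from the listing above.
static const std::uint64_t hello_base = 0x10ef55f10ULL;
static const std::uint8_t hello_code[32] = {
    0x48, 0xb8, 0x7b, 0x3f, 0xa4, 0x0d, 0x01, 0x00, 0x00, 0x00,  // mov rax, imm64
    0x49, 0x83, 0xc6, 0x08,                                      // add r14, 0x8
    0x49, 0x89, 0x06,                                            // mov [r14], rax
    0x48, 0xbb, 0x30, 0x5f, 0xf5, 0x0e, 0x01, 0x00, 0x00, 0x00,  // mov rbx, imm64
    0xe9, 0xe0, 0xff, 0xab, 0xff                                 // jmp rel32
};

// Reads a 64-bit little-endian absolute address stored at operand.
std::uint64_t read_absolute_64(const std::uint8_t* operand) {
  assert(operand != nullptr);
  std::uint64_t value = 0;
  for (int i = 7; i >= 0; i--) value = (value << 8) | operand[i];
  return value;
}

// Reads a 32-bit little-endian signed displacement stored at operand and
// converts it to an absolute target. On x86 the displacement is relative to
// the address just past the 4 operand bytes.
std::uint64_t read_relative_32(const std::uint8_t* operand,
                               std::uint64_t operand_address) {
  assert(operand != nullptr);
  std::uint32_t raw = 0;
  for (int i = 3; i >= 0; i--) raw = (raw << 8) | operand[i];
  std::int64_t displacement = (raw & 0x80000000u)
                                  ? std::int64_t(raw) - 0x100000000LL
                                  : std::int64_t(raw);
  return operand_address + 4 + static_cast<std::uint64_t>(displacement);
}
```

With these, read_absolute_64(hello_code + 0x02) recovers the address of the string, and read_relative_32(hello_code + 0x1c, hello_base + 0x1c) recovers the entry point of print, without consulting any literal table.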

Coping with code redefinition

A call to a word is compiled as a direct jump in Factor. This means that if a word is redefined and recompiled, existing call sites need to be updated to point to the new definition. The implementation of this is slightly more subtle now that literal tables are gone. Every code block in the code heap has a reference to an owner object in its header (see the code_block struct in code_blocks.cpp). The owner is either a word or a quotation. Words and quotations, in turn, have a field which references a compiled code heap block. This field always points at the most recent compiled definition of that object. (Quotations are only ever compiled once, because they are immutable; words, however, can be redefined by reloading source files.) At the end of each compilation unit, the code heap is traversed, and each code block is updated in turn. The code block's relocation table is consulted, and instruction operands which reference compiled code heap blocks are updated. Previously, this was done by overwriting all call targets from the literal table. Now, it is accomplished by looking at the owner object of the current target, and then storing the owner's most recent code block back in the instruction operand. This is implemented in the update_word_references() method defined in code_blocks.cpp. In addition to helping with redefinition, the owner object reference is used to construct call stack traces.
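The owner indirection can be sketched as follows. This is a deliberately simplified model and not the VM's code: the struct and function names are made up, and a plain vector of pointers stands in for the call target operands that the real implementation finds through the relocation table and reads out of the machine code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct code_block;

// A word or quotation; code always points at its most recent compiled block.
struct owner_object {
  code_block* code;
};

struct code_block {
  owner_object* owner;                    // header field described above
  std::vector<code_block*> call_targets;  // stand-in for jump/call operands
};

// Re-point every call target in block at the newest definition of whatever
// word or quotation owns the current target.
void update_word_references_sketch(code_block* block) {
  for (std::size_t i = 0; i < block->call_targets.size(); i++) {
    code_block* current = block->call_targets[i];
    assert(current->owner != nullptr);  // holds in development images
    block->call_targets[i] = current->owner->code;
  }
}

// Run at the end of a compilation unit: visit each block in the code heap.
void update_code_heap_sketch(std::vector<code_block*>& heap) {
  for (std::size_t i = 0; i < heap.size(); i++)
    update_word_references_sketch(heap[i]);
}
```

Note that the old block must remain readable until the traversal finishes, since the only route from a stale call target to the new definition goes through the old block's owner field.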

Additional space savings in deployed binaries

Normally, every compiled code block references its owner object, so that code redefinition can work. This means that if a word or quotation is live, then the code block corresponding to its most recent definition will be live, and vice versa. In deployed images where the compiler and debugger have been stripped out, words cannot be redefined and stack traces are not needed, so the owner field can be stripped out. This means that a word object can get garbage collected at deploy time even if its compiled code block is called. As it turns out, most words are never used as objects, and can be stripped out in this manner. So the literal table removal has an even bigger effect in deployed images than development images.

Size comparison

The following table shows image sizes before and after this change on Mac OS X x86-64.
Image                        Before (MB)   After (MB)
Development image            92            90
Minimal development image    8.6           8.2
Deployed hello-ui            2.3           1.5
Deployed bunny               3.5           3.1