Sunday, December 27, 2009

Freeing Factor from gcc's embrace-and-extend C language extensions

I have completed a refactoring of the Factor VM, eliminating usage of gcc-specific language extensions, namely global register variables, and on x86-32, the regparm calling convention. My motivation for this is two-fold.

First of all, I'd like to have the option of compiling the Factor VM with compilers other than gcc, such as Visual Studio. While gcc runs on all of Factor's supported platforms and then some, setting up a build environment using the GNU toolchain on Windows takes a bit of work, especially on 64-bit Windows. Visual Studio will provide an easier alternative for people who wish to build Factor from source on that platform. In the future, I'd also like to try using Clang to build Factor.

The second reason is that the global register variable extension is poorly implemented in gcc. Anyone who has followed Factor development for a while will know that gcc bugs come up pretty frequently, and most of these seem to be related to global register variables. This is quite simply one of the less well-tested corners of gcc, and the gcc developers seem to mostly ignore optimizer bugs triggered by this feature.

The Factor VM used a pair of register variables to hold the data stack and retain stack pointers. These are just ordinary fields in a struct now. Back in the days when Factor was interpreted and the interpreter was part of the VM, a lot of time was spent executing code within the VM itself, and keeping these pointers in registers was important. Nowadays the Factor implementation compiles to machine code even during interactive use, using a pair of compilers called the non-optimizing compiler and optimizing compiler. Code generated by Factor's compilers tends to dominate the performance profile, rather than code in the VM itself. Compiled code can utilize registers in any manner desired, and so it continues to access the data stack and retain stack through registers. To make it work with the modified VM, the compiler generates code for saving and restoring these registers in the VM's context structure before and after calls into the VM.
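
In sketch form, the change looks like this; the register choices and field names below are illustrative, not the VM's actual declarations:

typedef unsigned long cell;

/* Before (sketch): gcc's global register variable extension pinned
   the stack pointers to fixed registers for the whole program:

       register cell *ds asm("esi");    data stack pointer
       register cell *rs asm("edi");    retain stack pointer
*/

/* After (sketch): ordinary struct fields. Compiled Factor code still
   keeps the pointers in registers, but spills them into the context
   structure around calls into the VM. Field names are illustrative. */
struct context {
    cell *datastack;
    cell *retainstack;
    /* ... */
};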

A few functions defined in the VM used gcc's regparm calling convention. Normally, on x86-32, function parameters are passed on the call stack, which is addressed through the esp register; regparm functions instead pass the first 1, 2 or 3 arguments in registers. Whether or not this results in a performance boost is debatable, but my motivation for using this feature was not performance but perceived simplicity. The optimizing compiler would generate calls to these functions, and the machine code generated is a little simpler if it can simply stick parameters in registers instead of storing them on the call stack. Eliminating regparm did not in reality make anything more complex, as only a few locations were affected and the issue was limited to the x86-32 backend.
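
For reference, here is what the extension looks like in gcc; the function below is invented for illustration, not one of the actual VM entry points:

/* gcc's regparm extension on x86-32: with regparm(2), the first two
   arguments arrive in the eax and edx registers instead of on the
   call stack. The function itself is invented for illustration. */
__attribute__((regparm(2)))
void *example_vm_entry(void *arg1, void *arg2)
{
    /* ... VM work ... */
    return arg2;
}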

I'm pretty happy with how the refactoring turned out. It did not seem to affect performance at all, which was not too surprising, since code generated by Factor's optimizing compiler did not change, and the additional overhead surrounding Factor to C calls is lost in the noise.

My goal of getting Factor to build with other compilers is not quite achieved yet, however. While the gcc extensions are gone from C++ code, the VM still has some 800 lines of assembly source using GNU assembler syntax, in the files cpu-x86.32.S, cpu-x86.64.S, and cpu-ppc.S. This code includes fundamental facilities which cannot be implemented in C++, such as the main C to Factor call gate, the low-level set-callstack primitive used by the continuations implementation, and a few other things. The assembly source also has a few auxiliary CPU-dependent functions, for example for saving and restoring FPU flags, and detecting the SSE version on x86.

I plan on eliminating the assembly from the VM entirely, by having the Factor compiler generate this code instead. The non-optimizing compiler can generate things such as the C to Factor call gate. For the remaining assembly routines, such as FPU feature access and SSE version detection, I plan on adding an "inline assembly" facility to Factor itself, much like gcc's asm statement. The result will be that the VM will be a pure C++ codebase, and machine code generation will be entirely offloaded into Factor code, where it belongs. Factor's assembler DSL is much nicer to use than GNU assembler.

Sunday, December 06, 2009

Reducing image size by eliminating literal tables from code heap entries

Introduction

The compiled code heap consists of code blocks which reference other code blocks and objects in the data heap. For example, consider the following word:
: hello ( -- ) "Hello world" print ;

The machine code for this word is the following:
000000010ef55f10: 48b87b3fa40d01000000  mov rax, 0x10da43f7b
000000010ef55f1a: 4983c608 add r14, 0x8
000000010ef55f1e: 498906 mov [r14], rax
000000010ef55f21: 48bb305ff50e01000000 mov rbx, 0x10ef55f30 (hello+0x20)
000000010ef55f2b: e9e0ffabff jmp 0x10ea15f10 (print)

The immediate operand of the first mov instruction (0x10da43f7b) is the address of the string "Hello world" in the data heap. The immediate operand of the last jmp instruction (0x10ea15f10) is the address of the machine code of the print word in the code heap.
Unlike some dynamic language JITs where all references to data and compiled code from machine code are done via indirection tables, Factor embeds the actual addresses of the data in the code. This means that the garbage collector needs to be able to find all pointers in a code heap block (for the "mark" phase of garbage collection), and update them (for the "compact" phase).

Relocation tables

Associated to each code block is a relocation table, which tells the VM what instruction operands contain special values that it must be aware of. The relocation table is a list of triples, packed into a byte array:
  • The relocation type is an instance of the relocation_type enum in instruction_operands.hpp. This value tells the VM what kind of value to deposit in the operand -- possibilities include a data heap pointer, the machine code of a word, and so on.
  • The relocation class is an instance of the relocation_class enum in instruction_operands.hpp. This value tells the VM how the operand is represented -- the instruction format, whether or not it is a relative address, and such.
  • The relocation offset is a byte offset from the start of the code block where the value is to be stored.
Code that needs to inspect relocation table entries uses the each_instruction_operand() method defined in code_block.hpp. This is a template method which can accept any object overloading operator().
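
As a sketch of what such a visitor looks like (each_instruction_operand() is the real method name; the operand type and accessor below are assumptions made for illustration):

/* Sketch: a visitor counting operands of a given relocation type.
   The instruction_operand interface here is an assumption; only
   each_instruction_operand() is taken from the VM source. */
enum relocation_type { RT_EXAMPLE /* stand-in for the real enum */ };

struct instruction_operand {
    relocation_type type;
    relocation_type rel_type() const { return type; }
};

struct relocation_counter {
    unsigned long count;

    relocation_counter() : count(0) {}

    void operator()(const instruction_operand &op)
    {
        if(op.rel_type() == RT_EXAMPLE)
            count++;
    }
};

/* usage sketch:
       relocation_counter counter;
       block->each_instruction_operand(counter);  */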

Literal tables

The next part is what I just finished refactoring; I'll describe the old approach first. The simplest way, and what Factor used until now, is the following. Relocation table entries that expect a parameter, such as those that deposit addresses from the data heap and code heap, take the parameter from a literal table associated to each code block. When the compiler compiles a word, it spits out some machine code and a literal table, and hands these off to the Factor VM. The "mark" phase of the garbage collector traces each code block's literal table, and the "compact" phase, after updating the literal table, stores the operands back in each instruction referenced from the relocation table.

Eliminating literal tables

The problem with the old approach is that the garbage collector doesn't really need the literal table. The address of each data heap object and code block referenced from machine code is already stored in the machine code itself. Indeed, the only thing missing until now was a way to read instruction operands out of instructions. With this in place, code blocks no longer had to hold on to the literal table after being constructed. Each code block's literal table is only used to deposit the initial values into machine code when a code block is first compiled by the compiler. Subsequently, the literal table becomes garbage and is collected by the garbage collector. When tracing code blocks, the garbage collector traverses the instruction operands themselves, using the relocation table alone. In addition to the space savings gained by not keeping these arrays of literals around, another interesting side-effect of this refactoring is that a full garbage collection no longer resets generic word call sites back to the cold call entry point, which would effectively discard all inline caches (read about inline caching in Factor).

Coping with code redefinition

A call to a word is compiled as a direct jump in Factor. This means that if a word is redefined and recompiled, existing call sites need to be updated to point to the new definition. The implementation of this is slightly more subtle now that literal tables are gone. Every code block in the code heap has a reference to an owner object in its header (see the code_block struct in code_blocks.cpp). The owner is either a word or a quotation. Words and quotations, in turn, have a field which references a compiled code heap block. The latter always points at the most recent compiled definition of that object (note that quotations are only ever compiled once, because they are immutable; words, however, can be redefined by reloading source files).

At the end of each compilation unit, the code heap is traversed, and each code block is updated in turn. The code block's relocation table is consulted, and instruction operands which reference compiled code heap blocks are updated. Previously, this was done by overwriting all call targets with entries from the literal table. Now, it is accomplished by looking at the owner object of the current target, and then storing the owner's most recent code block back in the instruction operand. This is implemented in the update_word_references() method defined in code_blocks.cpp. In addition to helping with redefinition, the owner object reference is used to construct call stack traces.
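
In pseudo-C++, the per-call-site fixup looks something like this; update_word_references() is the real method name, while every other name below is an approximation for illustration:

struct object;
struct code_block;
struct operand; /* stands in for the VM's instruction_operand */

/* Illustrative helpers; the real interfaces differ. */
code_block *load_target(operand *op);            /* current jump target */
void store_target(operand *op, code_block *block);
object *owner_of(code_block *block);             /* word or quotation */
code_block *latest_code_block(object *owner);    /* most recent definition */

void update_call_site(operand *op)
{
    code_block *target = load_target(op);
    code_block *latest = latest_code_block(owner_of(target));
    if(latest != target)
        store_target(op, latest); /* re-point the jump at the new code */
}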

Additional space savings in deployed binaries

Normally, every compiled code block references its owner object, so that code redefinition can work. This means that if a word or quotation is live, then the code block corresponding to its most recent definition will be live, and vice versa. In deployed images where the compiler and debugger have been stripped out, words cannot be redefined and stack traces are not needed, so the owner field can be stripped out. This means that a word object can get garbage collected at deploy time even if its compiled code block is called. As it turns out, most words are never used as objects, and can be stripped out in this manner. So the literal table removal has an even bigger effect in deployed images than development images.

Size comparison

The following table shows image sizes before and after this change on Mac OS X x86-64.
Image                       Before    After
Development image           92 MB     90 MB
Minimal development image   8.6 MB    8.2 MB
Deployed hello-ui           2.3 MB    1.5 MB
Deployed bunny              3.5 MB    3.1 MB

Monday, November 16, 2009

Mark-compact garbage collection for oldest generation, and other improvements

Factor now uses a mark-sweep-compact garbage collector for the oldest generation (known as tenured space), in place of the copying collector that it used before. This reduces memory usage. The mark-sweep collector used for the code heap has also been improved. It now shares code with the data heap collector and uses the same compaction algorithm.

Mark-sweep-compact garbage collection for tenured space

Mark phase

During the mark phase, the garbage collector computes the set of reachable objects by starting from the "roots": global variables, and objects referenced from runtime stacks. When an object is visited, the mark phase checks if the object has already been marked. If it hasn't yet been marked, it is marked, and any objects that it refers to are then visited in turn. If an object has already been marked, nothing is done. As described, the algorithm is recursive, which is problematic.

There are two approaches to turning it into an iterative algorithm: either objects yet to be visited are pushed on a mark stack, and a top-level loop drains this stack, or a more complicated scheme known as "pointer inversion" is used. I decided against pointer inversion, since the mark stack approach is simpler and I have yet to observe the mark stack grow beyond 128Kb or so anyway. I use an std::vector for the mark stack and it works well enough. The mark stack and the loop that drains it are implemented in full_collector.cpp.

There are also several approaches to representing the set of objects which are currently marked. The two most common are storing a mark bit for each object in the object's header, and storing mark bits in a bitmap off to the side. I chose the latter approach, since it speeds up the sweep and compact phases of the garbage collector, and doesn't require changing object header layout. Each bit in the mark bitmap corresponds to 16 bytes of heap space. For reasons that will become clear, when an object is marked in the mark bitmap, every bit corresponding to space taken up by the object is marked, not just the first bit. The mark bitmap is implemented in mark_bits.hpp.
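
A condensed sketch of the mark stack approach follows; the real loop lives in full_collector.cpp, and the helper functions here are illustrative stand-ins:

#include <vector>

struct object;

/* Illustrative stand-ins for the real code in mark_bits.hpp and
   full_collector.cpp. */
bool marked_p(object *obj);
void set_marked(object *obj); /* marks every bit the object covers */
void each_slot(object *obj, void (*visit)(object *));

std::vector<object *> mark_stack;

void mark(object *obj)
{
    if(!marked_p(obj))
    {
        set_marked(obj);
        mark_stack.push_back(obj); /* visit its slots later */
    }
}

void drain_mark_stack()
{
    while(!mark_stack.empty())
    {
        object *obj = mark_stack.back();
        mark_stack.pop_back();
        each_slot(obj, mark); /* mark each outgoing pointer */
    }
}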

Sweep phase

Once the mark phase is complete, the mark bitmap has an up-to-date picture of which regions in the heap correspond to reachable objects, and which are free. The sweep phase begins by clearing the free list, and then computes a new one by traversing the mark bitmap. For every contiguous range of clear bits, a new entry is added to the free list. This is why I use a mark bitmap instead of mark bits in object headers; if I had used mark bits in headers, then the sweep phase would need to traverse the entire heap, not just the mark bitmap, which has significantly worse locality. The sweep phase is implemented by the free_list_allocator::sweep() method in free_list_allocator.hpp. The key to an efficient implementation of the sweep algorithm is the log2() function in bitwise_hacks.hpp; it is used to find the first set bit in a cell. I use the BSR instruction on x86 and cntlzw on PowerPC.

The sweep phase also needs to update the object start offset map used by the card marking write barrier. When collecting a young generation, the garbage collector scans the set of marked cards. It needs to know where the first object in each card is, so that it can properly identify pointers. This information is maintained by the object_start_map class defined in object_start_map.cpp. If an object that happens to be the first object in a card was deallocated by the sweep phase, the object start map must be updated to point at a subsequent object in that card. This is done by the object_start_map::update_for_sweep() method.
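
Here is the sweep loop in sketch form, one bit at a time for clarity; the real free_list_allocator::sweep() walks the bitmap a cell at a time using the find-first-set trick just described, and the helper names below are illustrative:

typedef unsigned long cell;

/* Illustrative stand-ins for the real free list allocator. */
bool marked_p(cell bit);          /* read one bit of the mark bitmap */
char *bit_to_address(cell bit);   /* 16 bytes of heap per bitmap bit */
void add_free_block(char *start, cell size);
extern cell bitmap_size;          /* number of bits in the mark bitmap */

void sweep()
{
    const cell none = (cell)-1;
    cell run_start = none; /* no free run in progress */
    for(cell bit = 0; bit < bitmap_size; bit++)
    {
        if(!marked_p(bit))
        {
            if(run_start == none)
                run_start = bit; /* a run of free space begins */
        }
        else if(run_start != none)
        {
            /* run of free space ends: emit one free list entry */
            add_free_block(bit_to_address(run_start), (bit - run_start) * 16);
            run_start = none;
        }
    }
    if(run_start != none) /* trailing free block */
        add_free_block(bit_to_address(run_start), (bitmap_size - run_start) * 16);
}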

Compact phase

The compact phase is optional; it only runs if the garbage collector determines that there is too much fragmentation, or if the user explicitly requests it. The compact phase does not require that the sweep phase has been run, only the mark phase. Like the sweep phase, the compact phase relies on the mark bitmap having been computed. Whereas the sweep phase identifies free blocks and adds them to the free list, the compact phase slides allocated space up so that all free space ends up at the end of the heap, in a single block.

The compact phase has two steps. The first step computes a forwarding map; a data structure that can tell you the final destination of every heap block. It is easy to see that the final destination of every block can be determined from the number of set bits in the mark bitmap that precede it. Since the forwarding map is consulted frequently -- once for every pointer in every object that is live during compaction -- it is important that lookups are as fast as possible. The forwarding map should also be space-efficient. This completely rules out using a hashtable (with many small objects in the heap, it would grow to be almost as big as the heap itself) or simply scanning the bitmap and counting bits every time (since then compaction would become an O(n^2) algorithm). The correct solution is very simple, and well-known in language implementation circles, but I wasn't aware of it until I studied the Clozure Common Lisp garbage collector. You count the bits set in every group of 32 (or 64) bits in the mark bitmap, building an array of cumulative sums as you go. Then, to count the number of bits that are set up to a given element, you look up the pre-computed population count for the nearest 32 (or 64) bit boundary, and manually compute the population count for the rest. This gives you a forwarding map with O(1) lookup time.

This algorithm relies on a fast population count; I used the standard CPU-independent technique in the popcount() function of bitwise_hacks.hpp. Nowadays, as of SSE 4.2, x86 CPUs even include a POPCNT instruction, but since compaction spends most of its time in memmove(), I didn't investigate if this would offer a speedup. It would require a CPUID check at runtime and the fallback would still need to be there for pre-Intel i7 CPUs and PowerPC, so it didn't seem worth the extra complexity to me.

Once the forwarding map has been computed, objects can be moved up and the pointers they contain updated in one pass. A mark-compact cycle takes roughly twice as long as a mark-sweep, which is why I elected not to perform a mark-compact on every full collection. Compacting every time would lead to a simpler implementation (no sweep phase, and no free list; allocation in tenured space would be done by bumping a pointer just as with a copying collector), but the performance penalty didn't seem worth the minor code size reduction to me.
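
Here is the idea in sketch form; popcount() and the 16-bytes-per-bit relationship come from the post, while the other names are mine:

#include <stdint.h>

typedef unsigned long cell;

/* The standard CPU-independent population count (SWAR); the VM's
   version in bitwise_hacks.hpp may differ in detail. */
cell popcount(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;
    return (cell)((x * 0x0101010101010101ull) >> 56);
}

/* counts[i] caches the number of set bits in bitmap cells 0..i-1. */
void compute_counts(const uint64_t *bitmap, cell n_cells, cell *counts)
{
    cell accum = 0;
    for(cell i = 0; i < n_cells; i++)
    {
        counts[i] = accum;
        accum += popcount(bitmap[i]);
    }
}

/* Marked bits strictly preceding `bit`: one array read plus one
   partial popcount, i.e. O(1). Multiplying by 16 bytes per bit gives
   a block's final offset in the compacted heap. */
cell marked_bits_before(const uint64_t *bitmap, const cell *counts, cell bit)
{
    uint64_t below = ((uint64_t)1 << (bit % 64)) - 1;
    return counts[bit / 64] + popcount(bitmap[bit / 64] & below);
}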

Code heap compaction

Code heap compaction has been in Factor for a while, in the form of the save-image-and-exit word. This used an inefficient multi-pass algorithm (known as the "LISP-2 compaction algorithm"), but since it only ran right before exiting Factor, it didn't matter too much. Now, code heap compaction can happen at any time as a result of heap fragmentation, and uses the same efficient algorithm as tenured space compaction. Compaction moves code around, and doing this at a time other than right before the VM exits creates a few complications:
  • Return addresses in the callstack need to be updated.
  • If an inline cache misses, a new inline cache is compiled, and the call site for the old cache is patched. Inline cache construction allocates memory and can trigger a GC, which may in turn trigger a code heap compaction; if this occurs, the return address passed into the inline cache miss stub may have moved, and the code to patch the call site needs to be aware of this.
  • If a Factor callback is passed into C code, then moving code around in the code heap may invalidate the callback, and the garbage collector has no way to update any function pointers that C libraries might be referencing.
The solution to the first problem was straightforward and involved adding some additional code to the code heap compaction pass. The second problem is trickier. I added a new code_root smart pointer class, defined in code_roots.hpp, to complement the data_root smart pointer (see my blog post about moving the VM to C++ for details about that). The inline cache compiler wraps the return address in a code_root to ensure that if the code block that contains it is moved by any allocations, the return address can be updated properly. I solved the last problem with a layer of indirection (that's how all problems are solved in CS, right?). Callbacks are still allocated in the code heap, but the function pointer passed to C is actually stored in a separate "callback heap" where every block consists of a single jump instruction and nothing else. When a code heap compaction occurs, code blocks in the code heap might be moved around, and all jumps in the callback heap are updated. Blocks within the callback heap are never moved around (or even deallocated, since that isn't safe either).
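
As a sketch of the callback heap indirection on x86 (the structure and helper here are illustrative, not the VM's actual declarations):

#include <string.h>

struct code_block;
char *entry_point(code_block *block); /* illustrative accessor */

/* Illustrative: a callback heap block is a single relative jump and
   nothing else. C code holds the stub's address, which never changes;
   a code heap compaction only re-patches the jump displacement. */
struct callback_stub {
    unsigned char code[5]; /* 0xe9 xx xx xx xx: jmp rel32 */
};

void patch_callback(callback_stub *stub, code_block *new_target)
{
    int disp = (int)(entry_point(new_target) - ((char *)stub + 5));
    stub->code[0] = 0xe9;
    memcpy(stub->code + 1, &disp, 4);
}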

New object alignment and tagging scheme

The last major change I wanted to discuss is that objects are now aligned on 16-byte boundaries, rather than 8-byte boundaries. This wastes more memory, but I've already offset most of the increase with some space optimizations, with more to follow. There are several benefits to this new system. First of all, the binary payload of byte array objects now begins on a 16-byte boundary, which allows SIMD intrinsics to use aligned access instructions, which are much faster. Second, it simplifies the machine code for inline caches and megamorphic method dispatch. Since the low 4 bits of every pointer are now clear, this allows all built-in VM types to fit in the pointer tag itself, so the logic to get a class descriptor for an object in a dispatch stub is very simple now; here is pseudo-code for the assembly that gets generated:
cell tag = ptr & 15;
if(tag == tuple_tag)
    tag = ptr[cell - tuple_tag];
/* ... dispatch on tag ... */

Thursday, October 15, 2009

Improved write barriers in Factor's garbage collector

The Factor compiler has been evolving very quickly lately; it has been almost completely rewritten several times in the last couple of years. The garbage collector, on the other hand, hasn't seen nearly as much action. The last time I did any work on it was May 2008, and before that, May 2007. Now more than a year later, I've devoted a week or so to working on it. The result is a cleaner implementation, and improved performance.

Code restructuring

I re-organized the garbage collector code to be more extensible and easier to maintain. I did this by splitting off a bunch of the garbage collector methods from the factor_vm class into their own set of classes. I made extensive use of template metaprogramming to help structure code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. Templates are my favorite C++ feature, and if it wasn't for templates C++ would just be a shitty object-oriented dialect of C.

First up is the collector template class, defined in collector.hpp:
template<typename TargetGeneration, typename Policy> struct collector
This class has two template parameters:
  • TargetGeneration - this is the generation that the collector will be copying objects to. A generation is any class that implements the allot() method.
  • Policy - this is a class that simulates a higher-order function. It implements a should_copy_p() method that tells the collector if a given object should be promoted to the target generation, or left alone.
On its own, the collector class can't do any garbage collection; it just implements methods which trace GC roots: trace_contexts() (traces active stacks), trace_roots() (traces global roots), and trace_handle() (traces one pointer).

Next up is the copying_collector template class, defined in copying_collector.hpp:
template<typename TargetGeneration, typename Policy> struct copying_collector
This class has the same two template parameters as collector; the target generation must define one additional method, next_object_after(). This is used when scanning newly copied objects. This class implements logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection.
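
The core of Cheney's algorithm is a scan pointer chasing the allocation pointer; in sketch form, with next_object_after() as the only name taken from the post:

struct object;

/* Illustrative stand-in: tracing an object's slots may copy more
   objects into the target generation, past the scan pointer. */
void trace_slots(object *obj);

/* Sketch of the Cheney scan: loop until no newly copied objects
   remain to be scanned. */
template<typename TargetGeneration>
void cheney_scan(TargetGeneration *target, object *scan)
{
    while(scan)
    {
        trace_slots(scan);
        scan = target->next_object_after(scan);
    }
}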

Then, there are four subclasses, one for each type of GC pass. Each one subclasses copying_collector and specializes the two template arguments. For example, let's take a look at the declaration of the nursery collector:
struct nursery_collector : copying_collector<aging_space,nursery_policy>
This subclass specializes its superclass to copy objects to aging space, using the following policy class:
struct nursery_policy {
    factor_vm *myvm;

    nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}

    bool should_copy_p(object *untagged)
    {
        return myvm->nursery.contains_p(untagged);
    }
};

That is, any object that is in the nursery will be copied to aging space by the nursery_collector. Other collector subclasses are similar.

This code all uses templates, rather than virtual methods, so every collector pass has a specialized code path generated for it. This gives higher performance, with cleaner code, than is possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.

Partial array card marking

When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set", so that on the next minor collection it can be scanned and its elements considered as GC roots.

Old way

Storing a pointer into an object marked the card containing the header of the object. On a minor GC, all marked cards were scanned; every object whose header was contained in a marked card would be scanned in its entirety.

Problem

Storing a single pointer into an array required the entire array to be scanned on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- scans the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.

New way

Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fit inside the card are scanned. This greatly reduces the burden placed on the GC by mutation of large arrays. The implementation is tricky; I need to spend some time thinking about and simplifying the code, since as it stands the card scanning routine has three nested loops, and two usages of goto!

Implementation

copying_collector.hpp, trace_cards()

New object promotion policy

When aging space is being collected, objects contained in marked cards in tenured space must be traced.

Old way

These cards would be scanned, but could not be unmarked, since the objects they referred to were copied to the other aging semi-space, and would need to be traced on the next aging collection.

The problem

The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection. Furthermore, the objects they reference would likely live on for a long time, since they're referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.

New way

Instead, an aging collection now proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all reachable objects from the aging semi-space to the other semi-space. This promotes objects that are likely to live a long time all the way to tenured space, and scans fewer cards on each aging collection, since more cards can be unmarked.

Implementation

aging_collector.cpp

Faster code heap remembered set

If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.

Old way

When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set in them would have their literals traced by the garbage collector.

New way

The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of running time was spent scanning the code heap during minor collections. Now, the code blocks whose literals refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.
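
In sketch form; write_barrier() and trace_code_heap_roots() are the real names (see the implementation notes below), while the set names and the tracing helper are approximations:

#include <set>

struct code_block;
void trace_literal_references(code_block *block); /* illustrative */

/* Approximate shape of the new code heap remembered set. */
std::set<code_block *> points_to_nursery;
std::set<code_block *> points_to_aging;

void write_barrier(code_block *block)
{
    /* adding one code block is two set insertions, not a heap scan */
    points_to_nursery.insert(block);
    points_to_aging.insert(block);
}

void trace_code_heap_roots(std::set<code_block *> *remembered_set)
{
    std::set<code_block *>::const_iterator iter = remembered_set->begin();
    for(; iter != remembered_set->end(); ++iter)
        trace_literal_references(*iter);
}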

Implementation

code_heap.cpp, write_barrier()
copying_collector.hpp, trace_code_heap_roots()

Faster card marking and object allocation

The compiler's code generator now generates tighter code for common GC-related operations too. A write barrier looks like this in pseudo-C:
cards[(obj - data_heap_start) >> card_bits] = card_mark_mask;
Writing out the pointer arithmetic by hand, we get:
*(cards + ((obj - data_heap_start) >> card_bits)) = card_mark_mask;
Re-arranging some operations (valid because data_heap_start is aligned to a card boundary),
*((obj >> card_bits) + (cards - (data_heap_start >> card_bits))) = card_mark_mask;
Now, the entire expression
(cards - (data_heap_start >> card_bits))
is a constant. Factor stores this in a VM-global variable, named cards_offset. The value used to be loaded from the global variable every time a write barrier would execute. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change. However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:
0x08e3ae84: lea    (%edx,%eax,1),%ebp
0x08e3ae87: shr $0x8,%ebp
0x08e3ae8a: movb $0xc0,0x20a08000(%ebp)
0x08e3ae91: shr $0xa,%ebp
0x08e3ae94: movb $0xc0,0x8de784(%ebp)

Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does float unboxing to make your code even faster):
0x08664a33: mov    $0x802604,%ecx
0x08664a38: mov (%ecx),%eax
0x08664a3a: movl $0x18,(%eax)
0x08664a40: or $0x3,%eax
0x08664a43: addl $0x10,(%ecx)
0x08664a46: movsd %xmm0,0x5(%eax)

Implementation

cpu/x86/x86.factor: %write-barrier, %allot
cpu/ppc/ppc.factor: %write-barrier, %allot

Performance comparison


I compared the performance of a Mac OS X x86-32 build from October 5th, with the latest sources as of today.

Bootstrap time saw a nice boost, going from 363 seconds, down to 332 seconds.

The time to load and compile all libraries in the source tree (load-all) was reduced from 537 seconds to 426 seconds.

Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:
: bench ( -- seq ) 10000000 [ >bignum 1 + ] map ;

The best time out of 5 iterations on the old build was 12.9 seconds. Now, it has gone down to 1.9 seconds.

Wednesday, September 30, 2009

A survey of domain-specific languages in Factor

Factor has good support for implementing mini-languages as libraries. In this post I'll describe the general techniques and look at some specific examples. I don't claim any of this is original research -- Lisp and Forth programmers have been doing DSLs for decades, and recently the Ruby, Haskell and even Java communities are discovering some of these concepts and adding a few of their own to the mix. However, I believe Factor brings some interesting incremental improvements to the table, and the specific combination of capabilities found in Factor is unique.

Preliminaries


How does one embed a mini-language in Factor? Let us look at what goes on when a source file is parsed:
  • The parser reads the input program. This consists of definitions and top-level forms. The parser constructs syntax trees, and adds definitions to the dictionary. The result of parsing is the set of top-level forms in the file.
  • The compiler is run with all changed definitions. The compiler essentially takes syntax trees as input, and produces machine code as output. Once the compiler finishes compiling the new definitions, they are added to the VM's code heap and may be executed.
  • The top level forms in the file are run, if any.

In Factor, all of these stages are extensible. Note that all of this happens when the source file is loaded into memory -- Factor is biased towards compile-time meta-programming.

Extending the parser with parsing words


Parsing words which execute at parse time can be defined. Parsing words can take over the parser entirely and parse custom syntax. All of Factor's standard syntax, such as : for defining words and [ for reading a quotation, is actually defined by parsing words in the syntax vocabulary. Commonly-used libraries such as memoize and specialized-arrays add their own parsing words for custom definitions and data types. These don't qualify as domain-specific languages since they're too trivial, but they're very useful.

Homoiconic syntax


In most mainstream languages, the data types found in the syntax tree are quite different from the data types you operate on at runtime. For example, consider the following Java program:
if(x < 3) { return x + 3; } else { return foo(x); }

This might parse into something like
IfNode(
condition: LessThan(Identifier(x),Integer(3))
trueBranch: ReturnNode(Add(Identifier(x),Integer(3)))
falseBranch: ReturnNode(MethodCall(foo,Identifier(x)))
)

You cannot work with an "if node" in your program, the identifier x does not exist at runtime, and so on.

In Factor and Lisp, the parser constructs objects such as strings, numbers and lists directly, and identifiers (known as symbols in Lisp, and words in Factor) are first-class types. Consider the following Factor code,
x 3 < [ x 3 + ] [ x foo ] if

This parses as a quotation of 6 elements,
  • The word x
  • The integer 3
  • The word <
  • A quotation with three elements, x, 3, and +
  • A quotation with two elements, x, and foo
  • The word if

An array literal, like { 1 2 3 }, parses as an array of three integers; not an AST node representing an array with three child AST nodes representing integers.

The flipside of homoiconic syntax is being able to print out (almost) any data in a form that can be parsed back in; in Factor, the . word does this.

What homoiconic syntax gives you as a library author is the ability to write mini-languages without writing a parser. Since you can input almost any data type with literal syntax, you can program your mini-language in parse trees directly. The mini-language can process nested arrays, quotations, tuples and words instead of parsing a string representation first.


Compile-time macros


All of Factor's data types, including quotations and words, can be constructed at runtime. Factor also supports compile-time macros in the Lisp sense, but unlike Lisp where they are used to prevent evaluation, a Factor macro is called just like an ordinary word, except the parameters have to be compile-time literals. The macro evaluates to a quotation, and the quotation is compiled in place of the macro call.

Everything that can be done with macros can also be done by constructing quotations at runtime and invoking them with call(; macros just provide a speed boost.

Parsing word-based DSLs


Many DSLs have a parsing word as their main entry point. The parsing word either takes over the parser completely to parse custom syntax, or it defines some new words and then lets the main parser take over again.

Local variables


A common misconception among people taking a casual look at Factor is that it doesn't offer any form of lexical scoping or named values at all. For example, Reg Braithwaite authoritatively states on his weblog:
the Factor programming language imposes a single, constraining set of rules on the programmer: programmers switching to Factor must relinquish their local variables to gain Factor’s higher-order programming power.

In fact, Factor supports lexically scoped local variables via the locals vocabulary, and this library is used throughout the codebase. It looks like in a default image, about 1% of all words use lexical variables.

The locals vocabulary implements a set of parsing words which augment the standard defining words. For example, :: reads a word definition where input arguments are stored in local variables, instead of being on the stack:
:: lerp ( a b c -- d )
    a c *
    b 1 c - *
    + ;

The locals vocabulary also supports "let" statements, lambdas with full lexical closure semantics, and mutable variables. The locals vocabulary compiles lexical variable usage down to stack shuffling, and curry calls (for constructing quotations that close over a variable). This makes it quite efficient, especially since in many cases the Factor compiler can eliminate the closure construction using escape analysis. The choice of whether or not to use locals is one that can be made purely on a stylistic basis, since it has very little effect on performance.

Parsing expression grammars


Parsing expression grammars describe a certain class of languages, as well as a formalism for parsing these languages.

Chris Double implemented a PEG library in Factor. The peg vocabulary offers a combinator-style interface for constructing parsers, and peg.ebnf builds on top of this and defines a declarative syntax for specifying parsers.

A simple example of a PEG grammar can be found in Chris Double's peg.pl0 vocabulary. More elaborate examples can be found in peg.javascript (JavaScript parser by Chris Double) and smalltalk.parser (Smalltalk parser by me).

One downside of PEGs is that they have some performance problems; the standard formulation has exponential runtime in the worst case, and the "Packrat" variant that Factor uses runs in linear time but also linear space. For heavy-duty parsing, it appears as if LL and LR parsers are best, and it would be nice if Factor had an implementation of such a parser generator.

However, PEGs are still useful for simple parsing tasks and prototyping, and they are used for many things throughout the Factor codebase.

PEGs can also be used in conjunction with parsing words to embed source code written with custom grammars in Factor source files directly. The next DSL is an example of that.

Infix expressions


Philipp Brüschweiler's infix vocabulary defines a parsing word which parses an infix math expression using PEGs. The result is then compiled down to locals, which in turn compile down to stack code.

Here is a word which solves a quadratic equation ax^2 + bx + c = 0 using the quadratic formula. Infix expressions can only return one value, so this word computes the first root only:
USING: infix locals ;

:: solve-quadratic ( a b c -- r )
    [infix (-b + sqrt(b*b-4*a*c))/(2*a) infix] ;

Note that we're using two mini-languages here; :: begins a definition with named parameters stored in local variables, and [infix parses an infix expression which can access these local variables.

XML literals


Daniel Ehrenberg's XML library defines a convenient syntax for constructing XML documents. Dan describes it in detail in a blog post, with plenty of examples, so I won't repeat it here. The neat thing here is that by adding a pair of parsing words, [XML and <XML, he was able to integrate XML snippets into Factor, with parse-time well-formedness checking, no less.

Dan's XML library is now used throughout the Factor codebase, particularly in the web framework, for both parsing and generating XML. For example, the concatenative.org wiki uses a markup language called "farkup". The farkup markup language, developed by Dan, Doug Coleman and myself, makes heavy use of XML literals. Farkup is implemented by first parsing the markup into an abstract syntax tree, and then converting this to HTML using a recursive tree walk that builds XML literals. We avoid constructing XML and HTML through raw string concatenation; instead we use XML literals everywhere now. This results in cleaner, more secure code.

Compare the design of our farkup library with markdown.py used by reddit.com. The latter is implemented with a series of regular expression hacks and lots of ad-hoc string processing which attempts to produce something resembling HTML in the end. New markup injection attacks are found all the time; there was a particularly clever one involving a JavaScript worm that knocked reddit right down a few days ago. I don't claim that Farkup is 100% secure by any means, and certainly it has had an order of magnitude less testing, but without a doubt centralizing XHTML generation makes it much easier to audit and identify potential injection problems.

C library interface


Our C library interface (or FFI) was quite low-level initially but after a ton of work by Alex Chapman, Joe Groff, and myself, it has quite a DSLish flavor. C library bindings resemble C header files on mushrooms. Here is a taste:
TYPEDEF: int cairo_bool_t

CONSTANT: CAIRO_CONTENT_COLOR HEX: 1000
CONSTANT: CAIRO_CONTENT_ALPHA HEX: 2000
CONSTANT: CAIRO_CONTENT_COLOR_ALPHA HEX: 3000

STRUCT: cairo_matrix_t
{ xx double }
{ yx double }
{ xy double }
{ yy double }
{ x0 double }
{ y0 double } ;

FUNCTION: void
cairo_transform ( cairo_t* cr, cairo_matrix_t* matrix ) ;

The Factor compiler generates stubs for calling C functions on the fly from these declarative descriptions; there is no C code generator and no dependency on a C compiler. In fact, C library bindings are so easy to write that for many contributors, it is their first project in Factor. When Doug Coleman first got involved in Factor, he began by writing a PostgreSQL binding, followed by an implementation of the MD5 checksum. Both libraries have been heavily worked on since then and are still in use.

For complete examples of FFI usage, check out any of the bindings in the Factor source tree.

There are many more usages of the FFI of course. Since Factor has a minimal VM, all I/O, graphics and interaction with the outside world in general is done with the FFI. Search the Factor source tree for source files that use the alien.syntax vocabulary.

GPU shaders


Joe Groff cooked up a nice DSL for passing uniform parameters to pixel and vertex shaders. In his blog post, Joe writes:
The library makes it easy to load and interactively update shaders, define binary formats for GPU vertex buffers, and feed parameters to shader code using Factor objects.

Here is a snippet from the gpu.demos.raytrace demo:
GLSL-SHADER-FILE: raytrace-vertex-shader vertex-shader "raytrace.v.glsl"
GLSL-SHADER-FILE: raytrace-fragment-shader fragment-shader "raytrace.f.glsl"
GLSL-PROGRAM: raytrace-program
raytrace-vertex-shader raytrace-fragment-shader ;

UNIFORM-TUPLE: sphere-uniforms
{ "center" vec3-uniform f }
{ "radius" float-uniform f }
{ "color" vec4-uniform f } ;

UNIFORM-TUPLE: raytrace-uniforms
{ "mv-inv-matrix" mat4-uniform f }
{ "fov" vec2-uniform f }

{ "spheres" sphere-uniforms 4 }

{ "floor-height" float-uniform f }
{ "floor-color" vec4-uniform 2 }
{ "background-color" vec4-uniform f }
{ "light-direction" vec3-uniform f } ;

The GLSL-SHADER-FILE: parsing word tells Factor to load a GLSL shader program. The GPU framework automatically checks the file for modification, reloading it if necessary.

The UNIFORM-TUPLE: parsing word defines a new tuple class, together with methods which destructure the tuple and bind textures and uniform parameters. Uniform parameters are named as such because they define values which remain constant at every pixel or vertex that the shader program operates on.

Instruction definitions in the compiler


This one is rather obscure and technical, but it has made my job easier over the last few weeks. I blogged about it already.

Other examples


The next set of DSLs don't involve parsing words as much as just clever tricks with evaluation semantics.

Inverse


Daniel Ehrenberg's inverse library implements a form of pattern matching by computing the inverse of a Factor quotation. The fundamental combinator, undo, takes a Factor quotation, and executes it "in reverse". So if there is a constructed tuple on the stack, undoing the constructor will leave the slots on the stack. If the top of the stack doesn't match anything that the constructor could've produced, then the inverse fails, and pattern matching can move on to the next clause. This library works by introspecting quotations and the words they contain. Dan gives many details and examples in his paper on inverse.

Help system


Factor's help system uses an s-expression-like markup language. Help markup is parsed by the Factor parser without any special parsing words. A markup element is an array where the first element is a distinguished word and the rest are parameters. Examples:
"This is " { $strong "not" } " a good idea"

{ $list
"milk"
"flour"
"eggs"
}

{ $link "help" }

This markup is rendered either directly on the Factor UI (like in this screenshot) or via HTML, as on the docs.factorcode.org site.

The nice thing about being able to view help in the UI environment is the sheer interactive nature of it. Unlike something like javadoc, there is no offline processing step which takes your source file and spits out rendered markup. You just load a vocabulary into your Factor instance and the documentation is available instantly. You can look at the help for any documented word by simply typing something like
\ append help
in the UI listener. While working on your own vocabulary, you can reload changes to the documentation and see them appear instantly in the UI's help browser.

Finally, it is worth mentioning that because of the high degree of semantic information encoded in documentation, many kinds of mistakes can be caught in an automated fashion. The help lint tool finds inconsistencies between the actual parameters that a function takes, and the documented parameters, as well as code examples that don't evaluate to what the documentation claims they evaluate to, and a few other things.

You won't find a lot of comments in Factor source, because the help system is much more useful. Instead of plain-text comments that can go out of date, why not have rich text with hyperlinks and semantic information?

For examples of help markup, look at any file whose name ends with -docs.factor in the Factor source tree. There are plenty.

x86 and PowerPC assemblers


I put this one last since it's not really a DSL at all, just a nice API. The lowest level of Factor's compiler generates machine code from the compiler's intermediate form in a CPU-specific way. The CPU backends for x86 and PowerPC use the cpu.x86.assembler and cpu.ppc.assembler vocabularies for this, respectively. The way the assemblers work is that they define a set of words corresponding to CPU instructions. Instruction words take operands from the stack -- objects representing registers, immediate values, and, in the case of x86, addressing modes. They then combine the operands and the instruction into a binary opcode, and add it to a byte vector stored in a dynamically-scoped variable. So instead of calling methods on, and passing around, an 'assembler object' as you would in, say, a JIT coded in C++, you wrap the code generation in a call to Factor's make word, and simply invoke instruction words therein. The result looks like assembler source, except it is postfix. Here is an x86 example:
[
    EAX ECX ADD
    XMM0 XMM1 HEX: ff SHUFPS
    AL 7 OR
    RAX 15 [+] RDI MOV
] B{ } make .

When evaluated, the above will print out the following:
B{ 1 200 15 198 193 255 131 200 7 72 137 120 15 }

Of course, B{ is the homoiconic syntax for a byte array. Note the way indirect memory operands are constructed; first, we push the register (RAX) then the displacement (here the constant 15, but register displacement is supported by x86 too). Then we call a word [+] which constructs an object representing the addressing mode [RAX+15].

The rationale for choosing this somewhat funny syntax for indirect operands (there is also a [] word for memory loads without a displacement), rather than some kind of parser hack that allows one to write [RAX] or [R14+RDI] directly, is that in reality the compiler only rarely deals with hard-coded register assignments. Instead, the register allocator makes decisions a level above, and passes them to the code generator. Here is a typical compiler code generation template from the cpu.x86 vocabulary:
M:: x86 %check-nursery ( label temp1 temp2 -- )
    temp1 load-zone-ptr
    temp2 temp1 cell [+] MOV
    temp2 1024 ADD
    temp1 temp1 3 cells [+] MOV
    temp2 temp1 CMP
    label JLE ;

Here, I'm using the locals vocabulary together with the assembler. The temp1 and temp2 parameters are registers and label is, as its name implies, a label to jump to. This snippet generates machine code that checks whether or not the new object allocation area has enough space; if so, it jumps to the label, otherwise it falls through (code to save live registers and call the GC is generated next). The load-zone-ptr word is like an assembler macro here; it takes a register and generates some more code with it.

The PowerPC assembler is a tad more interesting. Since the x86 instruction set is so complex, with many addressing modes and so on, the x86 assembler is implemented in a rather tedious manner. Obvious duplication is abstracted out. However, there is a lot of case-by-case code for different groups of instructions, with no coherent underlying abstraction allowing the instruction set to be described in a declarative way.

On PowerPC, the situation is better. Since the instruction set is a lot more regular (fixed width instructions, only a few distinct instruction formats, no addressing modes), the PowerPC assembler itself is built using a DSL specifically for describing PowerPC instructions:
D: ADDI 14
D: ADDIC 12
D: ADDIC. 13
D: ADDIS 15
D: CMPI 11
D: CMPLI 10
...

The PowerPC instruction format DSL is defined in the cpu.ppc.assembler.backend vocabulary, and as a result the cpu.ppc.assembler vocabulary itself is mostly trivial.

Last words


Usually my blog posts describe recent progress in the Factor implementation, and I tend to write about what I'm working on right now. I'm currently working on code generation for SIMD vector instructions in the Factor compiler. I was going to blog about this instead, but decided not to do it until the SIMD implementation and API settles down some more.

With this post I decided to try something a bit different, and instead just describe an aspect of Factor that interests me, without any of it being particularly breaking news. If you've been following Factor development closely, there is literally nothing in this post that you would not have seen already, but I figured people who don't track the project so closely might appreciate a general survey like this. I'm also thinking of writing a post describing Factor's various high-level I/O libraries. I'd appreciate any feedback, suggestions and ideas on this matter.

Saturday, September 12, 2009

Advanced floating point features: exceptions, rounding modes, denormals, unordered compares, and more

Factor now has a nice library for introspecting and modifying the floating point environment. Joe Groff implemented most of it, and I helped out with debugging and additional floating point comparison operations. All of these features are part of the IEEE floating point specification and are implemented on all modern CPUs; however, few programming languages expose a nice interface for working with them. C compilers typically provide hard-to-use low-level intrinsics and other languages don't tend to bother at all. Two exceptions are the SBCL compiler and the D language.

The new functionality is mostly contained in the math.floats.env vocabulary, with a few new words in math for good measure. The new code is in the repository but it is not completely stable yet; there are still some issues we need to work out on the more obscure platforms.

To follow along with the examples below, you'll need to get a git checkout from the master branch and load the vocabulary in your listener:
USE: math.floats.env

The first two features, floating point exceptions and traps, are useful for debugging numerical algorithms and detecting potentially undesirable situations (NaNs appearing, underflow, overflow, etc).

Floating point exceptions


One of the first things people learn about floating point is that it has "special" values: positive and negative infinity, and not-a-number (NaN) values. These appear as the results of computations where the answer is undefined (division by zero, square root of -1, etc) or the answer is too small or large to be represented as a float (2 to the power of 10000, etc). A less widely-known fact is that when a special value is computed, "exception flags" are set in a hardware register which can be read back in. Most languages do not offer any way to access this functionality.

In Factor, exception flags can be read using the collect-fp-exceptions combinator, which first clears the flags, calls a quotation, then outputs any flags which were set. For example, division by zero sets the division by zero exception flag and returns infinity:
( scratchpad ) [ 1.0 0.0 / ] collect-fp-exceptions . .
{ +fp-zero-divide+ }
1/0.

Dividing 1 by 3 sets the inexact flag, because the result (0.333....) cannot be represented as a float:
( scratchpad ) [ 1.0 3.0 / ] collect-fp-exceptions . .
{ +fp-inexact+ }
0.3333333333333333

The fact that 1/3 does not have a terminating decimal or binary expansion is well-known, however one thing that many beginners find surprising is that some numbers which have terminating decimal expansions nevertheless cannot be represented precisely as floats because they do not terminate in binary (one classic case is 1.0 - 0.9 - 0.1 != 0.0):
( scratchpad ) [ 4.0 10.0 / ] collect-fp-exceptions . .
{ +fp-inexact+ }
0.4

Raising a number to a power that is too large sets both the inexact and overflow flags, and returns infinity:
( scratchpad ) [ 2.0 10000.0 ^ ] collect-fp-exceptions . .
{ +fp-inexact+ +fp-overflow+ }
1/0.

The square root of 4 is an exact value; no exceptions were set:
( scratchpad ) [ 4.0 sqrt ] collect-fp-exceptions . .
{ }
2.0

The square root of 2 is not exact on the other hand:
( scratchpad ) [ 2.0 sqrt ] collect-fp-exceptions . .
{ +fp-inexact+ }
1.414213562373095

Factor supports complex numbers, so taking the square root of -1 returns an exact value and does not set any exceptions:
( scratchpad ) [ -1.0 sqrt ] collect-fp-exceptions . .
{ }
C{ 0.0 1.0 }

However, we can observe the invalid operation exception flag being set if we call the internal fsqrt word, which operates on floats only and calls the libc function (or uses the SQRTSD instruction on SSE2):
( scratchpad ) USE: math.libm [ -1.0 fsqrt ] collect-fp-exceptions . .
{ +fp-invalid-operation+ }
NAN: 8000000000000

I describe the new NAN: syntax later in this post.

Signaling traps


Being able to inspect floating point exceptions set after a piece of code runs is all well and good, but what if you have a tricky underflow bug, or a NaN is popping up somewhere, and you want to know exactly where? In this case it is possible to set a flag in the FPU's control register which triggers a trap when an exception is raised. This trap is delivered to the Factor process as a signal (Unix), Mach exception (Mac OS X), or SEH exception (Windows). Factor then throws it as an exception which can be caught using any of Factor's error handling words, or just left unhandled in which case it will bubble up to the listener.

The with-fp-traps combinator takes a list of traps and runs a quotation with those traps enabled; when the quotation completes (or throws an error) the former FPU state is restored again (indeed it has to be this way, since running the Factor UI's rendering code with traps enabled quickly kills it). The all-fp-exceptions word is equivalent to specifying { +fp-invalid-operation+ +fp-overflow+ +fp-underflow+ +fp-zero-divide+ +fp-inexact+ }. Here is an example:
( scratchpad ) all-fp-exceptions [ 0.0 0.0 / ] with-fp-traps
Floating point trap

Without the combinator wrapped around it, 0.0 0.0 / simply returns a NaN value without throwing anything.

Rounding modes


Unlike exceptions and traps, which do not change the result of a computation but merely set status flags (or interrupt it), the next two features, the rounding mode and denormal mode, actually change the results of computations. As with exceptions and traps, they are implemented as scoped combinators rather than global state changes to ensure that code using these features is 'safe' and cannot change floating point state of surrounding code.

If a floating point operation produces an inexact result, there is the question of how the result will be rounded to a value representable as a float. There are four rounding modes in IEEE floating point:
  • +round-nearest+
  • +round-down+
  • +round-up+
  • +round-zero+

Here is an example of an inexact computation done with two different rounding modes; the default (+round-nearest+) and +round-up+:
( scratchpad ) 1.0 3.0 / .
0.3333333333333333
( scratchpad ) +round-up+ [ 1.0 3.0 / ] with-rounding-mode .
0.3333333333333334


Denormals


Denormal numbers are numbers where the exponent field consists of zero bits (the minimum value) but the mantissa is not all zeros. Denormal numbers are undesirable because they have lower precision than normal floats, and on some CPUs computations with denormals are less efficient than with normals. IEEE floating point supports two denormal modes: you can elect to have denormals "flush" to zero (+denormal-flush+), or you can "keep" denormals (+denormal-keep+). The latter is the default. In the example below, the bit pattern 2^51 has an all-zero exponent field and a single mantissa bit set, so bits>double decodes it to a tiny denormal; adding 0.0 leaves it alone in keep mode, but flushes it to zero in flush mode:
( scratchpad ) +denormal-flush+ [ 51 2^ bits>double 0.0 + ] with-denormal-mode .
0.0
( scratchpad ) 51 2^ bits>double 0.0 + .
1.112536929253601e-308


Ordered and unordered comparisons


In math, for any two numbers a and b, one of the following three properties holds:
  • a < b
  • a = b
  • a > b

In floating point, there is a fourth possibility: a and b may be unordered. This occurs if one of the two values is a NaN. The unordered? predicate tests for this possibility:
( scratchpad ) NAN: 8000000000000 1.0 unordered? .
t

If an ordered comparison word such as < or >= is called with two values which are unordered, it returns f and sets the +fp-invalid-operation+ exception:
( scratchpad ) NAN: 8000000000000 1.0 [ < ] collect-fp-exceptions . .
{ +fp-invalid-operation+ }
f

If traps are enabled this will throw an error:
( scratchpad ) NAN: 8000000000000 1.0 { +fp-invalid-operation+ } [ < ] with-fp-traps    
Floating point trap

If your numerical algorithm has a legitimate use for NaNs, and you wish to run it with traps enabled, and have certain comparisons not signal traps when inputs are NaNs, you can use unordered comparisons in those cases instead:
( scratchpad ) NAN: 8000000000000 1.0 [ u< ] collect-fp-exceptions . .
{ }
f

Unordered versions of all the comparisons are now defined: u<, u<=, u>, and u>=. Equality of numbers is always an unordered comparison, so it does not raise a trap if one of the inputs is a NaN. In particular, if both inputs are NaNs, equality always returns f:
( scratchpad ) NAN: 8000000000000 dup [ number= ] collect-fp-exceptions . .
{ }
f


Half-precision floats


Everyone and their drunk buddy knows about IEEE single (32-bit) and double (64-bit) floats; IEEE also defines half-precision 16-bit floats. These are not used nearly as much; they come up in graphics programming, for example, since GPUs use them for certain calculations with color components where you don't need more accuracy. The half-floats vocabulary provides some support for working with half-floats. It defines a pair of words for converting Factor's double-precision floats to and from half-floats, as well as C type support for passing half-floats to C functions via FFI, and for building packed arrays of half-floats for passing to the GPU.
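
A round trip through half precision shows the reduced accuracy. This is a sketch; the conversion word names float>half and half>float are my assumption, so check the half-floats vocabulary for the actual API:
( scratchpad ) USE: half-floats
( scratchpad ) 0.1 float>half half>float . ! word names assumed
0.0999755859375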

Literal syntax for NaNs and hexadecimal floats


You may have noticed the funny NAN: syntax above. Previously all NaN values would print as 0/0., however this is inaccurate since not all NaNs are created equal; because of how IEEE floating point works, a value is a NaN if the exponent consists of all ones and the mantissa is non-zero. The mantissa is otherwise unspecified, and is known as the "NaN payload". NaNs now print out, and can be parsed back in, using a syntax that makes the payload explicit. A NaN can also be constructed with an arbitrary payload using the <fp-nan> word:
( scratchpad ) HEX: deadbeef <fp-nan> .
NAN: deadbeef

The old 0/0. syntax still works; it is shorthand for NAN: 8000000000000, the canonical "quiet" NaN.
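
Going the other way, a NaN's payload can be inspected; I'm assuming here that the accessor is called fp-nan-payload, alongside the other float-bits words in the math vocabulary:
( scratchpad ) NAN: deadbeef fp-nan-payload .h ! accessor name assumed
deadbeef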

Some operations produce NaNs with different payloads:
( scratchpad ) USE: math.libm
( scratchpad ) 2.0 facos .
NAN: 8000000000022

In general, there is very little you can do with the NaN payload.

A more useful feature is hexadecimal float literals. When reading a float from a decimal string, or printing a float to a decimal string, there is sometimes ambiguity due to rounding. No such problem exists with hexadecimal floats, since each hexadecimal digit corresponds exactly to four bits of the float's representation.

An example of printing a number as a decimal and a hexadecimal float:
( scratchpad ) USE: math.constants
( scratchpad ) pi .
3.141592653589793
( scratchpad ) pi .h
1.921fb54442d18p1
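
Parsing presumably uses the same HEX: prefix shown earlier with <fp-nan>; as a sanity check, 1.8 in hexadecimal is 1.5, so with a binary exponent of one (the p1 suffix) it should read back as 3.0:
( scratchpad ) HEX: 1.8p1 .
3.0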

Java supports hexadecimal float literals as of Java 1.5. Hats off to the Java designers for adding this! It would be nice if they would add the rest of the IEEE floating point functionality in Java 7.

Signed zero


Unlike twos-complement integer arithmetic, IEEE floating point has both positive and negative zero. Negative zero arises as the result of computations on very small negative numbers that underflow. It also has applications in complex analysis, because it allows a choice of branch cut to be made. Factor's abs word used to be implemented incorrectly on floats; it would check if the input was negative, and if so multiply it by negative one. This was a problem because negative zero is not less than zero, and so the absolute value of negative zero would be reported as negative zero. The correct implementation of the absolute value function on floats is to simply clear the sign bit. It works properly now:
( scratchpad ) -0.0 abs .
0.0


Implementation


The implementation of the above features consists of several parts:
  • Cross-platform Factor code in the math.floats.env vocabulary implementing the high-level API
  • Assembly code in vm/cpu-x86.32.S, vm/cpu-x86.64.S, and vm/cpu-ppc.S to read and write x87, SSE2 and PowerPC FPU control registers
  • Low-level code in math.floats.env.x86 and math.floats.env.ppc which implements the high-level API in terms of the assembly functions, by calling them via Factor's FFI and parsing control registers into a cross-platform representation in terms of Factor symbols
  • Miscellaneous words for taking floats apart into their bitwise representation in the math vocabulary
  • Compiler support for ordered and unordered floating point compare instructions in compiler.cfg.instructions
  • CPU-specific code generation for ordered and unordered floating point compare instructions in cpu.x86 and cpu.ppc

Wednesday, September 02, 2009

Eliminating some boilerplate in the compiler

Adding new instructions to the low-level optimizer was too hard. Multiple places had to be updated, and I would do all this by hand:
  • The instruction tuple itself is defined in the compiler.cfg.instructions vocabulary with the INSN: parsing word, which defines a tuple class and also creates a word with the same name that constructs the instruction and adds it to the current sequence.
  • Instructions which have a destination register have convenient constructors in compiler.cfg.hats which create a fresh virtual register, create an instruction with this register as the destination, and output the register. So for example, 1 2 ^^add would create an add instruction with a fresh destination register, and output this register. It might be equivalent to something like 0 1 2 ##add.
  • Instructions that use virtual registers must be added to the vreg-insn union, and respond to the methods defs-vreg, uses-vregs, and temp-vregs in compiler.cfg.def-use. This 'def-use' information is used by SSA construction, dead code elimination, copy coalescing, and register allocation, among other things.
  • Methods have to be defined for the instruction in compiler.cfg.renaming.functor. This functor is used to generate code for renaming virtual registers in instructions. The renaming code is used for SSA construction, representation selection, register allocation, among other things.
  • Instructions which use non-integer representations (e.g. operations on floats) must respond to the methods defs-vreg-rep, uses-vreg-reps, and temp-vreg-reps in compiler.cfg.representations.preferred.
  • Instructions must respond to the generate-insn method, defined in compiler.codegen.
  • Instructions which participate in value numbering must define an "expression" variant, and respond to the >expr method defined in compiler.cfg.value-numbering.expressions.

As you can see, this is a lot of duplicated work. I used inheritance and mixins to model relationships between instructions and reduce some of this duplication by defining methods on common superclasses rather than individual instructions where possible, but this never seemed to work out well.

If you look at the latest versions of the source files I linked to above, you'll see that all the repetitive copy-pasted code has been replaced with meta-programming. The new approach extends the INSN: parsing word so now all relevant information is specified in one place. Also there is a new PURE-INSN: parsing word, to mark instructions which participate in value numbering; previously this was done with a superclass. For example,
PURE-INSN: ##add
def: dst/int-rep
use: src1/int-rep src2/int-rep ;

This defines an instruction tuple, an expression tuple, def/use information, representation information, a method for converting instructions to expressions, and a constructor, all at the same time.
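
Presumably the remaining def-use categories fit the same declarative mold; a hypothetical non-pure instruction with a literal operand and a temporary register might be declared as follows (the literal: and temp: clause names are my assumption, following the def:/use: pattern above):
! hypothetical instruction; literal: and temp: clauses are assumed
INSN: ##fictional-load
def: dst/int-rep
use: base/int-rep
literal: offset
temp: temp/int-rep ;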

For the code generator's generate-insn method, not every instruction has a straightforward implementation; some, like GC checks and FFI calls, postpone a fair amount of work until the very last stage of compilation. However, for most instructions, this method simply extracts all the slots from the instruction tuple, then calls a CPU-specific hook in the cpu.architecture vocabulary to generate code. For these cases, a CODEGEN: parsing word sets up the relevant boilerplate:
CODEGEN: ##add %add

is equivalent to
M: ##add generate-insn [ dst>> ] [ src1>> ] [ src2>> ] tri %add ;

This nicely cleans up all the repetition I mentioned in the bullet points at the top.

I've been aware of this boilerplate for a while but wanted the design of the compiler to settle down first. Now that most of the machinery is in place, I feel comfortable cooking up some complex meta-programming to clean things up. Adding new instructions should be easier. I plan on adding some SSE2 vector operations soon, and this was the main motivation behind this cleanup.

How would you do this in other languages? In Lisp, you would use a macro which expands into a bunch of defclass, defmethod, etc forms. In Java, you might use annotations:
public class AddInsn extends Insn {
    @Def @Representation("int") Register dst;
    @Use @Representation("int") Register src1;
    @Use @Representation("int") Register src2;
}

Saturday, August 29, 2009

Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster

My struct arrays benchmark generated a fair amount of discussion on reddit, with some people disputing the benchmark's validity. Certainly I'm not claiming that Factor is faster than Java HotSpot in general (on most tasks it is slower, sometimes much more so); however, I think the benchmark legitimately demonstrates the performance advantage of value semantics over reference semantics.

A few people pointed out that the Java version of the benchmark was spending a lot of time in trigonometric functions, and that Java computes sin and cos "by hand" rather than using x87 FPU instructions. However, the same is also true of the Factor implementation; it's just that none of the Java advocates bothered to check. I don't use x87 instructions either, and call into libc for sin and cos, just like Java. Indeed, Factor's sin and cos are even more heavyweight than Java's, because they also support complex numbers; Factor's compiler converts them to their real-valued equivalents if it can prove the inputs are real.

Despite this, I modified the benchmark to not call trig functions, and instead just initialize each point as (n+1,n+2,(n+3)*(n+3)/2). Factor is still faster by roughly the same margin as before:
Java     829ms (best run out of 8)
Factor   284ms

A few people also mentioned that by only running the test 8 times I wasn't giving HotSpot enough time to "warm up". However, the benchmark does not make any polymorphic calls in the inner loops, so there's really no "warm up" needed at all; and indeed I got the best time on the third iteration.

I improved Factor's code generation for trigonometric function calls. Trig function calls are now treated like instructions by the low-level optimizer, which means they participate in value numbering and do not split basic blocks. Instead, the register allocator spills all live registers at the call site at the very end of code generation. This strategy does not work in general for all FFI calls, because the case where the FFI call invokes a Factor callback needs additional runtime support; however, neither sin nor cos do this. After implementing this, I noticed that my struct-arrays benchmark was calling sin() twice on the same input, and value numbering was folding the second call away. Neat optimization to get "for free".

This code generation change improves performance of both the struct-arrays and partial-sums benchmarks:
Benchmark        Before    After
struct-arrays    1483ms    749ms
partial-sums     1233ms    938ms

Here is the disassembly for benchmark.struct-arrays with this optimization performed, and this is the patch.

For what it's worth, Java HotSpot runs the partial-sums benchmark in 1293ms. Hopefully HotSpot's trig function call performance will receive some attention at some point.

Friday, August 28, 2009

Performance comparison between Factor and Java on a contrived benchmark

Recently, Joe Groff has been working on struct classes, with the aim of completely replacing Factor's existing support for C structure types. The interesting thing about Joe's struct classes is that unlike the clunky old FFI struct support which operated on byte arrays, struct classes really are Factor classes; instances look and feel like Factor objects, there's literal syntax and prettyprinter support, and structure fields are accessed using accessor words that look exactly like Factor tuple slot accessors.

So whereas the old C structure support didn't have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.
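
To give a feel for the syntax, a struct class for the three-float points used in the benchmark below might be declared like this (a sketch with the STRUCT: parsing word from classes.struct; the class name and the exact prettyprinted output are illustrative):
( scratchpad ) USING: accessors classes.struct ;
( scratchpad ) STRUCT: point
    { x float }
    { y float }
    { z float } ;
( scratchpad ) point <struct> 1.0 >>x 2.0 >>y 3.0 >>z .
S{ point { x 1.0 } { y 2.0 } { z 3.0 } }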

What makes struct classes even more useful in performance-sensitive code is the related struct-arrays vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a Factor sequence. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, ...) to be used on scalar data with a very efficient in-memory representation.

It is interesting that Factor's core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.

The performance difference between, say, an array of pointers to boxed floats, and a raw array of floats, cannot be overstated. Further levels of structure make the difference even more dramatic: an array of pointers to objects, where each one has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter case, all data can be stored in one contiguous memory block, which appears as a simple byte array to the garbage collector.

Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation of being very fast as far as managed, GC'd languages go, suffers here. Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object); however, it does not offer arrays of objects where the objects are stored directly in the array. This negatively impacts performance, as you will see.

There is a buzzword that describes this approach of dealing with data: value semantics. If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.

I decided to whip up a simple benchmark. Here is what it does:
  • It works on points, which are triplets of single-precision floats, (x,y,z)
  • First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
  • Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
  • Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

Note that in-place operations and re-using temporary objects are allowed.

Here is the code:

The Factor version is shorter, and has more blank lines.

Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the time word, as follows:
[ "benchmark.struct-arrays" run ] time

Run it a few times so that the data heap can grow to a stable size.

The Java code is self-contained; run it with
java -Xms512m -server silly

The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to 'warm up'.

The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) runs the benchmark in 4.6 seconds, and the Factor version ran in 2.2 seconds using a Factor build from earlier this evening. I made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:
  • Added intrinsics for min and max so that when they are applied to values known to be floating point at compile-time, the SSE2 instructions MINSD and MAXSD are used.
  • Added some optimizations to unbox intermediate <displaced-alien> values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored 'in the middle' of a byte array.

The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. It takes the following optimizations to get good performance out of it:
  • All words declared inline are inlined, all higher-order functions are inlined.
  • Literal quotations (written in square brackets) are inlined at their call sites.
  • Macros, such as sum-outputs are expanded.
  • Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases, these operations, such as adding floats, map directly to machine instructions
  • The incrementing counter for initializing points is converted into a fixed-precision integer because interval and def-use analysis determine it is safe to do so
  • Escape analysis determines that various temporary objects can be stack-allocated:
    • closures created by various combinators, such as tri-curry, each, and so on
    • the struct array object created with <struct-array>
    • the struct class instances created inside the many calls to each
    • the struct created at the end to store the maximum value
  • Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all
  • Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers
  • Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating point registers, except for those that live across the sin/cos calls
  • While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 instruction
  • Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values
  • GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made, and no registers need to be saved and restored
  • The coalescing pass eliminates unnecessary register to register copy instructions that arise from SSA form, as well as ensuring that in most cases, the result of an arithmetic instruction is the same as the first operand; this helps x86 instruction selection
  • The register allocator assigns virtual registers to real CPU registers; in this case there is no spilling at all on x86-64

So what does this say about language expressiveness? Any dynamic language that a) offers the same set of metaprogramming tools as Factor and b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. I invite you to implement this benchmark in your favorite language and share your timings.

Monday, August 24, 2009

New tool for locating external resource leaks

While having garbage collection solves the problem of manually deallocating a block of memory when its lifetime ends, it doesn't help with external resources, such as sockets and file handles, at all. And even in a GC'd language, memory has to be managed manually sometimes, for example when passing data to and from a C library using FFI. So Factor has had a generic disposal protocol, as well as destructors, for quite some time. What was missing was tooling support.

I wanted to build a tool that would help me debug code that was leaking external resources by forgetting to dispose them. Thankfully, this doesn't come up often; as C++ programmers using RAII like to note, scoped destructors solve resource management in 90% of all cases, and Factor's resource management combinators are even more flexible. However, sometimes an external resource can have a complex lifetime, because of caching, pooling, and other advanced idioms. In these cases, having tools to help track leaks down can really help.

I took inspiration from Doug Coleman's managed malloc and free implementation. Whereas some languages use finalizers to ensure that an external resource gets cleaned up if you forget to dispose of it explicitly, I take the opposite approach; all active disposable objects are stored in a central registry, explicitly preventing the GC from cleaning them up.

The new tools.destructors vocabulary introduces two words, disposables. and leaks. The first word prints a tally of all active resources:

Clicking 'list instances' next to a disposable class opens an inspector window with all active instances of this resource; for example, here we can list all file descriptors that Factor has open right now:

You can even dispose of the resource right there, as if you had called the dispose word on it:

The leaks combinator compares active resources before and after a quotation runs, and lists any that were not disposed of. For example, I can construct a file-reader, read a line, and never dispose of it.
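A session along these lines demonstrates the leak; this is a sketch, with an arbitrary file name, and the listing of leaked resources appears in an inspector pane:
( scratchpad ) USING: io io.encodings.ascii io.files tools.destructors ;
( scratchpad ) [ "license.txt" ascii <file-reader> stream-readln drop ] leaks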

Notice how a file descriptor, together with a few associated resources, was leaked as a result of this.

If I wanted to, I could click on 'show instances' and dispose of the input-port in the inspector; this would dispose of the other two associated objects as well. Also, opening the inspector on a disposable resource that was allocated inside a leaks call will display a stack trace, showing where in the code the resource was allocated. This can help pinpoint the offending code.

Of course, the correct way to read a line from a file in Factor is to use a combinator which cleans up the stream for you, which fixes the resource leak.
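Something like the following should do it; with-file-reader scopes the stream, disposing of it even if the quotation throws:
( scratchpad ) [ "license.txt" ascii [ readln ] with-file-reader drop ] leaks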

To define a new disposable resource, simply create a tuple class that subclasses disposable, make sure to construct it with new-disposable, and override the dispose* method.

Here is an example that manages a limited pool of "X"s, of which there are 100 total:
USING: accessors destructors namespaces sequences vectors ;

SYMBOL: xs

! Fill the pool with the integers 0-99 as stand-in "X"s.
100 iota >vector xs set-global

: get-x ( -- x ) xs get pop ;
: return-x ( x -- ) xs get push ;

TUPLE: my-disposable-resource < disposable x ;

: <my-disposable-resource> ( -- disposable )
    my-disposable-resource new-disposable get-x >>x ;

M: my-disposable-resource dispose* x>> return-x ;
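
Client code can then use the standard destructor combinators; for example, with-disposal guarantees that the X is returned to the pool even if the quotation throws (the output assumes a freshly initialized pool):
( scratchpad ) <my-disposable-resource> [ x>> . ] with-disposal
99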

The disposable resource machinery is built with very little code, and it is used throughout Factor already.