Tuesday, May 26, 2009

Factor's implementation of polymorphic inline caching

Polymorphic inline caching is a technique to speed up method dispatch by generating custom code at method call sites that checks the receiver against only those classes that have come up at that call site. If the receiver's class does not appear in the polymorphic inline cache (this case is known as a "cache miss"), then a new stub is generated which checks for the receiver's class in addition to the classes handled by the previous stub. This proceeds until the stub reaches a maximum size, after which point further cache misses rewrite the caller to call a general "megamorphic" stub which performs full method dispatch. A full description can be found in the original paper on the topic.

I implemented polymorphic inline caching and megamorphic caching to replace Factor's previous method dispatch implementation.

Call sites


A generic word's call site is in one of three states:
  • Cold state - the call instruction points to a stub which looks at the class of the object being dispatched upon, generates a new PIC stub containing a single entry, updates the call site, and jumps to the right method.
  • Inline cache state - the call instruction points to a generated PIC stub which has one or more class checks, followed by a jump to a method if the checks succeed. If none of the checks succeed, it jumps to the inline cache miss routine. The miss routine does one of two things (sketched in pseudocode after this list):
    • If the PIC stub already has the maximum number of entries, it sets the call site to point to the generic word's megamorphic stub, shared by all megamorphic call sites.
    • If the PIC stub has less than the maximum number of entries, a new PIC stub is generated, with the same set of entries as the last one, together with a new entry for the class being dispatched on.
  • Megamorphic state - the call instruction points to the generic word's megamorphic stub. Further dispatches do not modify the call instruction.
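To make the miss routine's behavior concrete, here is a rough C-like sketch of the decision it makes. The helper names (pic_size, max_pic_size, megamorphic_stub, compile_inline_cache) are illustrative rather than the VM's actual functions; set_call_target() is a real function mentioned later in this post:

/* Rough sketch of the PIC miss logic; the helper names are illustrative. */
void handle_pic_miss(void *call_site, void *class, void *generic, int index)
{
    if(pic_size(call_site) == max_pic_size)
    {
        /* Too many classes seen here: share the generic word's megamorphic stub. */
        set_call_target(call_site,megamorphic_stub(generic));
    }
    else
    {
        /* Regenerate the stub with the old entries plus the new class,
           then point the call instruction at the new stub. */
        void *new_pic = compile_inline_cache(call_site,class,generic,index);
        set_call_target(call_site,new_pic);
    }
}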

When code is loaded into the image, or a full garbage collection occurs, all call sites revert to the cold state. Loading code might define new methods or change class relationships, so caches have to be invalidated in that case to preserve correct language semantics. On a full GC, caches are reverted so that serially monomorphic code is optimized better: if a generic word is called in a long loop with one type, then in a long loop with another type, and so on, its call sites would otherwise eventually become megamorphic even though only one class is hot at any given time. Resetting call sites once in a while (a full GC is a relatively rare event) ensures that inline caches better reflect what the code is doing right now. After implementing PICs, I learned that the V8 JavaScript VM uses the same strategy of invalidating call sites once in a while, so it sounds like I did the right thing.

Direct class checks versus subclass checks


Factor supports various forms of subtyping. For example, tuple classes can inherit from other tuple classes, and also, a union class can be defined whose members are an arbitrary set of classes. When a method is defined on a class, it will apply to all subclasses as well, unless explicitly overriden. Checking if an object is an instance of a class X is a potentially expensive operation, because it is not enough to compare direct classes, superclasses and union members have to be inspected also. Polymorphic inline caching only deals with direct classes, because direct class checks are much quicker (for tuples, just compare the tuple layouts for pointer equality; for instances of built-in classes, compare tag numbers). So for example if a generic word has a single method for the sequence class, and a call site calls the generic with arrays and strings, then the inline cache generated there will have two checks, one for strings, and one for arrays, and both checks will point at the same method.
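For that hypothetical call site, the generated cache would look roughly like this in pseudocode (class_of() fetches the object's direct class, as described in the next section; the class and method names are illustrative):

/* Two direct class checks, both dispatching to the same sequence method. */
void *class = class_of(datastack[index]);
if(class == string_class) sequence_method();
else if(class == array_class) sequence_method();
else cache_miss(call_site_address,index,class,generic);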

Inline cache stubs


Inline cache stubs are generated by the VM in the file inline_cache.cpp. This generates machine code, but I'll use C-like pseudocode for examples instead. An inline cache stub looks like this, where index is the parameter number being dispatched on (generic words defined with GENERIC: use 0, generic words defined with GENERIC# can dispatch on any parameter):

void *obj = datastack[index];
void *class = class_of(obj);
if(class == cached_class_1) cached_method_1();
else if(class == cached_class_2) cached_method_2();
...
else cache_miss(call_site_address,index,class,generic);

Getting the direct class of an object (what I call class_of() in the pseudocode) is a relatively cheap operation. I briefly discussed pointer tagging, type headers and tuple layout headers in my C++ VM post. In the most general case, getting the direct class looks like this, again in vague C-like pseudocode:
if(tag(obj) == hi_tag) return obj.header;
else if(tag(obj) == tuple_layout) return obj.layout;
else return tag(obj);

However, if the cached classes consist only of tag classes (the 7 built-in types whose instances can be differentiated based on tag alone) then the generated stub doesn't need to do the other checks; it just compares the pointer tag against the cached classes. Similarly, if tag and tuple classes are in the cache but no hi-tag classes, one branch can be avoided. In total, the PIC generation code uses one of four stubs for getting the object's class, and the right stub to use is determined by looking at the cached classes (see the function determine_inline_cache_type() in inline_cache.cpp).
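For instance, when every cached class is a tag class, the stub's class fetch collapses to reading the pointer tag; in pseudocode (with illustrative names):

/* All cached classes are tag classes, so the pointer tag alone identifies
   the class; no header or tuple layout loads are needed. */
void *obj = datastack[index];
int class = tag(obj); /* just the low 3 bits of the pointer */
if(class == cached_tag_1) cached_method_1();
else if(class == cached_tag_2) cached_method_2();
else cache_miss(call_site_address,index,class,generic);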

Once the stub has been generated, it is installed at the call site and invoked. This way, a PIC miss is transparent to the caller.

Patching call sites


The code for getting and setting jump targets from jump and call instructions is of course CPU-specific; see the get_call_target() and set_call_target() functions for each CPU, and note that on PowerPC, the instruction cache needs to be flushed manually.


Recall that there are two situations in which the PIC miss code can be invoked:
  • For a cold call site, to generate the initial stub
  • For a polymorphic call site, to add a new entry to the PIC

If the call site is a non-tail call, then the return address will be pushed on the stack (x86) or stored in a link register (PowerPC). If the call site is a tail call, then Factor's code generator now moves the current instruction pointer into a register before performing the tail call. This code is generated even when tail-calling words that are not generic; it is a no-op in this case. This is only a single integer load, so other than increasing code size it really has no effect on performance, since on a modern out-of-order CPU the load and jump will execute concurrently. For example, here is how the code snippet beer get length compiles on x86:
0x000000010b8628c0: mov    $0x10b8628c0,%r8
0x000000010b8628ca: pushq $0x20
0x000000010b8628cf: push %r8
0x000000010b8628d1: sub $0x8,%rsp
0x000000010b8628d5: add $0x10,%r14
0x000000010b8628d9: mov $0x10a6c8186,%rax
0x000000010b8628e3: mov %rax,-0x8(%r14)
0x000000010b8628e7: mov $0x100026c60,%rax
0x000000010b8628f1: mov (%rax),%rcx
0x000000010b8628f4: mov %rcx,(%r14)
0x000000010b8628f7: callq 0x10b1e0de0
0x000000010b8628fc: add $0x18,%rsp
0x000000010b862900: mov $0x10b86290f,%rbx
0x000000010b86290a: jmpq 0x10b70f4a0

The first few instructions set up a call stack frame, then the beer symbol is pushed on the stack, and the (inlined) definition of get follows. Finally, the address of the jump instruction is loaded into the rbx register and a jump is performed. Initially, the jump points at the PIC miss stub, so the first time this machine code executes, the jump instruction's target will be changed to a newly-generated PIC.

However, the inline_cache_miss() function in the VM takes the call site address as an input parameter, and after all, it is written in C++. Where does this input parameter actually come from? The answer is that there are short assembly stubs that sit between the actual call site and the PIC miss code in the VM. The stubs get the return address in a CPU-specific manner, and then call the VM. This code is in the primitive_inline_cache_miss procedure, which is again defined separately for each CPU.

Note that the functions have two entry points. The first entry point is taken for non-tail call sites, the second one is taken for tail call sites.
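In rough pseudocode, the two entry points behave like this; the helper names are illustrative, and the real stubs are a few instructions of hand-written assembly per CPU:

/* Non-tail entry point: the call site is identified by the return address,
   which is on the stack on x86 and in the link register on PowerPC. */
void inline_cache_miss_stub()
{
    void *call_site = get_return_address();
    inline_cache_miss(call_site);
}

/* Tail call entry point: the caller loaded the call site address into a
   register (rbx in the x86-64 listing above) before jumping. */
void inline_cache_miss_tail_stub()
{
    void *call_site = get_tail_call_site_register();
    inline_cache_miss(call_site);
}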

An optimization


Since allocation of PICs is something that happens rather frequently, it is good to optimize this operation. I implemented segregated free lists for the code heap, so allocating lots of small code heap blocks requires less work now. Also, since every PIC is referenced from at most one call site, when a call site's PIC is regenerated with a new set of cached classes, the old PIC can be returned to the free list immediately. This reduces the frequency of full garbage collections (when the code heap fills up, a full GC must be performed; there is no generational tenuring for code blocks).
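Here is a minimal sketch of the segregated free list idea, assuming block sizes are rounded up to a multiple of a fixed granularity; the names and numbers are illustrative, not the VM's actual code heap structures:

// Minimal sketch; each list holds free blocks of exactly one rounded size.
#include <cstdlib>

struct free_block { free_block *next; };

const int size_classes = 16;
const int granularity = 32; // bytes per size class

free_block *free_lists[size_classes];

int size_class(int size) { return (size + granularity - 1) / granularity; }

void *alloc_code_block(int size)
{
    int index = size_class(size);
    if(index < size_classes && free_lists[index])
    {
        free_block *block = free_lists[index]; // pop the head: O(1)
        free_lists[index] = block->next;
        return block;
    }
    return std::malloc(size); // stand-in for carving space off the end of the heap
}

void free_code_block(void *ptr, int size)
{
    int index = size_class(size);
    if(index >= size_classes) return; // large blocks would be handled separately
    free_block *block = (free_block *)ptr; // push onto the matching list
    block->next = free_lists[index];
    free_lists[index] = block;
}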

Megamorphic caching


If a call site is called with more than a few distinct classes of instances (the default maximum PIC size is 3 entries) then it is patched to call the megamorphic dispatch stub for that generic word. Megamorphic stubs are again generated in the VM, in the dispatch.cpp source file. The machine code that is generated, were it to be expressed in pseudo-C, looks like this:
void *obj = datastack[index];
void *class = class_of(obj);
if(cache[class] == class) goto cache[class + 1];
else megamorphic_cache_miss(class,generic,cache);

Every generic word has a fixed-size megamorphic cache of 32 entries (this can be changed by bootstrapping a new image, but I'm not sure there's any point). It works like a hashtable, except that if there is a collision, the old entry is simply evicted; there is no bucket chaining or probing strategy here. While less efficient than a polymorphic inline cache hit (which only entails a direct jump), megamorphic cache hits are still rather quick (some arithmetic and an indirect jump). There are no calls into the VM; it is all inline assembly. A megamorphic inline cache miss calls the mega_cache_miss primitive, which computes the correct method and updates the cache.
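In slightly more detail, a megamorphic dispatch amounts to something like the following; the hash function and helper names are illustrative, and on a miss the real machine code calls the mega_cache_miss primitive rather than a C function:

/* Direct-mapped cache sketch: hash the class, check one slot, evict on miss.
   Each generic word owns one such cache of 32 class/method pairs. */
void megamorphic_dispatch(void *class, void *generic, void **cache)
{
    long slot = ((long)class >> 3) & (32 - 1); /* illustrative hash */
    if(cache[slot * 2] == class)
        jump_to(cache[slot * 2 + 1]); /* hit: some arithmetic and an indirect jump */
    else
    {
        void *method = full_dispatch(class,generic); /* the mega_cache_miss path */
        cache[slot * 2] = class; /* evict whatever was in this slot */
        cache[slot * 2 + 1] = method;
        jump_to(method);
    }
}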

Performance


I added some instrumentation code to the VM which counts the number of cache hits and misses. The majority of call sites are monomorphic or polymorphic, and megamorphic call sites are very rare. Megamorphic cache misses are rarer still.

The performance gain for benchmark runtimes was what I expected: a 5-10% improvement. The old method dispatch system was already pretty efficient, and modern CPUs' branch prediction helped it a lot. For compile time, however, the gain was a lot more drastic, and definitely made the effort I put into implementing PICs pay off: bootstrap is now a full minute faster than before, clocking in at around 3 minutes, and the load-all word, which loads and compiles all the libraries in the distribution, used to take 25 minutes and now takes 11 minutes. All timings are from my MacBook Pro.

The compile-time gain is not a direct result of the inline caching itself, but rather stems from the fact that the compiler has to compile less code. With the old method dispatch implementation, every generic word was an ordinary word under the hood, with a huge, auto-generated body containing conditionals and jump tables. When a method was added or removed, the dispatcher code had to be re-generated and re-compiled, which takes a long time. With polymorphic inline caching, the VM essentially generates, lazily, just those parts of the dispatch code which are actually used, and since it uses the VM's JIT instead of the optimizing compiler, it can simply glue bits of machine code together rather than building an IR and optimizing it, a step which yielded no benefit for the kind of code that appeared in a generic word body anyway.

Improving compile time is one of my major goals for Factor 1.0, since it improves user experience when doing interactive development, and I'm very happy with the progress so far.

Friday, May 22, 2009

Live status display for Factor's build farm, and other improvements

Lately I've made a few additional improvements to the build infrastructure. First, build reports sent to the Factor-builds mailing list are now formatted as HTML, which makes them a bit more readable. Second, there is a Twitter feed of binary uploads: @FactorBuilds. Finally, we now have a live status display for build machines. Previously if I wanted to know what a build machine was doing, I'd have to log in with ssh and poke around -- total waste of time. Now, just clicking on the OS/CPU combination on factorcode.org takes you to a status page. For example, here is the status page for Mac OS X/x86. At the time of this blog post, I see that this machine is running tests for a certain GIT revision. Each GIT revision is a link to the GitHub mirror of the Factor repository so you can see exactly what's going on. Finally, the latest build report can be viewed too; this has the build failures, if any, as well as benchmark timings.

Our continuous build system is called mason and adds up to a total of 1196 lines of code. Eduardo Cavazos wrote the first iteration, called builder. I forked it and added additional features.

It's been over a year since we started using continuous integration in the Factor project, and I can say it's been an overwhelming success. The first iteration of the build system would load all libraries and run all unit tests, and send an email with the results. If everything succeeded, it would upload a binary package. Over the last year, more than 5000 binary packages were uploaded. Over time, we added more checks to the build farm. It now runs help lint checks, which ensure that examples in documentation work and that there are no broken links, and checks for compiler errors.

I think the code quality has definitely gone up over the last year; having a dozen machines running tests all the time does a good job of triggering obscure bugs, and the automatic upload of binaries when tests pass saves us from the hassle of making manual releases.

I think pretty soon we're going to start having releases again, in addition to continuous builds. I intend to make the release process semi-automatic: when I decide to do a release, I want to send some type of command to all the build machines to have them build a given GIT ID and upload the binaries to a special directory. The goal is to have regular releases until 1.0, starting a few weeks from now. I'm not going to commit to a schedule for 1.0 yet, but having regular releases and published change logs is the first step of the process.

Thursday, May 21, 2009

Unboxed tuple arrays in Factor

One difference between high level languages and systems languages is that high level languages typically hide the notion of a pointer, and by extension, value versus reference semantics, from the programmer. Systems languages make this explicit, and the programmer can make a choice whether or not to pass an object by value or by reference.

In particular, this level of control can lead to significant space savings for large arrays of homogeneous data. If the elements of an array are themselves not shared by other structures, then storing values directly in the array will, at the very least, save on the space usage of a pointer, together with an object header.

I don't believe there is any reason for high-level languages to not offer more fine-grained control over heap storage. Factor is dynamically typed but it offers both arrays of unboxed numeric values, and arrays of tuples, which I will describe below. In some cases, using these special-purpose data types can offer significant performance and space gains over polymorphic arrays.

Consider a Factor tuple definition:
TUPLE: point x y ;

We can find out how much space points use in memory -- this is on an x86-64 machine:
( scratchpad ) point new size .
32

The x and y slots are as wide as a machine pointer, so together they use 16 bytes. This means that there is an additional 16-byte overhead for every instance. Indeed, Factor tuples store a header and a pointer to a layout object (shared by all instances of that class) in addition to the slot data itself. While this compares favorably to other dynamic languages (see this blog post for instance; Ruby in particular has really terrible object representation), it still adds up to a significant overhead when you have a large collection of points.

Suppose you have an array of N points. Each array element is a pointer (8 bytes) to a point object (32 bytes). Assuming the points are not referenced from any other collection, that's 40*N bytes for the entire collection. However, only 16*N bytes of that is real data, namely the x and y slots of every element, so we're wasting more than half of the heap space here, not to mention increasing GC overhead by allocating tons of small objects.
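Expressed in C++ purely as an analogy (two header-sized words of overhead per boxed object, matching the 32-byte figure above), the difference looks like this:

#include <cstdio>

// Analogy only: an array of pointers to heap-allocated points versus an
// array storing the two slots inline. Sizes assume a 64-bit machine.
struct boxed_point { void *header; void *layout; long x, y; }; // 32 bytes
struct flat_point { long x, y; }; // 16 bytes

int main()
{
    const long n = 1000000;
    std::printf("array of pointers: %ld bytes\n",
        n * (long)(sizeof(void *) + sizeof(boxed_point))); // 40*N
    std::printf("flat array: %ld bytes\n",
        n * (long)sizeof(flat_point)); // 16*N
    return 0;
}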

This overhead can be avoided, however, using Factor's tuple-arrays library. Tuple arrays implement the same sequence protocol as arrays, but the semantics differ. Instead of storing references to objects, they store object slots directly within the array. There are two main restrictions that result from this:
  • Tuple arrays are homogeneous; all elements are of the exact same type. This rules out setting some elements to f to represent missing data, or having elements which are instances of different subclasses of the same base class.
  • Tuple arrays implement value semantics. If you get an element out, and mutate it, you have to store it back into the array to update the original value in the array.

So how do tuple arrays work? The original implementation, by Daniel Ehrenberg, defined a single tuple-array class whose constructor took a class word as a parameter, and reflection APIs were used to implement getting and setting elements. While this had all the space savings described above, constructing and deconstructing tuples via reflection is not the fastest thing in the world; even the most efficient possible implementation would require some indirection and type checking in order to work.

A better approach is to generate a new sequence type for each element type. This is what the current implementation does. Because all of Factor's sequence operations are written in terms of a sequence protocol, these generated classes are for the most part transparent to the programmer, and there is no added inconvenience over the previous approach. To use it, you invoke a parsing word with the element type class you want to use:
TUPLE-ARRAY: point

Assuming a class named point was previously defined, this generates a point-array class with associated methods.

Briefly, the implementation works as follows. The underlying storage of a tuple array is an ordinary array. If the tuple class has N slots, then every group of N elements in the underlying array represents a single element of the tuple array. The nth and set-nth methods, which are generated once for each tuple array class, read and write a group of N elements and package them into a tuple.
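As a C++-flavored sketch of the storage scheme only (the real nth and set-nth are generated Factor words, and the names here are made up), element i occupies slots i*N through i*N+N-1 of the backing array:

#include <vector>

// Sketch of the flattened storage scheme for a two-slot point class.
struct point { double x, y; };

struct point_array_sketch
{
    static const int n = 2; // slots per element
    std::vector<double> seq; // flattened backing storage

    explicit point_array_sketch(long length) : seq(length * n) {}

    point nth(long i) const // read a group of n slots, boxing a fresh value
    {
        point p = { seq[i * n], seq[i * n + 1] };
        return p;
    }

    void set_nth(const point &p, long i) // unpack a value into its n slots
    {
        seq[i * n] = p.x;
        seq[i * n + 1] = p.y;
    }
};

Note that nth returns a copy, which is exactly the value semantics described above.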

Now let's dive into the guts of the implementation. The parsing word is defined as follows:
SYNTAX: TUPLE-ARRAY: scan-word define-tuple-array ;

So it delegates all the hard work to a define-tuple-array word. This word is a functor, the same mechanism used to implement Factor's specialized numeric arrays. Functors are similar to C++ templates in spirit, but really they are implemented entirely as a library, and it's all just syntax sugar for parsing word machinery (which is a lot more powerful than C++ templates). They're named "functors" to annoy category theory fanboys and language purists. A functor definition begins with the FUNCTOR: parsing word:
FUNCTOR: define-tuple-array ( CLASS -- )

A functor is essentially the same as a word defined with ::, but with additional syntax sugar. So CLASS here is a local variable available in the body of the functor. I use the convention of uppercase names for functor parameters. What follows FUNCTOR: is a list of clauses which create new words at parse time. The left hand side of each clause is a new local variable name that the generated word will be bound to, and the right hand side is an interpolate form describing what the new word should be named. The middle operator specifies whether the word should already exist (IS) or be newly defined (DEFINES, or DEFINES-CLASS for class words). So the definition of define-tuple-array proceeds as follows:
CLASS IS ${CLASS}

CLASS-array DEFINES-CLASS ${CLASS}-array
CLASS-array? IS ${CLASS-array}?

<CLASS-array> DEFINES <${CLASS}-array>
>CLASS-array DEFINES >${CLASS}-array

Next, a functor definition calls WHERE, which switches the parser to reading the functor body.

After that, what follows is a series of forms which look like word definitions, but really they parse as code which defines words at the time that the functor is executed. Both the word names and bodies may reference local variables defined in the first part of the functor:
WHERE

TUPLE: CLASS-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <CLASS-array> ( length -- tuple-array )
[ \ CLASS [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ CLASS-array boa ; inline

M: CLASS-array length length>> ;

M: CLASS-array nth-unsafe tuple-slice \ CLASS read-tuple ;

M: CLASS-array set-nth-unsafe tuple-slice \ CLASS write-tuple ;

M: CLASS-array new-sequence drop <CLASS-array> ;

: >CLASS-array ( seq -- tuple-array ) 0 <CLASS-array> clone-like ;

M: CLASS-array like drop dup CLASS-array? [ >CLASS-array ] unless ;

INSTANCE: CLASS-array sequence

So every time someone calls define-tuple-array (most likely with the TUPLE-ARRAY: parsing word) a new class is defined, together with a constructor word and some methods.

Finally, we end the functor with:
;FUNCTOR

To illustrate, this means that the following:
TUPLE-ARRAY: point

Is equivalent to the following:
TUPLE: point-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <point-array> ( length -- tuple-array )
[ \ point [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ point-array boa ; inline

M: point-array length length>> ;

M: point-array nth-unsafe tuple-slice \ point read-tuple ;

M: point-array set-nth-unsafe tuple-slice \ point write-tuple ;

M: point-array new-sequence drop <point-array> ;

: >point-array ( seq -- tuple-array ) 0 <point-array> clone-like ;

M: point-array like drop dup point-array? [ >point-array ] unless ;

INSTANCE: point-array sequence

Now imagine writing all of the above code out every time you want a specialized tuple array; that would be a serious amount of boilerplate. Of course, at this point it still looks like runtime reflection is happening, because we're passing the point class word to the read-tuple and write-tuple words. However, what actually happens is that these words get inlined, and the fact that the class parameter is a compile-time constant triggers a series of macro expansions and constant folding optimizations that boil the code down to an efficient, specialized routine for packing and unpacking points from the array. If you're interested in the gory details, take a look at the tuple arrays source code.

So the space savings are clearly nice to have, but what about performance? Here is a program that uses polymorphic arrays:
TUPLE: point { x float } { y float } ;

: polymorphic-array-benchmark ( -- )
    5000 [ point new ] replicate
    1000 [
        [
            [ 1+ ] change-x
            [ 1- ] change-y
        ] map
    ] times drop ;

Here is the same program using tuple arrays:
TUPLE: point { x float } { y float } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
    5000 <point-array>
    1000 [
        [
            [ 1+ ] change-x
            [ 1- ] change-y
        ] map
    ] times drop ;

On my MacBook Pro, the first version, using polymorphic arrays, runs in 0.39 seconds, while the second version, using tuple arrays, runs in 0.91 seconds. Clearly, there is some overhead from copying the objects in and out of the array.

Now, let's try a slightly different program. Instead of mutating the elements of the array, let's instead extract the slots, modify them, and make a new tuple. Here is a version using polymorphic arrays:
TUPLE: point { x float read-only } { y float read-only } ;

: polymorphic-array-benchmark ( -- )
    5000 [ point new ] replicate
    1000 [
        [
            [ x>> 1+ ] [ y>> 1- ] bi <point>
        ] map
    ] times drop ;

This version runs in 0.94 seconds; the additional object allocation makes it roughly 2.5 times slower than mutating the points in place. Now, let's try the same thing with tuple arrays:
TUPLE: point { x float read-only } { y float read-only } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
    5000 <point-array>
    1000 [
        [
            [ x>> 1+ ] [ y>> 1- ] bi <point>
        ] map
    ] times drop ;

For this variant I'm getting runtimes of 0.59 seconds, which is faster than the same code with polymorphic arrays. The reason for the speed boost is that the Factor compiler's escape analysis pass kicks in, eliminating all tuple allocation from within the loop completely.

So to summarize, in order from fastest to slowest:
  • Polymorphic arrays with in-place mutation
  • Tuple arrays with read-only tuples
  • Tuple arrays with in-place mutation
  • Polymorphic arrays with read-only tuples

Finally, a note about benchmarking methodology. Because garbage collection can add an element of unpredictability to benchmarks, I force the GC to run first:
( scratchpad ) gc [ tuple-array-benchmark ] time
== Running time ==

0.593989 seconds

The combination of a high-level language with efficient abstractions that the compiler knows how to optimize is very powerful. Factor occupies a middle ground between the dynamic languages and systems languages. I want Factor to be a language that you can write an entire application in, even the performance-critical parts. Often you hear that "performance doesn't matter", but outside of a few I/O-bound problem domains, this is false (and even there, you want to heavily optimize your I/O and text processing routines). Very often, people cite some variation of the "80/20 performance rule" (20% of your program runs 80% of the time; adjust percentages to suit your strawman). Of course, the idea is that you can write the 80% in a high level language, dropping down to C for the performance-critical 20%. I think this reasoning is flawed, for two reasons:
  • As my friend Cameron Zwarich likes to point out, very little justification is given for the 80/20 rule itself. For many programs, the performance profile is rather flat, and every indirection, every unnecessary runtime check, and every unnecessary runtime memory allocation adds up to a non-trivial amount of overhead. The "performance bottleneck" myth originated in a survey of Fortran programs which did numerical calculations. For numerics, it is indeed true that most of the runtime is spent in a handful of inner loops; whether or not this is true for ordinary programs is far less clear.
  • Even if runtime really is dominated by a small number of routines, chances are these routines are algorithmically complex, hard to write, and nice to experiment with interactively; precisely the domain where high-level languages with advanced abstractions shine.

I think it's about time dynamic language advocates stopped citing the same old myths to justify poor implementation techniques. Their energy would be better spent researching compiler optimizations and garbage collection techniques which they can apply in their favorite language implementations instead. Hats off to the V8, SquirrelFish, Unladen Swallow and SBCL projects for taking performance seriously.

Friday, May 08, 2009

Factor VM ported to C++

Most of Factor is implemented in Factor (the parser, data structures, optimizing compiler, input/output, the UI toolkit, etc.); however, a few bits and pieces are not completely self-hosting. This includes the garbage collector, the non-optimizing compiler (used for bootstrap), and various runtime services. This code comprises what I call the "Factor VM". Even though it is not a virtual machine in the classical sense, since there's no interpreter or bytecode format, it serves a similar role to the VM of scripting languages.

Ever since Factor lost its JVM dependency in mid-2004, the VM has been written in C with some GNU extensions. Over time, I've noticed that a few things in the VM source code seem to be hard to express and error-prone in straight C. I've been thinking about switching to C++ for a long time, and now I've finally decided to bite the bullet and do it. I'm pleased with the result, and other than a longer compile time for the VM, I don't have any complaints so far.

I started off by first getting the existing C code to compile with the GNU C++ compiler. Since C++ is almost, but not quite, a superset of C, I had to make a few changes to the source:
  • Renaming a few identifiers (I had local variables named class, template and new in some places)
  • Fixing some type safety issues (C will implicitly cast void* to any other pointer type, whereas C++ will not)
  • Moving global variable declarations to source files, and changing the declarations in the headers to be extern
  • Making a few functions extern "C" since they are called from code generated by the Factor compiler

Once this was done and everything was working, I started the process of refactoring my C pain points using C++ language features and idioms. This process is far from complete; at the end of the blog post I will outline what remains to be done.

Miscellaneous improvements


While the Factor VM is pretty small, around 13,000 lines of code right now, being able to use C++ namespaces is still nice, since I've had name clashes with system C headers in the past. Now everything is wrapped in a factor namespace, so this should happen less often in the future. Also, I changed a bunch of #defines to inline functions and constants.

Tagged pointers


Factor is dynamically typed. While the optimizing compiler sometimes infers types and eliminates type checks, it generates machine code directly and mostly sidesteps the VM entirely. As far as the VM is concerned, it must be possible to determine the type of a value from its runtime representation. There are many approaches to doing this; the one I've been using from the start is "tagged pointers".

Here is how it works: in the Factor VM, objects are always allocated on 8-byte boundaries in the data heap. This means every address is a multiple of 8, or equivalently, the least significant 3 bits of a raw address are zero. I take advantage of this to store some type information in these bits. A related trick that the Factor VM does is to reserve a tag for small integers that fit in a pointer; so a 29-bit integer (on 32-bit platforms) or a 61-bit integer (on 64-bit platforms) does not need to be heap-allocated.
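For example, with 3 tag bits a fixnum can be boxed and unboxed with a shift; this sketch assumes the fixnum tag is zero, which makes tagging a single shift:

/* Sketch only: 3 tag bits, and an assumed fixnum tag of zero. */
typedef unsigned long cell;

cell tag_fixnum(long value)
{
    return (cell)value << 3; /* the low 3 bits become the (zero) fixnum tag */
}

long untag_fixnum(cell tagged)
{
    return (long)tagged >> 3; /* arithmetic shift restores the sign */
}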

With 3 tag bits, we get a total of 8 unique tags. Factor has more than 8 data types, though, and indeed the user can define their own data types too. So there are two tags used to denote that the data type is actually stored in the object's header. One of these is used for built-in VM types that don't have a unique tag number, another one is used for user-defined tuple class instances.

Since C and C++ do not have native support for tagged pointers, the VM needs a mechanism to convert tagged pointers to untagged pointers, and vice versa. When converting a tagged pointer to an untagged pointer, a type check is performed.

In the C VM, I had code like the following:
#define TAG_MASK 7
#define UNTAG(tagged) (tagged & ~TAG_MASK)

typedef struct { ... } F_ARRAY;

#define ARRAY_TYPE 2

F_ARRAY *untag_array(CELL tagged)
{
    type_check(ARRAY_TYPE,tagged);
    return (F_ARRAY *)UNTAG(tagged);
}

Here, CELL is typedef'd to an unsigned integer as wide as void*, and UNTAG is a macro which untags a pointer by masking off the low 3 bits (7 decimal is 111 binary).

To convert untagged pointers to tagged form, I used code like the following:
#define RETAG(untagged,tag) (((CELL)(untagged) & ~TAG_MASK) | (tag))

CELL tag_array(F_ARRAY *untagged)
{
    return RETAG(untagged,ARRAY_TYPE);
}

For hi-tag types, such as strings, I used a single tag_object() function:
CELL tag_object(void *untagged)
{
    return RETAG(untagged,OBJECT_TYPE);
}

The OBJECT_TYPE constant was misleadingly named, since everything in Factor is an object. However, in the VM it means "an object with type info in its header".

The old C code for tagging and untagging pointers was mostly adequate, however it suffered from a few issues. First, there was a lot of boilerplate; each tag type has an untag/tag function pair, and each hi-tag type had an untag function. Furthermore, every time I changed a hi-tag type into a tag type, I had to find usages of tag_object() and change them appropriately. Over time I developed several preprocessor macros to clean up this type of stuff but I was never happy with it.

In C++, the whole thing came out much cleaner. The UNTAG and RETAG macros are still around (although I may change them to inline functions at some point). However, for tagging and untagging pointers, I use a single set of template functions. To tag an untagged pointer:
template <typename T> cell tag(T *value)
{
    return RETAG(value,tag_for(T::type_number));
}

That's it; one function for all data types. It is used like so:
array *a = ...;
cell c = tag(a);

How does it work? The F_ARRAY struct is now the array class, and it has a static variable type_number:
struct array : public object
{
    static const cell type_number = 2;
    ...
};

How do hi-tag pointers get handled? Well, their type numbers are all greater than or equal to 8. The tag_for() function checks for this, and returns the object tag number if that is the case. Since it is an inline function, the computation is constant-folded at compile time, and the generated code is as efficient as the old C version, only more type-safe and with less boilerplate.
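I have not reproduced the real tag_for() here; the idea, with illustrative constant values, is roughly:

/* Sketch of tag_for(): type numbers below 8 are tag numbers themselves,
   while hi-tag types all share the object header tag. */
inline cell tag_for(cell type_number)
{
    return type_number < 8 ? type_number : OBJECT_TYPE;
}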

For untagging tagged pointers, I use a more complicated scheme. I have a templated "smart pointer" class representing a tagged pointer with an associated data type. Here is a simplified version of the class; this omits the type checking logic and a bunch of other utility methods; if you want the gory details, the full source is in tagged.hpp.
template <typename T>
struct tagged
{
    cell value_;
    T *untagged() const { return (T *)(UNTAG(value_)); }
    T *operator->() const { return untagged(); }
    explicit tagged(T *untagged) : value_(factor::tag(untagged)) { }
};

The overload of operator-> means that tagged pointers can be used like ordinary pointers in some cases without needing to be untagged first. Otherwise, a pointer stored in a cell can be untagged like so:
array *a = tagged<array>(c).untagged()

I made a utility function to encapsulate this type of thing, for when I just want to untag a value once and not pass it around as a smart pointer:
template <typename T> T *untag(cell value)
{
    return tagged<T>(value).untagged();
}

Now I can write
array *a = untag<array>(c)

This compiles down to the same machine code as the C version but it is more type-safe and less verbose.

Registration of local variables with the garbage collector


Any object allocation can trigger a GC, and the GC must be able to identify all live values at that point. For compiled Factor code, this is pretty easy. The GC needs to scan a few globals, the data and retain stacks, as well as the call stack; call frames for Factor code have a known format. The compiler does store objects in registers, but it moves them to the stack before calling into the GC.

However, the VM code itself uses the garbage collector to allocate its own internal state. This makes the VM code more robust, since it doesn't need to worry about when to deallocate memory, but it presents a problem: if a C local variable points into the GC heap, how does the GC know that the value is live, and how does it know to update the pointer if the value got moved? (Factor's GC is a generational semi-space collector, so objects move around a lot.) Between function calls, compiled C code saves pointers in the call frame. However, since I have no control over the machine code gcc generates, I cannot force call stack frames into a certain format that the GC can understand, like I do with compiled Factor code.

One solution is conservative garbage collection. The garbage collector can just pessimistically assume that any integer variable on the call stack might point into the GC heap. There are two problems with this approach. The first is that if you have a bit pattern that happens to look like a pointer but is really just a random number, you might keep a value live longer than it really needs to be. From what I've heard, this is mostly just a theoretical concern. A more pressing issue is that since you do not know what is a real pointer and what isn't, you cannot move objects that might be referenced from the call stack. This can lead to fragmentation and more expensive allocation routines.

The technique I've been using until now in the C VM code was to have a pair of macros, REGISTER_ROOT and UNREGISTER_ROOT. These push and pop the address of a local variable on a "GC root stack". So by traversing the GC root stack, the garbage collector knows which call stack locations refer to objects in the GC heap, and which are just integers. The disadvantage of this approach is that the paired registration and unregistration calls were verbose and error-prone, and forgetting to register a root before calling a function which may potentially allocate memory would lead to hard-to-discover bugs that were triggered rarely and resulted in memory corruption.

Using C++ language features, I instead implemented a "smart pointer" gc_root class. It subclasses the tagged smart pointer I described above, adding constructor and destructor logic that registers and unregisters the address of the wrapped pointer as a GC root. Together with the overloaded operator-> this makes for much clearer and safer code. By updating existing code to use this new smart pointer I found a few places in the old C code where I had forgotten to register roots. No doubt one or two users saw Factor crash because of these. You can look at the source code for the GC root abstraction.
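The shape of the abstraction, heavily simplified (root_stack is an illustrative name, and the real gc_root also inherits the type checking and other utilities of tagged):

// Heavily simplified sketch of the RAII idea; root_stack is illustrative.
template <typename T>
struct gc_root : public tagged<T>
{
    explicit gc_root(T *untagged) : tagged<T>(untagged)
    {
        /* Register the address of the wrapped pointer, so the collector can
           both trace it and update it if the object moves. */
        root_stack.push(&this->value_);
    }

    ~gc_root()
    {
        /* Destructors run in reverse order, keeping the root stack balanced. */
        root_stack.pop();
    }
};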

Object-oriented abstractions


Most functions in the VM are still top-level C-style functions; however, in a few places I've found good reason to use methods and even inheritance. I've avoided virtual methods so far. One place where inheritance has really helped clean up some crufty old C code is in what I call the JIT. This is a simple native code generator that the VM uses to fill in the gaps for the optimizing compiler that's written in Factor. The JIT is used to generate machine code in several situations, such as compiling quotations and generating inline cache stubs.

The old design was built out of C macros and code duplication. Now there is a jit class with various methods to emit machine code. Subclasses of jit, such as quotation_jit and inline_cache_jit, add additional functionality. The constructor and destructor take care of registering and unregistering the objects used by the JIT with the GC.
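In outline, and with illustrative members only, the hierarchy looks something like this:

// Outline only; the real classes have more members and methods.
struct jit
{
    cell owner; // the Factor object the code is being generated for
    // ... a growable code buffer, relocation data, and so on ...

    jit(cell owner); // registers its objects as GC roots
    ~jit(); // unregisters them
};

struct quotation_jit : public jit
{
    // compiles quotation bodies
};

struct inline_cache_jit : public jit
{
    // generates polymorphic inline cache stubs
};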

Functionality moved out of the VM


Since I prefer programming in Factor over low-level languages such as C and C++, I try to keep the VM small. A couple of things were in the VM for historical reasons only. The complex number and rational number data types, for instance, were defined in the VM, even though operations on them were in Factor code. This is because the number tower predated user-defined tuple types by a year or so. Nowadays there is no reason not to use tuples to define these types, so I made the change, simplifying the VM. Another piece of functionality that no longer had a good reason to be in the VM was some vestigial code for converting Factor's Unicode string type to and from Latin-1 and UCS-2 encodings. Ever since Daniel Ehrenberg implemented the io.encodings library in Factor, most string encoding and decoding was done in library code, and a lot more encodings were implemented; UTF8, Shift-JIS, MacRoman, EBCDIC, etc. Only a couple of things still relied on the VM's broken built-in string encoding. They're now using the library just like everything else.

Code heap compaction


Factor allocates compiled machine code in its own heap, separate from the data heap. A mark/sweep/compact garbage collector manages the code heap. When performing a compaction, the code heap garbage collector needs to maintain an association from each code block's old address to its new address. Previously this would be stored in the code block's header, wasting 4 bytes (or 8 bytes on 64-bit platforms) per code block. Now I use std::tr1::unordered_map to do this. Of course I could've written a hashtable in C, or used an existing implementation, but using STL is more convenient.
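Conceptually the table is used like this; a stand-alone sketch rather than the VM's actual compaction code:

#include <tr1/unordered_map>

typedef unsigned long cell;

// Stand-alone sketch: record where each code block moved, then rewrite
// stored code pointers through the table.
int main()
{
    std::tr1::unordered_map<cell, cell> forwarding;

    cell old_address = 0x10000, new_address = 0x8000;
    forwarding[old_address] = new_address; // recorded while sliding blocks down

    cell code_pointer = 0x10000; // e.g. a return address into the code heap
    code_pointer = forwarding[code_pointer]; // fixed up after the blocks move

    return code_pointer == new_address ? 0 : 1;
}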

Future improvements


I don't spend much time working on the Factor VM, since most of the action is in the compiler, library, and UI, which are all written in Factor and not C++. However, at some point I'm going to do more work on the garbage collector, and I will do some more cleanups at that time:
  • Eliminating usages of GNU C extensions, so that the Factor VM can be compiled with other compilers, such as Microsoft's Visual Studio and LLVM's Clang
  • Eliminating global variables
  • Using new/delete instead of malloc() and free() for the small set of data that the VM allocates in unmanaged memory
  • Various other cleanups, such as making better use of method invocation and templates
  • Addressing some code duplication in the garbage collector with C++ templates


Conclusion


A lot of these cleanups and improvements could have been achieved in C, with clever design and various preprocessor macro tricks. However, I think C++ offers a better set of abstractions for the problem domain that the Factor VM lives in, and since I only use the language features that have no runtime penalty (I'm not using C++ exceptions, virtual methods, or RTTI) there's no performance decrease either. I'm very happy with how the code cleanup turned out. Making the VM easier to understand and maintain will help with implementing more advanced garbage collection algorithms, among other things.