Friday, May 08, 2009

Factor VM ported to C++

Most of Factor is implemented in Factor (the parser, data structures, optimizing compiler, input/output, the UI toolkit, etc.); however, a few bits and pieces are not completely self-hosting. This includes the garbage collector, non-optimizing compiler (used for bootstrap), and various runtime services. This code comprises what I call the "Factor VM". Even though it is not a Virtual Machine in the classical sense, since there's no interpreter or bytecode format, it serves a similar role to the VM of scripting languages.

Ever since Factor lost its JVM dependency in mid-2004, the VM has been written in C with some GNU extensions. Over time, I've noticed that a few things in the VM source code seem to be hard to express and error-prone in straight C. I've been thinking about switching to C++ for a long time, and now I've finally decided to bite the bullet and do it. I'm pleased with the result, and other than a longer compile time for the VM, I don't have any complaints so far.

I started by getting the existing C code to compile with the GNU C++ compiler. Since C++ is almost, but not quite, a superset of C, I had to make a few changes to the source:
  • Renaming a few identifiers (I had local variables named class, template and new in some places)
  • Fixing some type safety issues (C will implicitly cast void* to any other pointer type, whereas C++ will not)
  • Moving global variable declarations to source files, and changing the declarations in the headers to be extern
  • Making a few functions extern "C" since they are called from code generated by the Factor compiler

Once this was done and everything was working, I started the process of refactoring my C pain points using C++ language features and idioms. This process is far from complete; at the end of the blog post I will outline what remains to be done.

Miscellaneous improvements


While the Factor VM is pretty small, around 13,000 lines of code right now, being able to use C++ namespaces is still nice, since I've had name clashes with system C headers in the past. Now everything is wrapped in a factor namespace, so this should happen less often in the future. Also, I changed a bunch of #defines to inline functions and constants.

Tagged pointers


Factor is dynamically typed. While the optimizing compiler sometimes infers types and eliminates type checks, it generates machine code directly and mostly sidesteps the VM. As far as the VM is concerned, it must be possible to determine the type of a value from its runtime representation. There are many approaches to doing this; the one I've been using from the start is "tagged pointers".

Here is how it works: in the Factor VM, objects are always allocated on 8-byte boundaries in the data heap. This means every address is a multiple of 8, or equivalently, the least significant 3 bits of a raw address are zero. I take advantage of this to store some type information in these bits. A related trick that the Factor VM does is to reserve a tag for small integers that fit in a pointer; so a 29-bit integer (on 32-bit platforms) or a 61-bit integer (on 64-bit platforms) does not need to be heap-allocated.

With 3 tag bits, we get a total of 8 unique tags. Factor has more than 8 data types, though, and the user can define their own data types too. So two tags are reserved to denote that the data type is actually stored in the object's header: one is used for built-in VM types that don't have a unique tag number, and the other for user-defined tuple class instances.

Since C and C++ do not have native support for tagged pointers, the VM needs a mechanism to convert tagged pointers to untagged pointers, and vice versa. When converting a tagged pointer to an untagged pointer, a type check is performed.

In the C VM, I had code like the following:
#define TAG_MASK 7
#define UNTAG(tagged) (tagged & ~TAG_MASK)

typedef struct { ... } F_ARRAY;

#define ARRAY_TYPE 2

F_ARRAY *untag_array(CELL tagged)
{
    type_check(ARRAY_TYPE,tagged);
    return (F_ARRAY *)UNTAG(tagged);
}

Here, CELL is typedef'd to an unsigned integer as wide as void*, and UNTAG is a macro which untags a pointer by masking off the low 3 bits (7 decimal is 111 binary).

To convert untagged pointers to tagged form, I used code like the following:
#define RETAG(untagged,tag) (((CELL)(untagged) & ~TAG_MASK) | (tag))

CELL tag_array(F_ARRAY *untagged)
{
    return RETAG(untagged,ARRAY_TYPE);
}

For hi-tag types, such as strings, I used a single tag_object() function:
CELL tag_object(void *untagged)
{
    return RETAG(untagged,OBJECT_TYPE);
}

The OBJECT_TYPE constant was misleadingly named, since everything in Factor is an object. However, in the VM it means "an object with type info in its header".

The old C code for tagging and untagging pointers was mostly adequate, but it suffered from a few issues. First, there was a lot of boilerplate; each tag type had a tag/untag function pair, and each hi-tag type had its own untag function. Furthermore, every time I changed a hi-tag type into a tag type, I had to find usages of tag_object() and change them appropriately. Over time I developed several preprocessor macros to clean up this type of stuff, but I was never happy with it.

In C++, the whole thing came out much cleaner. The UNTAG and RETAG macros are still around (although I may change them to inline functions at some point). However, for tagging and untagging pointers, I use a single set of template functions. To tag an untagged pointer:
template <typename T> cell tag(T *value)
{
    return RETAG(value,tag_for(T::type_number));
}

That's it; one function for all data types. It is used like so:
array *a = ...;
cell c = tag(a);

How does it work? The F_ARRAY struct is now the array class, and it has a static constant type_number:
struct array : public object
{
    static const cell type_number = 2;
    ...
};

How do hi-tag pointers get handled? Well, their type numbers are all greater than or equal to 8. The tag_for() function checks for this, and returns the object tag number if that is the case. Since it is an inline function, the computation is constant-folded at compile time, and the generated code is as efficient as the old C version, only more type-safe and with less boilerplate.

For untagging tagged pointers, I use a more complicated scheme. I have a templated "smart pointer" class representing a tagged pointer with an associated data type. Here is a simplified version of the class; this omits the type checking logic and a bunch of other utility methods; if you want the gory details, the full source is in tagged.hpp.
template <typename T>
struct tagged
{
    cell value_;
    T *untagged() const { return (T *)(UNTAG(value_)); }
    T *operator->() const { return untagged(); }
    explicit tagged(cell tagged) : value_(tagged) { }
    explicit tagged(T *untagged) : value_(factor::tag(untagged)) { }
};

The overload of operator-> means that tagged pointers can be used like ordinary pointers in some cases, without needing to be untagged first. Otherwise, a pointer stored in a cell can be untagged like so:
array *a = tagged<array>(c).untagged();

I made a utility function to encapsulate this type of thing, for when I just want to untag a value once and not pass it around as a smart pointer:
template <typename T> T *untag(cell value)
{
    return tagged<T>(value).untagged();
}

Now I can write
array *a = untag<array>(c);

This compiles down to the same machine code as the C version but it is more type-safe and less verbose.

Registration of local variables with the garbage collector


Any object allocation can trigger a GC, and the GC must be able to identify all live values at that point. For compiled Factor code, this is pretty easy. The GC needs to scan a few globals, the data and retain stacks, as well as the call stack; call frames for Factor code have a known format. The compiler does store objects in registers, but it moves them to the stack before calling into the GC.

However, the VM code itself uses the garbage collector to allocate its own internal state. This makes the VM code more robust, since it doesn't need to worry about when to deallocate memory, but it presents a problem: if a C local variable points into the GC heap, how does the GC know that the value is live, and how does it know to update the pointer if the value gets moved? (Factor's GC is a generational semi-space collector, so objects move around a lot.) Between function calls, compiled C code saves pointers in the call frame. However, since I have no control over the machine code gcc generates, I cannot force call stack frames into a certain format that the GC can understand, like I do with compiled Factor code.

One solution is conservative garbage collection. The garbage collector can just pessimistically assume that any integer variable on the call stack might point into the GC heap. There are two problems with this approach. The first is that if you have a bit pattern that happens to look like a pointer, but is really just a random number, you might keep a value live longer than it really needs to be. From what I've heard, this is mostly a theoretical concern. A more pressing issue is that since you do not know what is a real pointer and what isn't, you cannot move objects that might be referenced from the call stack. This can lead to fragmentation and more expensive allocation routines.

The technique I've been using until now in the C VM code was to have a pair of macros, REGISTER_ROOT and UNREGISTER_ROOT. These push and pop the address of a local variable on a "GC root stack". By traversing the GC root stack, the garbage collector knows which call stack locations refer to objects in the GC heap and which are just integers. The disadvantage of this approach is that pairing up root registration and unregistration was verbose and error-prone, and forgetting to register a root before calling a function which might allocate memory would lead to hard-to-discover bugs that were triggered rarely and resulted in memory corruption.

Using C++ language features, I instead implemented a "smart pointer" gc_root class. It subclasses the tagged smart pointer I described above, with additional functionality in the constructor and destructor to register and unregister the address of the wrapped pointer as a GC root. Together with the overloaded operator->, this makes for much clearer and safer code. While updating existing code to use this new smart pointer, I found a few places in the old C code where I had forgotten to register roots. No doubt one or two users saw Factor crash because of these. You can look at the source code for the GC root abstraction.

Object-oriented abstractions


Most functions in the VM are still top-level C-style functions; however, in a few places I've found good reason to use methods and even inheritance. I've avoided virtual methods so far. One place where inheritance has really helped clean up some crufty old C code is in what I call the JIT. This is a simple native code generator that the VM uses to fill in the gaps for the optimizing compiler that's written in Factor; it generates machine code in situations where the optimizing compiler cannot be used.

The old design was built out of C macros and code duplication. Now there is a jit class with various methods to emit machine code. Subclasses of jit, such as quotation_jit and inline_cache_jit, add additional functionality. The constructor and destructor take care of registering and unregistering the objects used by the JIT with the GC.

Functionality moved out of the VM


Since I prefer programming in Factor over low-level languages such as C and C++, I try to keep the VM small. A couple of things were in the VM for historical reasons only. The complex number and rational number data types, for instance, were defined in the VM, even though operations on them were in Factor code. This is because the number tower predated user-defined tuple types by a year or so. Nowadays there is no reason not to use tuples to define these types, so I made the change, simplifying the VM. Another piece of functionality that no longer had a good reason to be in the VM was some vestigial code for converting Factor's Unicode string type to and from Latin-1 and UCS-2 encodings. Ever since Daniel Ehrenberg implemented the io.encodings library in Factor, most string encoding and decoding has been done in library code, and a lot more encodings were implemented: UTF-8, Shift-JIS, MacRoman, EBCDIC, etc. Only a couple of things still relied on the VM's broken built-in string encoding. They're now using the library just like everything else.

Code heap compaction


Factor allocates compiled machine code in its own heap, separate from the data heap. A mark/sweep/compact garbage collector manages the code heap. When performing a compaction, the code heap garbage collector needs to maintain a mapping from each code block's old address to its new address. Previously this would be stored in the code block's header, wasting 4 bytes (or 8 bytes on 64-bit platforms) per code block. Now I use std::tr1::unordered_map to do this. Of course I could've written a hashtable in C, or used an existing implementation, but using STL is more convenient.

Future improvements


I don't spend much time working on the Factor VM, since most of the action is in the compiler, library, and UI, which are all written in Factor and not C++. However, at some point I'm going to do more work on the garbage collector, and I will do some more cleanups at that time:
  • Eliminating usages of GNU C extensions, so that the Factor VM can be compiled with other compilers, such as Microsoft's Visual Studio and LLVM's Clang
  • Eliminating global variables
  • Using new/delete instead of malloc() and free() for the small set of data that the VM allocates in unmanaged memory
  • Making better use of method invocation and templates, along with various other cleanups
  • Addressing some code duplication in the garbage collector with C++ templates


Conclusion


A lot of these cleanups and improvements could have been achieved in C, with clever design and various preprocessor macro tricks. However, I think C++ offers a better set of abstractions for the problem domain that the Factor VM lives in, and since I only use the language features that have no runtime penalty (I'm not using C++ exceptions, virtual methods, or RTTI) there's no performance decrease either. I'm very happy with how the code cleanup turned out. Making the VM easier to understand and maintain will help with implementing more advanced garbage collection algorithms, among other things.

3 comments:

aaronblohowiak said...

I think in other scenarios where there is a component to the language that is bundled in to facilitate the language's mechanics, but does not work on byte-code, they call it "the runtime". The example that comes to mind is the Objective-C runtime. So, "runtime" gets overloaded to mean not only the step at which code is executed but also to mean the system that facilitates its execution. I think that this is slightly better than having the word VM associated with factor as it is in no way interpreted (IIRC.)

Anonymous said...

With 3 bits, I think you have 8 possibilities, not 7?

Shouldn't RETAG first make sure the pointer is not tagged in the first place?

acai said...

This includes the garbage collector, non-optimizing compiler (used for bootstrap), and various runtime services. This code comprises what I call the "Factor VM". Even though it is not a Virtual Machine in the classical sense, since there's no interpreter or bytecode format, it serves a similar role to the VM of scripting languages.

I'm a bit puzzled by this; maybe somebody will volunteer to clear it up.

The Factor VM is no VM in the classical sense (i.e. it just is no Virtual Machine), but it is, apparently, a runtime in any sense you could reasonably come up with. So why is it the Factor VM, rather than the Factor Runtime?