Wednesday, July 28, 2010

Overhauling Factor's C library interface

For the last while I've been working on an overhaul of Factor's C library interface and associated compiler infrastructure. The goals of the rewrite were three-fold:

  • Improving performance
  • Cleaning up some crusty old code that dates back to my earliest experiments with native code generation
  • Laying the groundwork for future compiler improvements

These changes were all behind-the-scenes; I did not introduce any new functionality or language changes, and I think Factor's FFI is already quite stable and feature-complete.

The previous FFI implementation

I started work on Factor's C library interface (FFI) almost immediately after I bootstrapped the native implementation off of JFactor. I began experimenting with an SDL-based UI early on and immediately decided I wanted to have a real FFI, instead of extending the VM with primitives which wrap C functions by hand.

Over time, both the FFI and compiler have evolved in parallel, but there was little integration between them, other than the fact that both used the same assembler DSL under the hood. The result is that FFI calls were generated in a somewhat inefficient way. Since the optimizing compiler had no knowledge of how they work, it would save all values to the data stack first. Then the FFI code generator would take over; it would pop the input parameters one by one, pass them to a per-type C function in the Factor VM which unboxed the value, then stored the value in the stack frame. Finally, the target C function would be invoked, then another C function in the VM would box the return value, and the return value would be pushed to the stack. The optimizing compiler would then take over, possibily generating code to pop the value from the stack.

The redundant stack traffic was wasteful. In some cases, the optimizing compiler would generate code to box a value and push it to the stack, only to have the FFI then generate code to pop it and unbox it immediately after. To make matters worse, over time the optimizing compiler gained the ability to box and unbox values with open-coded assembly sequences, but the FFI would still make calls into the VM to do it.

All in all, it was about time I rewrote the FFI, modernizing it and integrating it with the rest of the compiler in the process.

Factoring FFI calls into simpler instructions

Most low-level IR instructions are very simple; FFI calls used to be the exception. Now I've split these up into smaller instructions. Parameters and return values are now read and written from the stack with the same ##peek and ##replace instructions that everything else uses, and boxing and unboxing parameters and return values is now done with the representation selection pass. A couple of oddball C types, such as long long on x86-32, still require VM calls to box and unbox, and I added new instructions for those.

One slightly tricky thing that came up was that I had to re-engineer low-level IR to support instructions with multiple output values. This is required for C calls which return structs and long long types by value, since each word-size chunk of the return value is returned in its own register, and these chunks have to be re-assembled later. In the future, I will be able to use this support to add instructions such as the x86 division instruction, which computes x / y and x mod y simultaneously.

I also had to change low-level IR to distinguish between instructions with side effects and those without. Previously, optimization passes would assume any instruction with an output value did not have side effects, and could be deleted if the output value was not used. This is no longer true for C calls; a C function might both have a side effect and return a value.

GC maps

Now that FFI calls no longer force the optimizer to sync all live values to the data and retain stacks, it can happen that SSA values are live across an FFI call. These values get spilled to the call stack by the register allocator. Spilling to the call stack is cheaper than spilling to the data and retain stacks, because floating point values and integers do not need to be boxed, and since the spilling is done later in the optimization process, optimizations can proceed across the call site instead of being stumped by pushes and pops on either side.

However, since FFI calls can invoke callbacks, which in turn run Factor code, which can trigger a garbage collection, the garbage collector must be able to identify spill slots in call frames which contain tagged pointers.

The technique I'm using is to record "GC maps". This idea comes from a paper titled Compiler Support for Garbage Collection in a Statically Typed Language (Factor is dynamically typed, and the paper itself doesn't have anything specific to static typing in it, so I found the title a bit odd). The Java HotSpot VM uses the same technique. The basic idea is that for every call site, you record a bitmap indicating which spill slots contain tagged pointers. This information is then stored in a per-code-block map, where the keys are return addresses and the values are these bitmaps.

In the future, I intend on using GC maps at call sites of Factor words as well, instead of spilling temporary values to the retain stack; then I can eliminate the retain stack altogether, freeing up a register. After this is done the data stack will only be used to pass parameters between words, and not to store temporaries within a word. This will allow more values to be unboxed in more situations, and it will improve accuracy of compiler analyses.

In fact, getting GC maps worked out was my primary motivation for this FFI rewrite; the code cleanups and performance improvements were just gravy.

Callback support improvements

Another one of those things that only makes sense when you look at how Factor evolved is that the body of an FFI callback was compiled with the non-optimizing compiler, rather than the optimizing compiler. It used to be that only certain definitions could be optimized, because static stack effects were optional and there were many common idioms which did not have static stack effects. It has been more than a year since I undertook the engineering effort to make the compiler enforce static stack safety, and the implementation of callbacks was the last vestigial remnant from the bad old days.

This design made callbacks harder to debug than they should be; if you used up too many parameters, or forgot to push a return value, you'd be greeted with a runtime error instead of a compiler error. Now this has been fixed, and callbacks are fully compiled with the optimizing compiler.

1 comment:

zimbatm said...

Very interesting post, as usual. Only missing section is about performance measurements to support the theory.