Monday, September 10, 2007

Two-tier compilation comes to Factor

Factor 0.90 uses the following execution model:
  • Words with a static stack effect are compiled to machine code by the optimizing compiler.
  • Words which use continuations but otherwise have static stack effects run in the interpreter.
  • Words with dynamic stack effects run in the interpreter.
  • Primitives which modify the code heap are considered to have a dynamic stack effect, only to make sure that words calling them do not compile. This is because adding a new compiled code block can trigger a code GC, and code GC is not smart enough to scan the native call stack to identify return addresses as roots.

The interpreter is a rather inefficient affair, quotations are stored on an interpreter callstack; execution of a quotation proceeds by inspecting each element. If the element is a word, it is executed, if it is a wrapped word, it is unwrapped, and otherwise, it is a literal pushed on the data stack. This entails a runtime type dispatch for each element of the quotation.

What I'm working on right now is a new execution model. The code isn't ready yet and probably won't be for another week, but here is how everything works so far:
  • Words with a static stack effect are compiled to machine code by the optimizing compiler, just like before.
  • Words which use continuations but otherwise have static stack effects are compiled to machine code by the optimizing compiler - this is new!
  • Words with dynamic stack effects are compiled to machine code by the new non-optimizing compiler.
  • The GC now knows about native stack frames and is now able to traverse them when scanning for roots.

This is really hardcore -- Factor does not have an interpreter of any form now, except for the single-stepper in tools.interpreter. The amount of code which is compiled optimized has increased now that I/O can compile (I/O uses continuations under the hood to implement non-blocking, and would run in the interpreter; a significant performance problem). Not only that, but the remaining, truly dynamic code is compiled by a "mop-up" compiler.

Ages ago, quotations used to be linked lists. When I made them a distinct type of their own, I had this form of "quotation compilation" in mind as an eventual possibility. See this old article, and this one. It is good that more than a year later I've finally carried over this transition to its logical conclusion.

Quotations are still sequences, but are now immutable, and have an additional slot holding a pointer to a compiled code block.

The new non-optimizing compiler is written in C due to bootstrap reasons, but it is very trivial: it simply unrolls the interpreter loop, eliminating the dynamic dispatch, and using the native call stack instead of a fake interpreter call stack. Everything it compiles is a concatenation of a fixed number of basic blocks, which are assembled and provided by Factor code. So the C side is essentially just a glorified use-case of memcpy(). For example,
[ 2 + * ]
compiles to very naive machine code which performs the following actions:
  • A stack frame prologue -- this is omitted if the only calls to words are in tail position
  • Push literal 2
  • Call + word
  • Tail call (jump to) * word

While the generated code is only 2x faster than the interpreter, compile time is very fast and I don't expect any performance problems for code which constructs quotations on the fly.

Because Factor now controls the stack frame layout from top to bottom, implementing capture and restoration of compiled call stacks is relatively trivial. Scanning them for GC roots is a bit trickier. Note that Factor's GC is not conservative in any way; it manages to precisely identify all roots. Some language implementors claim that conservative collection is the only way to scavenge the native stack for roots, but this is false.

And not having an interpreter is a political victory of sorts. Factor is not an interpreted language, period -- it's thoroughly natively compiled.

The Self language uses a similar, but more advanced implementation strategy; a fast, non-inlining compiler compiles all code, and a highly optimizing compiler compiles hot-spots. Factor's optimizing compiler compiles all code for which it can infer a stack effect, and it doesn't try to do runtime type feedback.

7 comments:

Crest said...

How high is the syncronisation overhead with split L1 caches for short quotations?

Slava Pestov said...

crest: can you elaborate?

george said...
This comment has been removed by the author.
george said...

Nice. Probably just as fast or faster than Java for anything more than toy benchmarks because Java's GC is slow as a dog. Chicken is a nice implementation of scheme witch is faster on AMD machines because of the amount of GC necessary for its implementation to not get a stack overflow.

Slava Pestov said...

george: unfortunately we still have some ways to go before we're as fast as sbcl.

Instead of a switch to enable/disable the optimizing compiler, I plan to make the optimizing compiler faster (in terms of compile time).

As for a paper, well, I'm going to document the optimizing compiler implementation when things settle down. Not sure about a paper; maybe one day.

george said...

Sorry i got too excited. What i would like is a way to prevent recompletion of the whole frickin thing every time you deploy an app thats whats annoying about completion maybe. whats the speed ratio with ruby/python/perl they should get creamed. I think your about as fast as cmucl because that is stagnated over the last few years as sbcl won the fork.

george said...

But hey at least the C takes 10 sec's max to compile which is amazing. You can't say that about many languages.