Thursday, July 10, 2008

Optional type declarations for tuple slots

I added a new feature which should help with safety, optimization, and making code more self-documenting: you can now declare types and initial values for tuple slots, and you can also declare a slot to be read-only.

Here is a tuple without any declarations:
TUPLE: person first-name last-name age children ;
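Each slot generates a reader (for example age>>) and a writer (>>age). With no declarations, the writers accept any value; a quick sketch at the listener:

person new
    "Alice" >>first-name
    "Example" >>last-name
    "not a number" >>age ! accepted, since the slot is undeclared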

Here we add some declarations:
TUPLE: person
    { first-name string }
    { last-name string }
    { age integer }
    { children sequence } ;

Here is a more complex example from the Clojure-inspired persistent-vectors vocabulary where we specify initial values:
TUPLE: persistent-vector
    { count fixnum }
    { root node initial: T{ node f { } 1 } }
    { tail node initial: T{ node f { } 0 } } ;

Initial values are nice because they can eliminate the need for constructors that simply fill in slots with constants.
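For example, without those initial: declarations one would typically write a constructor word just to fill in the defaults; a rough sketch (the constructor name here is only for illustration):

! Only needed if root and tail lack initial: declarations
: <persistent-vector> ( -- pvec )
    persistent-vector new
        T{ node f { } 1 } >>root
        T{ node f { } 0 } >>tail ;

With the declarations in place, persistent-vector new already returns an instance with those slots filled in.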

Finally, slots can be declared read-only, in which case no writer method is generated:
TUPLE: range
    { from read-only }
    { length read-only }
    { step read-only } ;
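Since no writers exist for read-only slots, instances are constructed with boa (or a tuple literal), which fills in all the slots directly; a minimal sketch (this particular <range> definition is just for illustration):

: <range> ( from length step -- range ) range boa ;

T{ range f 0 10 1 } from>> ! the reader is still generated; this pushes 0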

Of course, we can combine type declarations with initial values or with read-only declarations.
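For instance, a single slot specifier can carry a class, an initial value, and the read-only attribute all at once; a hypothetical example:

TUPLE: counter
    { value integer initial: 0 }
    { step integer initial: 1 read-only } ;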

I don't think this feature will be used much; after all, Factor is still dynamically typed. However, in some cases it can help catch errors earlier, by raising an error when you store a bad value in a slot rather than later, when the value is extracted and operated on. It can also make code easier to understand in cases where the potential values of a slot are not clear from context.
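With the person tuple declared above, for example, storing a non-integer into the age slot fails immediately in the writer rather than at some later use of age>> (a sketch; the exact error message is beside the point):

person new 42 >>age ! fine, 42 is an integer
person new "forty-two" >>age ! error raised here, at the write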

Finally, the compiler can use type information to optimize low-level code. This is the possibility I am most excited about. Consider the definition of an I/O buffer from the io.buffers vocabulary:
TUPLE: buffer
    { size fixnum }
    { ptr simple-alien }
    { fill fixnum }
    { pos fixnum }
    disposed ;

The slots are all defined to have extremely low-level types. Now consider a word such as buffer-pop, which removes one character from a buffer:
: buffer-consume ( n buffer -- )
    [ + ] change-pos
    dup [ pos>> ] [ fill>> ] bi <
    [ 0 >>pos 0 >>fill ] unless drop ; inline

: buffer-peek ( buffer -- byte )
    [ ptr>> ] [ pos>> ] bi alien-unsigned-1 ; inline

: buffer-pop ( buffer -- byte )
    [ buffer-peek ] [ 1 swap buffer-consume ] bi ;

HINTS: buffer-pop buffer ;

This word is extremely performance-sensitive because it is called every time a character is read from a stream; when decoding text with an encoding such as UTF-8, it runs in a tight loop, so performance is critical here. As written, there is a lot of implied method dispatch in there: the accessor words, the calls to < and +, and alien-unsigned-1. The HINTS: declaration tells the compiler to expect that the word will be called with a buffer instance on the stack, so it compiles a variant of the word specialized on buffer instances. This eliminates the dispatch in the accessors; however, because the types of the slots were previously unknown, the words called on the accessors' outputs, such as < and alien-unsigned-1, still had to dispatch. Now that the slot types are declared, no dispatch is needed at all.

I should note that using type declarations on tuple slots does not always lead to a performance boost; in fact, the added type check in the writer method can reduce performance. If you plan on using this feature to speed up inner loops, make sure the compiler is able to optimize out the type check on the writer as well. In the case of the buffer code above, the compiler can prove that all of the arithmetic only produces fixnums, so the writers perform no type checks and everything compiles into very efficient code.

Because the io.buffers code is now compiled more efficiently, performance on the reverse-complement benchmark has improved by a significant margin.

On a somewhat related note, I have begun work on a new code generator for Factor's optimizing compiler.

First, some background. The current compiler converts code into a "dataflow IR" (a tree structure, somewhat similar to CPS and SSA, but essentially just stack code with value annotations). This IR is then analyzed and optimized by a series of passes (type inference, inlining, constant folding, and so on) which are applied repeatedly until a fixed point is reached.

Once the high-level optimization phase is done, the code generator traverses the IR and emits machine code, performing additional optimizations along the way (tail call optimization, float unboxing, moving stack values to registers). I originally implemented this code generator in April 2006 (1, 2, 3, 4). It replaced an earlier, even more primitive code generator which essentially just concatenated chunks of machine code (with fixed register usage), followed by a peephole optimization pass.

The new code generator converts the dataflow IR into a control-flow graph of basic blocks of instructions. The instructions are rather low-level, but still machine-independent, and they operate on SSA "virtual registers". A series of optimization passes rewrite the CFG; a linearization pass then flattens the CFG into a single sequence of instructions containing labels and jumps, and finally machine code is generated.

The new code generator operates on instructions, which are much more fine-grained than Factor words, the basic unit of the dataflow IR. Factor primitives such as tag, slot and fixnum+ each expand into several instructions. This fine-grained representation, together with the multi-pass nature of the new code generator, will enable many optimizations the current optimizer cannot perform:
  • Global register allocation
  • Common subexpression elimination
  • Loop-invariant code motion
  • Induction variable elimination
  • Alias analysis (eliminating redundant reads and writes of tuple slots)
  • Effective usage of multiple condition registers on PowerPC
  • Reassociation of arithmetic calculations
  • More effective write barrier elimination
  • Instruction scheduling
  • Coalescing multiple memory allocations within a single basic block

I have implemented bits and pieces of the code generator. I'm still reading through some books and papers, and I also need to study the source code of OCaml, LLVM, and possibly other compilers in more detail. I hope to finish the new code generator within a month. Programming in Factor after the recent language changes (tuple inheritance, new accessors, cleave combinators, and so on) is a real joy, even for complex algorithms such as SSA optimization passes.

Contrary to the claims made by stack language detractors (who have never actually tried programming in one), dealing with the stack is never a problem; in fact, it helps a lot with organizing code and passing data between words.

Amusingly enough, this is the third time in the last three years that I've worked on a major overhaul of the code generation stage, whereas the high-level optimizer has just evolved organically and is still in pretty good shape. I guess I got that part right the first time.

The new code generator will take Factor's performance to the next level. Once it is completed, I will consider submitting Factor to the computer language shootout to kick some butt.
