Thursday, July 10, 2008

Optional type declarations for tuple slots

I added a new feature which should help with safety, optimization, and making code more self-documenting: you can now declare types and initial values for tuple slots, and you can also declare a slot to be read-only.

Here is a tuple without any declarations:
TUPLE: person first-name last-name age children ;

Here we add some declarations:
TUPLE: person
    { first-name string }
    { last-name string }
    { age integer }
    { children sequence } ;

Here is a more complex example, from the Clojure-inspired persistent-vectors vocabulary, where we also specify initial values:
TUPLE: persistent-vector
    { count fixnum }
    { root node initial: T{ node f { } 1 } }
    { tail node initial: T{ node f { } 0 } } ;

Initial values are nice because they can eliminate the need for constructors that do nothing but fill in slots with constants.
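
For example, since new fills each slot with its declared initial value, a constructor which only plugs in constants reduces to a one-liner (the word name here is hypothetical), or can be dropped entirely:
: <persistent-vector> ( -- pvec ) persistent-vector new ;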

Finally, slots can be declared read-only, in which case no writer method is generated:
TUPLE: range
    { from read-only }
    { length read-only }
    { step read-only } ;

Of course, we can combine type declarations with initial values or with read-only declarations.
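
For instance, a single slot specifier can carry a type, an initial value and a read-only declaration all at once (this particular tuple is made up for illustration):
TUPLE: circle
    { radius float initial: 1.0 read-only }
    { center read-only } ;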

I don't think this feature will be used much; after all, Factor is still dynamically-typed. However, in some cases it can help catch errors earlier, by raising an error when you store a bad value in a slot, rather than when the value is extracted and operated on. It can also make code easier to understand in cases where the potential values of the slot are not clear from context.
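
For example, with the person tuple above, storing a string in the age slot fails as soon as the writer runs, rather than at some later arithmetic:
person new 42 >>age    ! fine
person new "42" >>age  ! throws an error, since age is declared integer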

Finally, the compiler can use type information to optimize low-level code. This is the possibility I am most excited about. Consider the definition of an I/O buffer from the io.buffers vocabulary:
TUPLE: buffer
    { size fixnum }
    { ptr simple-alien }
    { fill fixnum }
    { pos fixnum }
    disposed ;

The slots are all defined to have extremely low-level types. Now consider a word such as buffer-pop, which removes one character from a buffer:
: buffer-consume ( n buffer -- )
    [ + ] change-pos
    dup [ pos>> ] [ fill>> ] bi <
    [ 0 >>pos 0 >>fill ] unless drop ; inline

: buffer-peek ( buffer -- byte )
    [ ptr>> ] [ pos>> ] bi alien-unsigned-1 ; inline

: buffer-pop ( buffer -- byte )
    [ buffer-peek ] [ 1 swap buffer-consume ] bi ;

HINTS: buffer-pop buffer ;

This word is extremely performance-sensitive, because it is called each time a character is read from a stream. When decoding text data with an encoding such as UTF-8, this word is called in a loop, so performance is critical here. As written, there is a lot of implied method dispatch in there: the accessor words, the calls to < and +, and alien-unsigned-1. The HINTS: declaration tells the compiler to expect that the word will be called with a buffer instance on the stack; this causes the compiler to compile a variant of the word specialized on buffer instances. Specialization eliminates the dispatch on the accessors; however, because the types of the slots were formerly unknown, the words called on the accessors' outputs, such as < and alien-unsigned-1, still had to dispatch. Now that the types are declared, no dispatch is needed at all.

I should note that using type declarations on tuple slots does not always lead to a performance boost; in fact, the added type check in the writer method can reduce performance. If you plan on using this feature to speed up inner loops, make sure the compiler is able to optimize out the type check on the writer as well. In the case of the buffer code above, the compiler is able to prove that all the arithmetic produces fixnums, so the writer checks are elided and everything compiles into very efficient code.

Because the io.buffers code is now compiled more efficiently, performance on the reverse-complement benchmark has improved by a significant margin.

On a somewhat related note, I have begun work on a new code generator for Factor's optimizing compiler.

First, some background. The current compiler converts code into a "dataflow IR" (a tree structure, somewhat similar to CPS and SSA, but essentially just stack code with value annotations). This IR is then analyzed and optimized by a series of passes (type inference, inlining, constant folding, etc) which are applied repeatedly until a fixed-point is achieved.

Once the high-level optimization phase is done, the code generator traverses the IR and generates machine code, performing additional optimizations along the way (tail call optimization, float unboxing, moving stack values to registers). I originally implemented this code generator in April 2006 (1, 2, 3, 4). It replaced an earlier, even more primitive one which essentially just concatenated chunks of machine code (with fixed register usage), followed by a peephole optimization pass.

The new code generator converts the dataflow IR into a control-flow graph of basic blocks of instructions. The instructions are rather low-level, but still machine-independent, and they operate on SSA "virtual registers". A series of optimization passes rewrites the CFG; a linearization pass then flattens it into a single sequence of instructions containing labels and jumps, and finally machine code is generated.

The new code generator operates on instructions, which are much more fine-grained than Factor words, the basic unit of the dataflow IR. Factor primitives such as tag, slot and fixnum+ expand into several instructions. This fine-grained representation, together with the multi-pass nature of the new code generator, will enable many optimizations which the current optimizer cannot do (a rough sketch of the IR data structures follows the list):
  • Global register allocation
  • Common subexpression elimination
  • Loop-invariant code motion
  • Induction variable elimination
  • Alias analysis (eliminating redundant reads and writes of tuple slots)
  • Effective usage of multiple condition registers on PowerPC
  • Reassociation of arithmetic calculations
  • More effective write barrier elimination
  • Instruction scheduling
  • Coalescing multiple memory allocations within a single basic block
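
To give a feel for the shape of the new IR, here is a rough sketch of the kinds of tuples involved; the slot and instruction names are made up for illustration and are not the actual compiler source:
TUPLE: basic-block
    { number integer }
    { instructions sequence }
    { successors sequence }
    { predecessors sequence } ;

TUPLE: ##add
    { dst integer }   ! destination virtual register
    { src1 integer }  ! source virtual registers
    { src2 integer } ;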

I have implemented bits and pieces of the code generator. I'm still reading through some books and papers, and I also need to study the source code of OCaml, LLVM and possibly other compilers in more detail. I hope to finish the new code generator within a month. Programming in Factor after the recent language changes, such as tuple inheritance, new accessors and cleave combinators, is a real joy, even for complex algorithms such as SSA optimization passes.

Contrary to the claims made by stack language detractors (who have never actually tried programming in one...), dealing with the stack is never a problem; indeed, it actually helps a lot with organizing code and passing data between words.

Amusingly enough, this is the third time in the last 3 years that I'm working on a major overhaul of the code generator stage, whereas the high-level optimizer has just evolved organically, and is still in pretty good shape. I guess I got that part right the first time.

The new code generator will take Factor's performance to the next level. Once it is completed, I will consider submitting Factor to the computer language shootout to kick some butt.

Monday, July 07, 2008

Factor crash with GCC 4.3.x resolved

A few months ago, James Cash discovered that GCC 4.3.0 incorrectly compiles the Factor VM with -O3 (the default). The problem still exists in GCC 4.3.1, apparently on both x86 and x86-64. I finally found a very simple test case and submitted a bug report.

Saturday, July 05, 2008

BLAS bindings and high-level wrapper library

Joe Groff has implemented a BLAS binding in Factor. This is really cool stuff, and should enable some neat applications which require high-performance linear algebra. Joe writes,
One day, Factor's compiler will generate such incomprehensibly fast code for math.vectors and math.matrices that nobody will even consider using C or Fortran ever again. In the meantime, you can now call out to everyone's favorite Fortran fast-math library through a few hopefully easy-to-use vocabs -- check out math.blas.vectors, math.blas.matrices, and, for purists, math.blas.cblas.

The code is now in our git repository.

On shaking trees

The application deploy tool was first added a year ago, and I've made steady improvements since then (see 1, 2, 3, 4, 5). The basic idea is that it takes a Factor image, loads your program in, then proceeds to strip out all the code for the parser, compiler, and reflection. After this is done, a GC pass eliminates unreachable code, and the only remaining code in the image is that which is directly referenced by the main word of your program.
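
Deploying a vocabulary is a one-liner from the listener; something along these lines, assuming the tools.deploy vocabulary and the tetris demo vocab:
USE: tools.deploy
"tetris" deploy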

About half a year ago, I added some tests which deploy various demos shipped with Factor, then ensure that the resulting binary is below a certain size. These tests always seem to start failing after I make major changes to the Factor core or compiler, because suddenly programs start pulling in more code than they should. It is certainly frustrating to add a new language feature, debug it, then realize that all the tests pass except the deploy tests. Getting them to pass again sometimes involves a fair bit of effort, because I have to think of new tricks to reduce the image size, and I'm pretty anal about not cheating by just changing the size limits in the tests. That is because it is a very satisfying feeling once I figure out a new trick and the tests pass again.
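
Conceptually, each of these tests just deploys a demo and asserts a bound on the image size, along these lines (deployed-image-size is a made-up helper standing in for the real test harness, and the bound is purely illustrative):
USING: tools.test ;
{ t } [ "tetris" deployed-image-size 1500000 <= ] unit-test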

For example, I just added optional type declarations to tuple slots, and rewrote a large portion of the low-level tuple class machinery. As usual, the deployed images blew up. My tool for figuring out how to reduce image size is Factor's low-level debugger, which can give a dump of the heap, find references to objects, and dump individual objects.

I started by moving bit-arrays, float-arrays and the inspector out of the core, so that they are not loaded by default. This wasn't enough, however. The next step was having the tree shaker prune some word properties which are only used when loading new code, not at runtime. Then, I noticed that there were a lot of duplicate, immutable byte arrays in the heap. These hold the relocation info generated by the compiler to tell the VM about dependencies between code blocks, code GC information, and so on. It turns out that for a lot of different code blocks, the generated relocation info was the same. Replacing duplicate relocation blocks with references to a single shared object, as the last stage in the tree shaker, was enough to make the tests pass again. I also applied the same optimization to quotations and strings for good measure. After all these changes, the resulting reduction in deployed image sizes was so drastic that I went ahead and changed the deploy tests to tighten the size bounds even further -- I guess I'm just masochistic like that.
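
The deduplication trick itself amounts to interning structurally-equal objects through a hashtable; conceptually, something like this sketch (canonicalize is a made-up word, not the actual tree-shaker code):
USING: assocs kernel sequences ;
! Map each object to the first equal object seen, so later
! duplicates end up sharing storage with the first copy.
: canonicalize ( objects -- objects' )
    H{ } clone swap [ over [ ] cache ] map nip ;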

The deployed application sizes are pretty large compared to languages such as C and C++, however they're certainly much smaller than the dynamic language competition. The Tetris demo tree-shakes down to 1Mb; a console "hello world" is 340Kb; a GUI "hello world" is also around 1Mb; and the "bunny" demo, which downloads a 3d model via HTTP and renders it with OpenGL, tree-shakes down to 2Mb. In addition to the tree-shaken image, the only other dependency is the Factor VM (<200Kb), and on Mac OS X and Windows, the FreeType library (this dependency will go away at some point in the future). Furthermore, you get a nice double-clickable .app on Mac OS X so that the user is not even aware they're running something written in Factor; it feels like any other native executable.

I believe the ability to deploy small, stand-alone binaries which do not contain any source code and do not depend on an external runtime environment is an essential capability for a general-purpose language. This is also something that many other language environments do poorly; Java requires you to bundle a large (at least 8Mb) runtime environment, Python and Ruby have no way of distributing code without the source, and the free Lisp environments can usually save customized images but don't attempt to strip them down in any way, leaving you with a 25Mb "Hello World".

Factor has a lot of great features: an interactive development environment, stand-alone application deployment, the web framework, SSL, database libraries, Unicode 5.1, XML, etc, not to mention automated binary builds with continuous testing on 11 platforms. We've come a long way in less than 5 years, and I believe this is all thanks to the incredible productivity offered by the Factor language itself. This pace of development is not going to let up any time soon, and all the people who ridiculed "yet another programming language" in Factor's early days are only going to look more and more foolish as time goes on.

Thursday, July 03, 2008

Don't set shared file descriptors to non-blocking

I learned something about Unix today which is probably common knowledge to many people, but was the first time I bumped into it: you shouldn't set shared (or inherited) file descriptors to non-blocking I/O mode.

After using the io.launcher library on Unix, the stdin stream became a blocking file descriptor again, and thus Factor threads did not run while the listener was waiting for input.

This was happening because the launcher forks, disables non-blocking mode on the stdio descriptors, then execs the child process; the trouble is, I didn't realize that the non-blocking flag is shared between the two processes, much like the file pointer is shared.

The correct solution, then, is to have the Factor VM start a thread which reads from stdin and writes to a pipe. The other end of the pipe can be opened in non-blocking mode and made available to Factor's multiplexer.

On Windows, we already have this problem, because the command prompt cannot be used with I/O completion ports. It looks like it is time to implement threaded blocking I/O and solve this problem on both platforms.