Sunday, September 30, 2007

Improved intrinsics for alien accessors

Chris Double's Ogg Theora video player has served as a constant inspiration for ways to improve Factor's compiler. The key bottleneck is the YUV to RGB conversion.

In Factor 0.89, I implemented interval inference; this let the compiler convert some arithmetic operations to machine arithmetic.

In Factor 0.90, I added open-coded intrinsics for alien accessor words (words for reading/writing C values in memory); this allowed more values in the inner loop to be stored in registers, without incurring calls into the VM together with save/restore of values to the data stack.

In the last few days, I implemented some representation inference, which allows intermediate alien values to be stored as unboxed pointers. In the YUV to RGB inner loop, a pointer is read from a C struct; this pointer is then dereferenced. Prior to this, the pointer would be boxed in an alien instance when read, then unboxed when dereferenced. The compiler is now able to optimize this out.

The improvements in performance over the last year have been amazing. I don't remember how slow the Ogg player was when it was first added, but we're talking unusable slow -- something like 200 ms to convert a 480x320 frame of video from YUV to RGB on my Power Mac G5. When interval inference was added, the time went down to about 100ms per frame. Open-coded alien accessors improved this down to 30ms. And now the recent representation inference has again improved the decode time down to 15ms.

During this time the YUV to RGB conversion algorithm has not changed much; it still sits at the same level of abstraction (generic arithmetic, etc.) The compiler has done all the hard work of making it faster.

I still plan on working on the compiler in the future with the goal of making YUV to RGB faster. Now that the low-hanging fruit have been picked, and the YUV to RGB inner loop no longer calls into the VM or allocates memory, further improvements will come from better register allocation and instruction selection.

Friday, September 28, 2007

First impressions of git

So far, I like git a lot. My favorite feature is branches, and the goodies that come with that, such as git fetch, which pulls patches from a remote repository into a branch. I like using git add to flag files to commit more than the darcs "cherry picking", which is tedious and error-prone. I also like it how git status shows me any files which have not been added yet.

The gitk GUI, while not great, is a nice way to get an overview of what's going on with the repository.

Of course the best thing about git is the speed. The Linux Kernel developers use git, and they have a huge codebase with many developers. I'm confident that I've made the right choice by switching to git.

One thing I don't like about git is the complexity -- there are hundreds of commands and it certainly seems like the user interface could be simplified, for the common operations anyway. I'm slightly worried that the increased learning curve might be a hindrance to getting contributors in the future. However, having to wait 10 minutes to darcs push is a turn-off to contributing in itself.

Thursday, September 27, 2007

A neat trick for refactoring code

So today I'm going to be cleaning up one of the vocabularies involved in code generation in the optimizing compiler. Each vocabulary has a set of words which are used by other vocabularies -- the "public interface", and implementation words, those only used to implement the interface. I want to rewrite the generator.registers vocabulary but keep the public interface as close to its current state as possible. Trouble is, the public interface is not well-defined here; this vocabulary does not use <PRIVATE and PRIVATE>. However, this is not a big problem, since I can just write a tool:
: used-externally ( vocab -- words )
dup words [
usage
[ word? ] subset
[ word-vocabulary = not ] curry* contains?
] curry* subset natural-sort ;

How does this work? It takes the list of words in a vocabulary, then for each word, checks if the usages of that word include any words outside of this vocabulary. If the answer is affirmative, it adds this word to the sequence that is returned.

Using this tool I found out that generator.registers vocabulary only has a handful of words which are used externally, and so it should be easy to do a radical refactoring which doesn't require many changes to other modules.

It is hard to articulate how powerful this is. Not only is the Factor listener is a great way of testing code, it also provides a way to interact with the system on a meta-level. I wonder how much work it would be to implement something like used-externally in a conventional IDE such as Eclipse; I'm sure all the pieces are there (cross-referencing, etc) but it can't be that easy to just glue things together like that.

Saturday, September 22, 2007

How to publish a git repository

Now that I'm using git, all the contributors have to switch, too. Chris Double wrote a blog entry about publishing a git repository. Thanks Chris!

The Factor project switches to git

Why git? Because it is faster than darcs, and has kick-ass branching features.

To check out a repo:
git clone git://factorcode.org/git/factor.git/

You can find documentation at the git website.

A gitweb interface is coming really soon; we have to implement cgi support for the Factor HTTP server first.

The code in the repository runs on the following platforms right now; more will be added very soon:
  • Mac OS X/PowerPC
  • Mac OS X/x86
  • Linux/x86
  • Linux/AMD64
  • FreeBSD/x86

You will need the latest boot images.

The old darcs repository is still there. It contains the last revision of Factor to include an interpreter, and so represents a historical milestone of sorts.

Friday, September 21, 2007

Factor web site redesign

Thanks to Elie Chaftari, we have a new Factor web site design. Check it out before it goes live! (Which will probably happen with the release of Factor 0.91).

The "Latest News" sidebar is just a placeholder; I will replace it with an RSS aggregator soon.

Also if you look carefully, you'll see mention of a git repository. Yes, I'm switching to git! Today I encountered a darcs bug/flaw/limitation while merging in Eduardo's patches (it was not Ed's fault) and decided that this would be the very last time.

Right now the repository contains the in-progress quotation compiler/compiled continuations code, which only runs on Mac OS X/PowerPC, FreeBSD/x86 and (just barely) Mac OS X/x86. When it is more stable and supports a wider range of platforms, I will make an announcement and formally retire the darcs repository.

Contributors: please continue using darcs for the next few days, but be prepared to migrate to git Real Soon Now.

Thursday, September 20, 2007

Monday, September 17, 2007

Bignum libraries

So does anybody know of a simple, efficient, BSD-licensed bignum library where the radix size is equal to the CPU's word size, and not the CPU's word size minus two bits?

Factor x86 calling convention

I'm working on the x86 backend for the quotation compiler. This has given me an opportunity to tweak the calling convention used by compiled Factor code. I will outline the calling convention I use, and highlight the main differences from the standard C calling convention.

Registers are assigned as follows:
  • EAX - volatile scratch
  • ECX - volatile scratch
  • EDX - volatile scratch
  • EBX - volatile scratch
  • EBP - volatile scratch
  • ESI - data stack pointer
  • EDI - retain stack pointer
  • ESP - callstack pointer
Factor's compiler does not generate code which saves values in non-volatile registers, because that would complicate matters for the GC. Instead, the only location where values may be saved between subroutine calls is the data and retain stacks. So I use EBX and EBP as volatile scratch registers, since otherwise they'd be unused.

Stack frames are fixed-size, 16 bytes each, with no explicit back-chain pointers. Stack frames have the following layout:
  • 4 bytes current quotation position - used by code compiled with quotation compiler
  • 4 bytes current quotation - used by code compiled with quotation compiler
  • Pointer to start current of current code block - used by code GC as a root when scavenging the call stack
  • Return address for callee

Unlike typical x86 code which passes parameters on the call stack, I use registers to pass parameters to quotations and words.

Quotations are called with the following values in registers:
  • EAX - a tagged pointer to the quotation itself
  • ECX - a pointer to the start of the quotation's code block. The quotation's compiled definition saves this in the stack frame

Compound definitions and symbols are called with the following values in registers:
  • EAX - the word itself

Primitive words are called with the following values in registers:
  • EDX - the top of the call stack before the primitive call

Compiled words can be called without any parameters in registers.

Compiled words are responsible for a pointer to the start of their own in their stack frame, however the caller of a compiled quotation must do this for the quotation. This is because right now, compiled quotations must be fully position independent without requiring any relocation.

Quick poll of x86 users

Does anybody care about Factor running on chips older than the Pentium Pro? I would like to start using the "conditional move" instructions which are not present on Pentium MMX and older chips.

Thursday, September 13, 2007

Software engineering best practices

I submitted a bug via Apple's bug tracker 4 months ago. The bug is a data corruption issue in the kernel/libc and it is very easy to reproduce. I also suspect it is easy to fix, since it is a case of saving/restoring some registers in signal handlers. Apple has not fixed it, or even acknowledged that it exists. I'm willing to bet $100 that nobody at Apple has even read the bug report. It is very likely the entire public bug reporter is just for show, sort of like Sun's "bug parade".

As for the infamous gcc x86 backend bug, well the one I submitted was closed as a duplicate... of a bug which has been open since 2003.

Update: my friend at Apple tells me my bug was fixed in June, but I'd have to be a Premium or Select ADC member to see this, since apparently even bug fix notifications are covered by the Leopard beta NDA cloud.

Monday, September 10, 2007

Two minor incompatible changes as a result of two-tier compilation

Surpringly such a large overhaul of the VM has not resulted in many incompatible changes. There are two I'm aware of:
  • Quotations are immutable now -- I've mentioned already.
  • Passing f to call or if is an error now. Being able to call f was a hold-over from the cons cell days when f was the empty quotation. These days this has required an unnecessary type dispatch, and it is what I consider a rather awkward pun. So this is a type error now.

Two-tier compilation comes to Factor

Factor 0.90 uses the following execution model:
  • Words with a static stack effect are compiled to machine code by the optimizing compiler.
  • Words which use continuations but otherwise have static stack effects run in the interpreter.
  • Words with dynamic stack effects run in the interpreter.
  • Primitives which modify the code heap are considered to have a dynamic stack effect, only to make sure that words calling them do not compile. This is because adding a new compiled code block can trigger a code GC, and code GC is not smart enough to scan the native call stack to identify return addresses as roots.

The interpreter is a rather inefficient affair, quotations are stored on an interpreter callstack; execution of a quotation proceeds by inspecting each element. If the element is a word, it is executed, if it is a wrapped word, it is unwrapped, and otherwise, it is a literal pushed on the data stack. This entails a runtime type dispatch for each element of the quotation.

What I'm working on right now is a new execution model. The code isn't ready yet and probably won't be for another week, but here is how everything works so far:
  • Words with a static stack effect are compiled to machine code by the optimizing compiler, just like before.
  • Words which use continuations but otherwise have static stack effects are compiled to machine code by the optimizing compiler - this is new!
  • Words with dynamic stack effects are compiled to machine code by the new non-optimizing compiler.
  • The GC now knows about native stack frames and is now able to traverse them when scanning for roots.

This is really hardcore -- Factor does not have an interpreter of any form now, except for the single-stepper in tools.interpreter. The amount of code which is compiled optimized has increased now that I/O can compile (I/O uses continuations under the hood to implement non-blocking, and would run in the interpreter; a significant performance problem). Not only that, but the remaining, truly dynamic code is compiled by a "mop-up" compiler.

Ages ago, quotations used to be linked lists. When I made them a distinct type of their own, I had this form of "quotation compilation" in mind as an eventual possibility. See this old article, and this one. It is good that more than a year later I've finally carried over this transition to its logical conclusion.

Quotations are still sequences, but are now immutable, and have an additional slot holding a pointer to a compiled code block.

The new non-optimizing compiler is written in C due to bootstrap reasons, but it is very trivial: it simply unrolls the interpreter loop, eliminating the dynamic dispatch, and using the native call stack instead of a fake interpreter call stack. Everything it compiles is a concatenation of a fixed number of basic blocks, which are assembled and provided by Factor code. So the C side is essentially just a glorified use-case of memcpy(). For example,
[ 2 + * ]
compiles to very naive machine code which performs the following actions:
  • A stack frame prologue -- this is omitted if the only calls to words are in tail position
  • Push literal 2
  • Call + word
  • Tail call (jump to) * word

While the generated code is only 2x faster than the interpreter, compile time is very fast and I don't expect any performance problems for code which constructs quotations on the fly.

Because Factor now controls the stack frame layout from top to bottom, implementing capture and restoration of compiled call stacks is relatively trivial. Scanning them for GC roots is a bit trickier. Note that Factor's GC is not conservative in any way; it manages to precisely identify all roots. Some language implementors claim that conservative collection is the only way to scavenge the native stack for roots, but this is false.

And not having an interpreter is a political victory of sorts. Factor is not an interpreted language, period -- it's thoroughly natively compiled.

The Self language uses a similar, but more advanced implementation strategy; a fast, non-inlining compiler compiles all code, and a highly optimizing compiler compiles hot-spots. Factor's optimizing compiler compiles all code for which it can infer a stack effect, and it doesn't try to do runtime type feedback.

Sunday, September 02, 2007

Factor is now four years old

Time flies! I started working on Factor early September 2003. I can't believe that it has been this long already. I first announced Factor to the Internet in April 2004, after it had already been in development for some time. Here is the original announcement.

In the last year, we had a stready stream of releases, from Factor 0.84 to Factor 0.90, bringing many improvements to functionality, stability and performance. There are also several new contributors on board, which is great. While Factor 1.0 still seems far away, I feel that I'm making good progress and getting closer to achieving my goals.

Here are some code line counts:
  • Factor VM (written in C): 10285 lines
  • Factor core (written in Factor, core/ directory): 21016 lines
  • Factor libraries (written in Factor, extra/ directory): 58853 lines
  • Unit tests: 10101 lines
  • Documentation: 187405 words

Compared to the stats from last year, we see that:
  • the VM has only grown by 2000 lines (yet has evolved considerably in the time)
  • the core actually shrank because of better coding practices, and because the UI was moved out of the core
  • the amount of contributed libraries has more than doubled!
  • the documentation and unit tests have grown considerably too

I think this shows that we're moving in the right direction. I hope to release Factor 1.0 some time next year, and make it a 1.0 release worth the five year wait: an expressive language, with excellent performance, solid libraries, powerful tools and a robust implementation. And after I get a chance to build some large-scale applications with Factor and draw upon the experience gained, the language will continue to evolve into Factor 2.0. Watch out.

Saturday, September 01, 2007

Factor 0.90 now available

Factor 0.90 is now available. The web site hasn't been updated yet, so you can't browse the docs online right now; so instead just download it and read the docs in the UI!

Lots of changes:
Core
  • New module system. (Eduardo Cavazos)
  • Tuple constructors are defined differently now.
  • Mixin classes implemented; these are essentially extensible unions.
  • New float-array data type implements a space-efficient sequence of floats.
  • Moved <file-appender>, delete-file, make-directory, delete-directory words from libs/io into the core, and fixed them to work on more platforms.
  • New host-name word.
  • The directory word now outputs an array of pairs, with the second element of each pair indicating if that entry is a subdirectory. This saves an unnecessary stat call when traversing directory hierarchies, which speeds things up.
  • IPv6 is now supported, along with Unix domain sockets (the latter on Unix systems only). The stack effects of <client> and <server> have changed, since they now take generic address specifiers; see Networking.
  • The stage 2 bootstrap process is more flexible, and various subsystems such as help, tools and the UI can be omitted by supplying command line switches; see Command line switches for bootstrap.
  • The -shell command line switch has been replaced by a -run command line switch; see Command line switches for general usage.
  • Variable usage inference has been removed; the infer word no longer reports this information.
Tools
  • Stand-alone image deployment in tools.deploy vocabulary .
  • Stand-alone application bundle deployment on Mac OS X in tools.deploy.app vocabulary.
  • New vocabulary browser tool in the UI.
  • New profiler tool in the UI.
Extras
    Most existing libraries were improved when ported to the new module system; the most notable changes include:
  • asn1: ASN1 parser and writer. (Elie Chaftari)
  • benchmarks: new set of benchmarks.
  • cfdg: Context-free design grammar implementation; see http://www.chriscoyne.com/cfdg/. (Eduardo Cavazos)
  • cryptlib: Cryptlib library binding. (Elie Chaftari)
  • cryptlib.streams: Streams which perform SSL encryption and decryption. (Matthew Willis)
  • hints: Give type specialization hints to the compiler.
  • inverse: Invertible computation and concatenative pattern matching. (Daniel Ehrenberg)
  • ldap: OpenLDAP library binding. (Elie Chaftari)
  • locals: Efficient lexically scoped locals, closures, and local words.
  • mortar: Experimental message-passing object system. (Eduardo Cavazos)
  • openssl: OpenSSL library binding. (Elie Chaftari)
  • pack: Utility for reading and writing binary data. (Doug Coleman)
  • pdf: Haru PDF library binding. (Elie Chaftari)
  • qualified: Refer to words from another vocabulary without adding the entire vocabulary to the search path. (Daniel Ehrenberg)
  • roman: Reading and writing Roman numerals. (Doug Coleman)
  • scite: SciTE editor integration. (Clemens Hofreither)
  • smtp: SMTP client with support for CRAM-MD5 authentication. (Elie Chaftari, Dirk Vleugels)
  • tuple-arrays: Space-efficient packed tuple arrays. (Daniel Ehrenberg)
  • unicode: major new functionality added. (Daniel Ehrenberg)

Performance
  • The curry word now runs in constant time, and curried quotations can be called from compiled code; this allows for abstractions and idioms which were previously impractical due to performance issues. In particular, words such as each-with and map-with are gone; each-with can now be written as curry* each, and similarly for other -with combinators.
  • Improved generational promotion strategy in garbage collector reduces the amount of junk which makes its way into tenured space, which in turn reduces the frequency of full garbage collections.
  • Faster generic word dispatch and union membership testing.
  • Alien memory accessors are compiled as intrinsics where possible, which improves performance in code which iteroperates with C libraries.
Platforms
  • Networking support added for Windows CE. (Doug Coleman)
  • UDP/IP networking support added for all Windows platforms. (Doug Coleman)
  • Solaris/x86 fixes. (Samuel Tardieu)
  • Linux/AMD64 port works again.

A big thank you to all the contributors and testers who made this release possible.