Factor: a practical stack language

Saturday, September 18, 2010

Factor 0.94 now available

Factor 0.94 is now available from the Factor website, five months after the previous release, Factor 0.93. Binaries are provided for 10 platforms.

As usual, contributors did most of the work. Thanks to Daniel Ehrenberg, Dmitry Shubin, Doug Coleman, Erik Charlebois, Joe Groff, John Benediktsson, Jose A. Ortega Ruiz, Niklas Waern, Samuel Tardieu, Sascha Matzke and everyone else who helped out this time around!

Incompatible changes:

The PowerPC architecture is no longer supported. (Slava Pestov)
The require-when word now supports dependencies on multiple vocabularies. (Daniel Ehrenberg)
The C-ENUM: word in the C library interface has been replaced with ENUM:, a much improved word for defining type-safe enumerations. (Erik Charlebois, Joe Groff)
Tuple slot setter words with stack effect ( value object -- ) are now named foo<< instead of (>>foo). Most code is unaffected since it uses the >>foo form. (Slava Pestov)
The older system-micros word, which returned microseconds since the Unix epoch as an integer, has been removed. For a while, the recommended way to get the current time has been to call the now word from the calendar vocabulary, which returned a timestamp instance. (Doug Coleman)
A few sequence-related words were moved from the generalizations vocabulary to sequences.generalizations. (Slava Pestov)
The alarms vocabulary has been renamed to timers to better explain its true purpose, with improved timing accuracy and robustness. (Doug Coleman)
cocoa.subclassing: the syntax for defining new Objective-C classes has been changed to improve readability. (Slava Pestov)
io.streams.limited: the ability to throw errors on EOF was extracted from limited streams, and limited streams simplified as a result. Throwing on EOF is now implemented by the io.streams.throwing vocabulary. (Doug Coleman)

New libraries:

boyer-moore: efficient text search algorithm (Dmitry Shubin)
checksums.internet: implementation of checksum algorithm used by ICMP for the checksums framework (John Benediktsson)
gdbm: disk-based hash library binding (Dmitry Shubin)
io.encodings.detect: binary file/text encoding detection heuristics from jEdit (Joe Groff)
javascriptcore: FFI to the WebKit JavaScript engine (Doug Coleman)
lua: FFI to the Lua scripting language (Erik Charlebois)
oauth: minimal implementation of client-side OAuth (Slava Pestov)
sequences.unrolled: efficient unrolled loops with constant iteration count (Joe Groff)

Improved libraries:

cuda: various improvements (Joe Groff, Doug Coleman)
game.input: now uses XInput2 on X11 (Niklas Waern)
gpu.render, gpu.buffers: various improvements (Joe Groff)
math.combinatorics: various improvements (John Benediktsson)
math.primes: various improvements (Samuel Tardieu)
math.vectors.simd.cords: compound "256-bit" SIMD types now support the full set of SIMD operations (Joe Groff)
mongodb: various improvements (Sascha Matzke)
twitter: now uses OAuth as required by Twitter (Slava Pestov)
unix.users, unix.groups: various improvements (Doug Coleman)

Compiler improvements:

Improved instruction selection, copy propagation, representation selection, and register allocation; details in a blog post. (Slava Pestov)
An instruction scheduling pass now runs prior to register allocation, intended to reduce register pressure by moving uses closer to definitions; details in a blog post. (Daniel Ehrenberg)
The code generation for the C library interface has been revamped; details in a blog post. (Slava Pestov)
Something similar to what C++ and C# programmers refer to as "value types"; binary data can now be allocated on the call stack, and passed to C functions like any other pointer. The with-out-parameters combinator replaces tricky code for allocating and juggling multiple temporary byte arrays used as out parameters for C function calls, making this idiom easier to read and more efficient. The with-scoped-allocation combinator presents a more general, lower-level interface. (Slava Pestov)
The compiler can now use the x87 floating point unit on older CPUs where SSE2 is not available. However, this is not recommended, because the build farm does not test Factor (or build any binaries) in x87 mode, so this support can break at any time. To use x87 code generation, you must download the source code and bootstrap Factor yourself, on a CPU without SSE2. (Slava Pestov)

Miscellaneous improvements:

fuel: Factor's Ultimate Emacs Library has seen many improvements, and also some keyboard shortcuts have changed; see the README. (Erik Charlebois, Dmitry Shubin, Jose A. Ortega Ruiz)
A new factor.cmd script is now included in the build-support directory, to automate the update/build/bootstrap cycle for those who build from source. Its functionality is a subset of the factor.sh script for Unix. (Joe Groff)
The default set of icons shipped in misc has been tweaked, with better contrast and improved appearance when scaled down. (Joe Groff)

Saturday, September 11, 2010

An overview of Factor's I/O library

Factor has grown a very powerful and high-level I/O library over the years, however it is easy to get lost in the forest of reference documentation surrounding the io vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for futher reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.

Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular Doug Coleman did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.

Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX we try to come up with clear and conherent APIs that make sense on both Windows and Unix.

First example: converting a text file from MacRoman to UTF8

The io.files vocabulary defines words for reading and writing files. It supports two modes of operation in a pretty standard fashion:

Stream-based: with-file-reader, with-file-writer
Entire file: file-contents, set-file-contents, file-lines, set-file-lines

What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.

Let's convert foo.txt from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:

USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;

"foo.txt" mac-roman file-contents
"foo.txt" utf8 set-file-contents

This is a very simple and concise implementation but it has the downside that the entire file is read into memory. For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:

USING: io io.encodings.8-bit.mac-roman io.encodings.utf8
io.files ;

"out.txt" utf8 [
    "in.txt" mac-roman [
        [ print ] each-line
    ] with-file-reader
] with-file-writer

Converting a directory full of files from MacRoman to UTF8

The io.files vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:

USING: io.directories io.encodings.8-bit.mac-roman
io.encodings.utf8 io.files ;

: convert-directory ( path -- )
    [
        [
            [ mac-roman file-contents ] keep
            utf8 set-file-contents
        ] each
    ] with-directory-files ;

An aside: generalizing the "current working directory"

If you run the following, you will see that with-directory-files returns relative, and not absolute, file names:

"/path/to/some/directory"
[ [ print ] each ] with-directory-files

So the question is, how did file-contents above know what directory to look for files in? The answer is that in addition to calling the quotation with the directory's contents, the with-directory-files word also rebinds the current-directory dynamic variable.

This directory is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically-scoped; within the quotation passed to the with-directory combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, or code after the quotation, is unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the normalize-pathname word before being handed off to the operating system.

When calling a shell command with io.launcher, the child process is run from the Factor current-directory so relative pathnames passed on the command line will just work. However, when making C FFI calls which take pathnames, you pass in absolute paths only, or normalize the path with normalize-path first, otherwise the C code wlll search for it in the wrong place.

Checking free disk space

The io.files.info vocabulary defines two words which return tuples containing information about a file, and the file system containing the file, respectively:

Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:

ERROR: buy-a-new-disk ;

: gb ( m -- n ) 30 2^ * ;

: check-space ( -- )
    home file-system-info free-space>> 10 gb <
    [ buy-a-new-disk ] when ;

Now if there is less than 10gb available, the check-space word will throw a buy-a-new-disk error.

The file-system-info word reports a bunch of other info. There is a Factor implementation of the Unix df command in the tools.files vocabulary:

( scratchpad ) file-systems.
+device-name+ +available-space+ +free-space+ +used-space+ +total-space+ +percent-used+ +mount-point+
/dev/disk0s2  15955816448       16217960448  183487713280 199705673728  91             /
fdesc         0                 0            1024         1024          100            /dev
fdesc         0                 0            1024         1024          100            /dev
map -hosts    0                 0            0            0             0              /net
map auto_home 0                 0            0            0             0              /home
/dev/disk1s2  15922262016       15922262016  383489052672 399411314688  96             /Users/slava

Doug has two blog posts about these features, part 1 and part 2.

Unix only: symbolic links

Factor knows about symbolic links on Unix. The io.files.links vocabulary defines a pair of words, make-link and make-hard-link. The link-info word is like file-info except it doesn't follow symbolic links. Finally, the directory hierarchy traversal words don't follow links, so a link cycle or bogus link to / somewhere won't break everything.

File system monitoring

The io.monitors vocabulary implements real-time file and directory change monitoring. Unfortunately at this point in time, it is only supported on Windows, Linux and Mac. Neither one of FreeBSD and OpenBSD exposes the necessary information to user-space.

Here is an example for watching a directory for changes, and logging them:

USE: io.monitors

: watch-loop ( monitor -- )
    dup next-change path>> print flush watch-loop ;

: watch-directory ( path -- )
    [ t [ watch-loop ] with-monitor ] with-monitors ;

Try pasting the above code into a Factor listener window, and then run home watch-directory. Every time a file in your home directory is modified, its full pathname will be printed in the listener.

Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.

Memory mapped files

The io.mmap vocabulary defines support for working with memory-mapped files. The highest-level and easiest to use interface is the with-mapped-array combinator. It takes a file name, a C type, and a quotation. The quotation can perform generic sequence operations on the mapped file.

Here is an example which reverses each group of 4 bytes:

USING: alien.c-types grouping io.mmap sequences
specialized-arrays ;
SPECIALIZED-ARRAY: char

"mydata.dat" char [
    4 <sliced-groups>
    [ reverse! drop ] each
] with-mapped-array

The <sliced-groups> word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.

A more efficient implementation of the above is also possible, by mapping in the file as an int array and then performing bitwise arithmetic on the elements.

Launching processes

Factor's io.launcher vocabulary was originally developed for use by the build farm. The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.

The central concept in the library is the process, tuple, constructed by calling <process>. Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. Then the process can be run in various ways, running in the foreground, in the background, or with input and output attached to Factor streams.

The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read or write from, a file name to append to, or even supply a pipe object, constructed from the io.pipes vocabulary.

<process>
    "rotate-logs" >>command
    +closed+ >>stdin
    "out.txt" >>stdout
    "error.log" <appender> >>stderr

It is possible to specify a timeout when running a process:

<process>
    { "ssh" "myhost" "-l" "jane" "do-calculation" } >>command
    15 minutes >>timeout
    "results.txt" >>stdout
run-process

The process will be killed if it runs for longer than the timeout period. Many other features are supported; setting environment variables, setting process priority, and so on. The io.launcher vocabulary has all the details.

Support for timeouts is a cross-cutting concern that touches many ports of the I/O API. This support is consolidated in the io.timeouts vocabulary. The set-timeout generic word is supported by all external resources which provide interruptible blocking operations.

Timeouts are implemented on top of our monotonic timer support; changing your system clock while Factor is running won't screw with active network connections.

Unix only: file ownership and permissions

The io.files.unix vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. An essential component of any multi-language quine:

USING: io.encodings.ascii io.files io.files.info.unix
io.launcher ;

"""
#!/bin/sh
echo "Hello, polyglot"
""" "script.sh" ascii set-file-contents
OCT: 755 "script.sh" set-file-permissions
"./script.sh" run-process

There are even more Unix-specific words in the unix.users and unix.groups vocabularies. Using these words enables listing all users on the system, converting user names to UIDs and back, and even setuid and setgid.

Networking

Factor's io.sockets vocabulary supports stream and packet-based networking.

Stream-based: with-client, <server>, accept
Packet-based: <datagram>, send, receive

Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. When a network socket is constructed, that endpoint is bound to a given address specifier.

Connecting to http://www.apple.com, sending a GET request, and reading the result:

USING: io io.encodings.utf8 io.sockets ;

"www.apple.com" 80 <inet> utf8 [
    """GET / HTTP/1.1\r
host: www.apple.com\r
connection: close\r
\r
""" write flush
    contents
] with-client
print

SSL support is almost transparent; the only difference is that the address specifier is wrapped in <secure>:

USING: io io.encodings.utf8 io.sockets
io.sockets.secure ;

"www.cia.gov" 443 <inet> <secure> utf8 [
    """GET / HTTP/1.1\r
host: www.cia.gov\r
connection: close\r
\r
""" write flush
    contents
] with-client
print

For details, see the io.sockets.secure documentation, and my blog post about SSL in Factor..

Of course you'd never send HTTP requests directly using sockets; instead you'd use the http.client vocabulary.

Network servers

Factor's io.servers.connection vocabulary is so cool, that a couple of years back I made a screencast about it. Nowadays the sample application developed in that screencast is in the extra/time-server; the implementation is very concise and elegant.

Under the hood

All of this functionality is implemented in pure Factor code on top of our excellent C library interface and extensive bindings to POSIX and Win32 in the unix and windows vocabulary hierarchies, respectively.

As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently rewrote the coroutines implementation to use direct context switching rather than continuations.

Co-ordination and scheduling of coroutines is handled with a series of simple concurrency abstractions.

Sunday, September 05, 2010

Making Factor's continuous build system more robust

I've done some work on Factor's continuous build system over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from factorcode.org was built, tested and benchmarked by mason.

Checking for disk space

Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. If less than 1 Gb is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the latest Mac OS X report. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.

I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform file-system-info library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.

Git repository corruption

Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error:

From git://factorcode.org/git/factor
 * branch            master     -> FETCH_HEAD
Updating d386ea7..eece1e3

error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.

Of course nobody actually edits files in the repository in question, its a clone of the official repo that gets updated every 5 minutes. Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.

If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.

Error e-mail throttling

Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to Factor-builds. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the Factor-builds list would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.

More robust new code check

Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).

Build system dashboard

I've put together a simple dashboard page showing build system status. Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days after, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.

The dashboard nicely complements my earlier work on live status display for the build farm.

Conclusion

I think mason is one of the most advanced continuous integration systems among open source language implementations, nevermind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. Here is a selection of the functionality from Factor's standard library used by mason:

io.files.info - checking disk usage
io.launcher - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested
io.timeouts - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy
http.client - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature
smtp - sending build report e-mails
twitter - Tweeting binary upload notifications
oauth - yes, Factor has a library to support the feared OAuth. Everyone complains about how hard OAuth is, but if you have easy to use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.
xml.syntax - constructing HTML-formatted build reports using XML literal syntax

Friday, September 03, 2010

Two things every Unix developer should know

Unix programming can be tricky. There are many subtleties many developers are not aware of. In this post, I will describe just two of them... my favorite Unix quirks, if you will.

Interruptible system calls

On Unix, any system call which blocks can potentially fail with an errno of EINTR, which indicates that the caller must retry the system call. The EINTR error can be raised at any time for any reason, so essentially every I/O operation on a Unix system must be prepared to handle this error properly. Surprisingly to some, this includes the C standard library functions such as fread(), fwrite(), and so on.

For example, if you are writing a network server, then most of the time, you want to ignore the SIGPIPE signal which is raised when the client closes its end of a socket. However, this ignored signal can cause some pending I/O in the server to return EINTR.

A commonly-held belief is that setting the SA_RESTART flag with the sigaction() system call means that if that signal is delivered, system calls are restarted for you and EINTR doesn't need to be handled. Unfortunately this is not true. The reason is that certain signals are unmaskable. For instance, on Mac OS X, if your process is blocking reading on standard input, and the user suspends the program by sending it SIGSTOP (usually by pressing ^Z in the terminal), then upon resumption, your read() call will immediately fail with EINTR.

Don't believe me? The Mac OS X cat program is not actually interrupt-safe, and has this bug. Run cat with no arguments in a terminal, press ^Z, then type %1, and you'll get an error from cat!

$ cat
^Z
[1]+  Stopped                 cat
$ %1
cat
cat: stdin: Interrupted system call
$

As far as I'm aware, Factor properly handles interruptible system calls, and has for a while now, thanks to Doug Coleman explaining the issue to me 4 years ago. Not having to deal with crap like this (not to mention being able to write cross-platform code that runs on both Unix and Windows) is one of the advantages of using a high-level language like Factor or Java over C.

Subprocesses inherit semi-random things from the parent process

When you fork() your process, various things are copied from the parent to the child; environment variables, file descriptors, the ignored signal mask, and so on. Less obvious is the fact that exec() doesn't reset everything. If shared file descriptors such as stdin and stdout were set to non-blocking in the parent, the child will start with these descriptors non-blocking also, which will most likely break most programs. I've blogged about this problem before.

A similar issue is that if you elect to ignore certain signals with the SIG_IGN action using sigaction(), then subprocesses will inherit this behavior. Again, this can break processes. Until yesterday, Factor would ignore SIGPIPE using this mechanism, and child processes spawned with the io.launcher vocabulary that expected to receive SIGPIPE would not work properly. There are various workarounds; you can reset the signal mask before the exec() call, or you can do what I did in Factor, and ignore the signal by giving it an empty signal handler instead of a SIG_IGN action.

Wednesday, July 28, 2010

Overhauling Factor's C library interface

For the last while I've been working on an overhaul of Factor's C library interface and associated compiler infrastructure. The goals of the rewrite were three-fold:

Improving performance
Cleaning up some crusty old code that dates back to my earliest experiments with native code generation
Laying the groundwork for future compiler improvements

These changes were all behind-the-scenes; I did not introduce any new functionality or language changes, and I think Factor's FFI is already quite stable and feature-complete.

The previous FFI implementation

I started work on Factor's C library interface (FFI) almost immediately after I bootstrapped the native implementation off of JFactor. I began experimenting with an SDL-based UI early on and immediately decided I wanted to have a real FFI, instead of extending the VM with primitives which wrap C functions by hand.

Over time, both the FFI and compiler have evolved in parallel, but there was little integration between them, other than the fact that both used the same assembler DSL under the hood. The result is that FFI calls were generated in a somewhat inefficient way. Since the optimizing compiler had no knowledge of how they work, it would save all values to the data stack first. Then the FFI code generator would take over; it would pop the input parameters one by one, pass them to a per-type C function in the Factor VM which unboxed the value, then stored the value in the stack frame. Finally, the target C function would be invoked, then another C function in the VM would box the return value, and the return value would be pushed to the stack. The optimizing compiler would then take over, possibily generating code to pop the value from the stack.

The redundant stack traffic was wasteful. In some cases, the optimizing compiler would generate code to box a value and push it to the stack, only to have the FFI then generate code to pop it and unbox it immediately after. To make matters worse, over time the optimizing compiler gained the ability to box and unbox values with open-coded assembly sequences, but the FFI would still make calls into the VM to do it.

All in all, it was about time I rewrote the FFI, modernizing it and integrating it with the rest of the compiler in the process.

Factoring FFI calls into simpler instructions

Most low-level IR instructions are very simple; FFI calls used to be the exception. Now I've split these up into smaller instructions. Parameters and return values are now read and written from the stack with the same ##peek and ##replace instructions that everything else uses, and boxing and unboxing parameters and return values is now done with the representation selection pass. A couple of oddball C types, such as long long on x86-32, still require VM calls to box and unbox, and I added new instructions for those.

One slightly tricky thing that came up was that I had to re-engineer low-level IR to support instructions with multiple output values. This is required for C calls which return structs and long long types by value, since each word-size chunk of the return value is returned in its own register, and these chunks have to be re-assembled later. In the future, I will be able to use this support to add instructions such as the x86 division instruction, which computes x / y and x mod y simultaneously.

I also had to change low-level IR to distinguish between instructions with side effects and those without. Previously, optimization passes would assume any instruction with an output value did not have side effects, and could be deleted if the output value was not used. This is no longer true for C calls; a C function might both have a side effect and return a value.

GC maps

Now that FFI calls no longer force the optimizer to sync all live values to the data and retain stacks, it can happen that SSA values are live across an FFI call. These values get spilled to the call stack by the register allocator. Spilling to the call stack is cheaper than spilling to the data and retain stacks, because floating point values and integers do not need to be boxed, and since the spilling is done later in the optimization process, optimizations can proceed across the call site instead of being stumped by pushes and pops on either side.

However, since FFI calls can invoke callbacks, which in turn run Factor code, which can trigger a garbage collection, the garbage collector must be able to identify spill slots in call frames which contain tagged pointers.

The technique I'm using is to record "GC maps". This idea comes from a paper titled Compiler Support for Garbage Collection in a Statically Typed Language (Factor is dynamically typed, and the paper itself doesn't have anything specific to static typing in it, so I found the title a bit odd). The Java HotSpot VM uses the same technique. The basic idea is that for every call site, you record a bitmap indicating which spill slots contain tagged pointers. This information is then stored in a per-code-block map, where the keys are return addresses and the values are these bitmaps.

In the future, I intend on using GC maps at call sites of Factor words as well, instead of spilling temporary values to the retain stack; then I can eliminate the retain stack altogether, freeing up a register. After this is done the data stack will only be used to pass parameters between words, and not to store temporaries within a word. This will allow more values to be unboxed in more situations, and it will improve accuracy of compiler analyses.

In fact, getting GC maps worked out was my primary motivation for this FFI rewrite; the code cleanups and performance improvements were just gravy.

Callback support improvements

Another one of those things that only makes sense when you look at how Factor evolved is that the body of an FFI callback was compiled with the non-optimizing compiler, rather than the optimizing compiler. It used to be that only certain definitions could be optimized, because static stack effects were optional and there were many common idioms which did not have static stack effects. It has been more than a year since I undertook the engineering effort to make the compiler enforce static stack safety, and the implementation of callbacks was the last vestigial remnant from the bad old days.

This design made callbacks harder to debug than they should be; if you used up too many parameters, or forgot to push a return value, you'd be greeted with a runtime error instead of a compiler error. Now this has been fixed, and callbacks are fully compiled with the optimizing compiler.

Friday, July 02, 2010

Factor talk in Boston, July 26th

I will be presenting Factor at the Boston Lisp Users' Group on July 26th, 2010. Details in François-René Rideau's announcement. I'm also going to be giving a quick talk at the Emerging Languages Camp in Portland, July 22nd. Unfortunately registration for this camp is already full.

Saturday, May 29, 2010

Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython

Together with Daniel Ehrenberg and Joe Groff, I'm writing a paper about Factor for DLS2010. We would appreciate feedback about the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and Python. The performance comparison consists of some benchmarks from the The Computer Language Benchmarks Game. I'm posting the results here first, in case there's something really stupid here.

Language implementations

Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on an 2.4 GHz Intel Core 2 Duo.

Benchmark implementations

Factor implementations of the benchmarks can be found in our source repository:

Implementations for the other languages can be found at the language benchmark game CVS repository:

	LuaJIT	SBCL	V8	CPython
binary-trees	binarytrees.lua-2.lua	binarytrees.sbcl	binarytrees.javascript	binarytrees.python3-6.python3
fasta	fasta.lua	fasta.sbcl	fasta.javascript-2.javascript	fasta.python3-2.python3
knucleotide	knucleotide.lua-2.lua	knucleotide.sbcl-3.sbcl	knucleotide.javascript-3.javascript	knucleotide.python3-4.python3
nbody	nbody.lua-2.lua	nbody.sbcl	nbody.javascript	nbody.python3-4.python3
regex-dna		regexdna.sbcl-3.sbcl	regexdna.javascript	regexdna.python3
reverse-complement	revcomp.lua	revcomp.sbcl	revcomp.javascript-2.javascript	revcomp.python3-4.python3
spectral-norm	spectralnorm.lua	spectralnorm.sbcl-3.sbcl	spectralnorm.javascript	spectralnorm.python3-5.python3

In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:

--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@
 
 (defun main ()
   (declare (optimize (speed 3) (safety 0)))
-  (with-open-file (in "/dev/stdin" :element-type +ub+)
-    (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+  (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
       (let ((i-buf (make-array +buffer-size+ :element-type +ub+))
             (o-buf (make-array +buffer-size+ :element-type +ub+))
             (chunks nil))
@@ -72,4 +71,4 @@
                         (setf start 0)
                         (go read-chunk))))
            end-of-input
-             (flush-chunks)))))))
+             (flush-chunks))))))

Running the benchmarks

I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:

./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd

For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:

time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000

For SBCL, I did what the shootout does, and compiled each file into a new core:

ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .

cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
  (load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF

sbcl --userinit /dev/null --load nbody.sbcl_compile

cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF

time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000

For CPython, I precompiled each script into bytecode first:

python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"

Benchmark results

All running times are wall clock time from the Unix time command. I ran each benchmark 5 times and used the best result.

	Factor	LuaJIT	SBCL	V8	CPython
fasta	2.597s	1.689s	2.105s	3.948s	35.234s
reverse-complement	2.377s	1.764s	2.955s	3.884s	1.669s
nbody	0.393s	0.604s	0.402s	4.569s	37.086s
binary-trees	1.764s	6.295s	1.349s	2.119s	19.886s
spectral-norm	1.377s	1.358s	2.229s	12.227s	1m44.675s
regex-dna	0.990s	N/A	0.973s	0.166s	0.874s
knucleotide	1.820s	0.573s	0.766s	1.876s	1.805s

Benchmark analysis

Some notes on the results:

There is no Lua implementation of the regex-dna benchmark.
Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.
Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.
The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.
Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.
The benchmarks exercise insufficiently-many language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout mostly consist of scalar floating point code, and text manipulation only.

Conclusions

Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.