Saturday, September 18, 2010

Factor 0.94 now available

Factor 0.94 is now available from the Factor website, five months after the previous release, Factor 0.93. Binaries are provided for 10 platforms.

As usual, contributors did most of the work. Thanks to Daniel Ehrenberg, Dmitry Shubin, Doug Coleman, Erik Charlebois, Joe Groff, John Benediktsson, Jose A. Ortega Ruiz, Niklas Waern, Samuel Tardieu, Sascha Matzke and everyone else who helped out this time around!

Incompatible changes:

  • The PowerPC architecture is no longer supported. (Slava Pestov)
  • The require-when word now supports dependencies on multiple vocabularies. (Daniel Ehrenberg)
  • The C-ENUM: word in the C library interface has been replaced with ENUM:, a much improved word for defining type-safe enumerations. (Erik Charlebois, Joe Groff)
  • Tuple slot setter words with stack effect ( value object -- ) are now named foo<< instead of (>>foo). Most code is unaffected since it uses the >>foo form. (Slava Pestov)
  • The older system-micros word, which returned microseconds since the Unix epoch as an integer, has been removed. For a while, the recommended way to get the current time has been to call the now word from the calendar vocabulary, which returns a timestamp instance. (Doug Coleman)
  • A few sequence-related words were moved from the generalizations vocabulary to sequences.generalizations. (Slava Pestov)
  • The alarms vocabulary has been renamed to timers to better explain its true purpose, with improved timing accuracy and robustness. (Doug Coleman)
  • cocoa.subclassing: the syntax for defining new Objective-C classes has been changed to improve readability. (Slava Pestov)
  • io.streams.limited: the ability to throw errors on EOF was extracted from limited streams, and limited streams simplified as a result. Throwing on EOF is now implemented by the io.streams.throwing vocabulary. (Doug Coleman)

New libraries:

  • boyer-moore: efficient text search algorithm (Dmitry Shubin)
  • checksums.internet: implementation of checksum algorithm used by ICMP for the checksums framework (John Benediktsson)
  • gdbm: disk-based hash library binding (Dmitry Shubin)
  • io.encodings.detect: binary file/text encoding detection heuristics from jEdit (Joe Groff)
  • javascriptcore: FFI to the WebKit JavaScript engine (Doug Coleman)
  • lua: FFI to the Lua scripting language (Erik Charlebois)
  • oauth: minimal implementation of client-side OAuth (Slava Pestov)
  • sequences.unrolled: efficient unrolled loops with constant iteration count (Joe Groff)
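
The checksum that checksums.internet implements is the classic RFC 1071 ones'-complement sum used by ICMP (and by the IP, TCP and UDP headers). A rough Python sketch of the algorithm, not the Factor code itself:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum, as used by ICMP."""
    if len(data) % 2:                # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # sum 16-bit big-endian words
    while total >> 16:                           # fold carry bits back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                       # ones' complement of the sum
```

A packet whose checksum field holds this value sums to 0xFFFF overall, which is how receivers verify it.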

Improved libraries:

Compiler improvements:

  • Improved instruction selection, copy propagation, representation selection, and register allocation; details in a blog post. (Slava Pestov)
  • An instruction scheduling pass now runs prior to register allocation, intended to reduce register pressure by moving uses closer to definitions; details in a blog post. (Daniel Ehrenberg)
  • The code generation for the C library interface has been revamped; details in a blog post. (Slava Pestov)
  • Something similar to what C++ and C# programmers refer to as "value types"; binary data can now be allocated on the call stack, and passed to C functions like any other pointer. The with-out-parameters combinator replaces tricky code for allocating and juggling multiple temporary byte arrays used as out parameters for C function calls, making this idiom easier to read and more efficient. The with-scoped-allocation combinator presents a more general, lower-level interface. (Slava Pestov)
  • The compiler can now use the x87 floating point unit on older CPUs where SSE2 is not available. However, this is not recommended, because the build farm does not test Factor (or build any binaries) in x87 mode, so this support can break at any time. To use x87 code generation, you must download the source code and bootstrap Factor yourself, on a CPU without SSE2. (Slava Pestov)

Miscellaneous improvements:

  • fuel: Factor's Ultimate Emacs Library has seen many improvements, and also some keyboard shortcuts have changed; see the README. (Erik Charlebois, Dmitry Shubin, Jose A. Ortega Ruiz)
  • A new factor.cmd script is now included in the build-support directory, to automate the update/build/bootstrap cycle for those who build from source. Its functionality is a subset of the factor.sh script for Unix. (Joe Groff)
  • The default set of icons shipped in misc has been tweaked, with better contrast and improved appearance when scaled down. (Joe Groff)

Saturday, September 11, 2010

An overview of Factor's I/O library

Factor has grown a very powerful, high-level I/O library over the years; however, it is easy to get lost in the forest of reference documentation surrounding the io vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for further reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.

Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular Doug Coleman did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.

Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX, we try to come up with clear and coherent APIs that make sense on both Windows and Unix.

First example: converting a text file from MacRoman to UTF8

The io.files vocabulary defines words for reading and writing files. It supports two modes of operation in a pretty standard fashion: whole-file words such as file-contents and set-file-contents, and stream-based combinators such as with-file-reader and with-file-writer.

What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.

Let's convert foo.txt from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:

USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;

"foo.txt" mac-roman file-contents
"foo.txt" utf8 set-file-contents

This is a very simple and concise implementation but it has the downside that the entire file is read into memory. For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:

USING: io io.encodings.8-bit.mac-roman io.encodings.utf8
io.files ;

"out.txt" utf8 [
    "in.txt" mac-roman [
        [ print ] each-line
    ] with-file-reader
] with-file-writer

Converting a directory full of files from MacRoman to UTF8

The io.files vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:

USING: io.directories io.encodings.8-bit.mac-roman
io.encodings.utf8 io.files ;

: convert-directory ( path -- )
    [
        [
            [ mac-roman file-contents ] keep
            utf8 set-file-contents
        ] each
    ] with-directory-files ;

An aside: generalizing the "current working directory"

If you run the following, you will see that with-directory-files returns relative, and not absolute, file names:

"/path/to/some/directory"
[ [ print ] each ] with-directory-files

So the question is, how did file-contents above know what directory to look for files in? The answer is that in addition to calling the quotation with the directory's contents, the with-directory-files word also rebinds the current-directory dynamic variable.

This directory is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically-scoped; within the quotation passed to the with-directory combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, and code after the quotation, are unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the normalize-pathname word before being handed off to the operating system.
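
Python's contextvars module offers a close analogue of this kind of dynamically-scoped, per-coroutine variable. Here's a sketch of the idea; the names with_directory and normalize_pathname are made up for illustration, they are not Factor's API:

```python
import os.path
from contextlib import contextmanager
from contextvars import ContextVar

# A dynamically-scoped "current directory", independent of the
# process-wide working directory set by chdir().
current_directory = ContextVar("current_directory", default="/")

@contextmanager
def with_directory(path):
    resolved = os.path.normpath(os.path.join(current_directory.get(), path))
    token = current_directory.set(resolved)
    try:
        yield
    finally:
        current_directory.reset(token)   # code after the block is unaffected

def normalize_pathname(path):
    """Resolve a relative path against the dynamically-scoped directory."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(current_directory.get(), path))
```

Because each asyncio task gets its own context, tasks running concurrently see their own binding, much like Factor coroutines do.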

When calling a shell command with io.launcher, the child process is run from the Factor current-directory, so relative pathnames passed on the command line will just work. However, when making C FFI calls which take pathnames, you must pass absolute paths, or normalize the path with normalize-path first; otherwise the C code will search for it in the wrong place.

Checking free disk space

The io.files.info vocabulary defines two words, file-info and file-system-info, which return tuples containing information about a file, and about the file system containing it, respectively.

Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:

ERROR: buy-a-new-disk ;

: gb ( m -- n ) 30 2^ * ;

: check-space ( -- )
    home file-system-info free-space>> 10 gb <
    [ buy-a-new-disk ] when ;

Now if there is less than 10 GB available, the check-space word will throw a buy-a-new-disk error.

The file-system-info word reports a bunch of other info. There is a Factor implementation of the Unix df command in the tools.files vocabulary:

( scratchpad ) file-systems.
+device-name+  +available-space+  +free-space+  +used-space+   +total-space+  +percent-used+  +mount-point+
/dev/disk0s2   15955816448        16217960448   183487713280   199705673728   91              /
fdesc          0                  0             1024           1024           100             /dev
fdesc          0                  0             1024           1024           100             /dev
map -hosts     0                  0             0              0              0               /net
map auto_home  0                  0             0              0              0               /home
/dev/disk1s2   15922262016        15922262016   383489052672   399411314688   96              /Users/slava

Doug has two blog posts about these features, part 1 and part 2.

Unix only: symbolic links

Factor knows about symbolic links on Unix. The io.files.links vocabulary defines a pair of words, make-link and make-hard-link. The link-info word is like file-info except it doesn't follow symbolic links. Finally, the directory hierarchy traversal words don't follow links, so a link cycle or bogus link to / somewhere won't break everything.

File system monitoring

The io.monitors vocabulary implements real-time file and directory change monitoring. Unfortunately, at this point in time it is only supported on Windows, Linux and Mac OS X; neither FreeBSD nor OpenBSD exposes the necessary information to user space.

Here is an example for watching a directory for changes, and logging them:

USE: io.monitors

: watch-loop ( monitor -- )
    dup next-change path>> print flush watch-loop ;

: watch-directory ( path -- )
    [ t [ watch-loop ] with-monitor ] with-monitors ;

Try pasting the above code into a Factor listener window, and then run home watch-directory. Every time a file in your home directory is modified, its full pathname will be printed in the listener.

Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.

Memory mapped files

The io.mmap vocabulary defines support for working with memory-mapped files. The highest-level and easiest to use interface is the with-mapped-array combinator. It takes a file name, a C type, and a quotation. The quotation can perform generic sequence operations on the mapped file.

Here is an example which reverses each group of 4 bytes:

USING: alien.c-types grouping io.mmap sequences
specialized-arrays ;
SPECIALIZED-ARRAY: char

"mydata.dat" char [
    4 <sliced-groups>
    [ reverse! drop ] each
] with-mapped-array

The <sliced-groups> word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.

A more efficient implementation of the above is also possible, by mapping in the file as an int array and then performing bitwise arithmetic on the elements.
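
That int-array version boils down to a 32-bit byte swap on each word. A hedged Python sketch of the idea, with Python standing in for the Factor specialized-array code:

```python
from array import array

def bswap32(word: int) -> int:
    """Reverse the four bytes of a 32-bit word with shifts and masks."""
    return (((word & 0x000000FF) << 24) |
            ((word & 0x0000FF00) << 8)  |
            ((word >> 8) & 0x0000FF00)  |
            ((word >> 24) & 0x000000FF))

def reverse_groups_of_4(data: bytes) -> bytes:
    words = array("I", data)      # view the buffer as 32-bit unsigned ints
    for i, w in enumerate(words):
        words[i] = bswap32(w)     # one bswap replaces four byte-at-a-time moves
    return words.tobytes()
```

Note that the result is independent of the machine's endianness: reading a word, swapping its bytes, and writing it back reverses the underlying 4-byte group either way.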

Launching processes

Factor's io.launcher vocabulary was originally developed for use by the build farm. The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.

The central concept in the library is the process tuple, constructed by calling <process>. Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. The process can then be run in various ways: in the foreground, in the background, or with input and output attached to Factor streams.

The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read or write from, a file name to append to, or even supply a pipe object, constructed from the io.pipes vocabulary.

<process>
    "rotate-logs" >>command
    +closed+ >>stdin
    "out.txt" >>stdout
    "error.log" <appender> >>stderr

It is possible to specify a timeout when running a process:

<process>
    { "ssh" "myhost" "-l" "jane" "do-calculation" } >>command
    15 minutes >>timeout
    "results.txt" >>stdout
run-process

The process will be killed if it runs for longer than the timeout period. Many other features are supported: setting environment variables, setting process priority, and so on. The io.launcher vocabulary has all the details.

Support for timeouts is a cross-cutting concern that touches many parts of the I/O API. This support is consolidated in the io.timeouts vocabulary. The set-timeout generic word is supported by all external resources which provide interruptible blocking operations.

Timeouts are implemented on top of our monotonic timer support; changing your system clock while Factor is running won't screw with active network connections.

Unix only: file ownership and permissions

The io.files.info.unix vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. An essential component of any multi-language quine:

USING: io.encodings.ascii io.files io.files.info.unix
io.launcher ;

"""
#!/bin/sh
echo "Hello, polyglot"
""" "script.sh" ascii set-file-contents
OCT: 755 "script.sh" set-file-permissions
"./script.sh" run-process

There are even more Unix-specific words in the unix.users and unix.groups vocabularies. Using these words enables listing all users on the system, converting user names to UIDs and back, and even setuid and setgid.

Networking

Factor's io.sockets vocabulary supports stream and packet-based networking.

Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. When a network socket is constructed, that endpoint is bound to a given address specifier.

Connecting to http://www.apple.com, sending a GET request, and reading the result:

USING: io io.encodings.utf8 io.sockets ;

"www.apple.com" 80 <inet> utf8 [
"""GET / HTTP/1.1\r
host: www.apple.com\r
connection: close\r
\r
""" write flush
contents
] with-client
print

SSL support is almost transparent; the only difference is that the address specifier is wrapped in <secure>:

USING: io io.encodings.utf8 io.sockets
io.sockets.secure ;

"www.cia.gov" 443 <inet> <secure> utf8 [
"""GET / HTTP/1.1\r
host: www.cia.gov\r
connection: close\r
\r
""" write flush
contents
] with-client
print

For details, see the io.sockets.secure documentation, and my blog post about SSL in Factor.

Of course you'd never send HTTP requests directly using sockets; instead you'd use the http.client vocabulary.

Network servers

Factor's io.servers.connection vocabulary is so cool that a couple of years back I made a screencast about it. Nowadays the sample application developed in that screencast lives in extra/time-server; the implementation is very concise and elegant.

Under the hood

All of this functionality is implemented in pure Factor code on top of our excellent C library interface and extensive bindings to POSIX and Win32 in the unix and windows vocabulary hierarchies, respectively.

As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently rewrote the coroutines implementation to use direct context switching rather than continuations.

Co-ordination and scheduling of coroutines is handled with a series of simple concurrency abstractions.

Sunday, September 05, 2010

Making Factor's continuous build system more robust

I've done some work on Factor's continuous build system over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from factorcode.org was built, tested and benchmarked by mason.

Checking for disk space

Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. If less than 1 Gb is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the latest Mac OS X report. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.
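
In Python, the same pre-flight check is a few lines with shutil.disk_usage; a sketch of the idea, not mason's actual code:

```python
import shutil

GB = 1 << 30  # 1 Gb threshold, matching mason's check

def enough_disk_space(path: str, minimum: int = GB) -> bool:
    """Check free space up front, rather than failing mid-checkout."""
    return shutil.disk_usage(path).free >= minimum
```
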

I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform file-system-info library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.

Git repository corruption

Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error:

From git://factorcode.org/git/factor
* branch master -> FETCH_HEAD
Updating d386ea7..eece1e3

error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.

Of course nobody actually edits files in the repository in question; it's a clone of the official repo that gets updated every 5 minutes. Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.

If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.

Error e-mail throttling

Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to Factor-builds. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the Factor-builds list would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.
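
The heuristic is easy to state as code. A sketch of the throttling logic (ErrorThrottle is a made-up name, not mason's API):

```python
class ErrorThrottle:
    """Go quiet for a while after two errors arrive close together."""

    def __init__(self, window=20 * 60, quiet=6 * 60 * 60):
        self.window = window        # 20 minutes, in seconds
        self.quiet = quiet          # 6 hours, in seconds
        self.last_sent = None
        self.quiet_until = 0.0

    def should_send(self, now):
        if now < self.quiet_until:          # inside a quiet period: drop it
            return False
        if self.last_sent is not None and now - self.last_sent < self.window:
            # second error within the window: send it, then go quiet
            self.quiet_until = now + self.quiet
        self.last_sent = now
        return True
```
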

More robust new code check

Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).

Build system dashboard

I've put together a simple dashboard page showing build system status. Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days after, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.

The dashboard nicely complements my earlier work on live status display for the build farm.

Conclusion

I think mason is one of the most advanced continuous integration systems among open source language implementations, never mind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. Here is a selection of the functionality from Factor's standard library used by mason:

  • io.files.info - checking disk usage
  • io.launcher - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested
  • io.timeouts - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy
  • http.client - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature
  • smtp - sending build report e-mails
  • twitter - Tweeting binary upload notifications
  • oauth - yes, Factor has a library to support the feared OAuth. Everyone complains about how hard OAuth is, but if you have easy to use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.
  • xml.syntax - constructing HTML-formatted build reports using XML literal syntax

Friday, September 03, 2010

Two things every Unix developer should know

Unix programming can be tricky. There are many subtleties that developers are often unaware of. In this post, I will describe just two of them... my favorite Unix quirks, if you will.

Interruptible system calls

On Unix, any system call which blocks can potentially fail with an errno of EINTR, which indicates that the caller must retry the system call. The EINTR error can be raised at any time for any reason, so essentially every I/O operation on a Unix system must be prepared to handle this error properly. Surprisingly to some, this includes the C standard library functions such as fread(), fwrite(), and so on.
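
In C this means wrapping every blocking call in a retry loop. The same pattern in Python (illustrative only: since PEP 475, Python 3.5+ retries EINTR for you inside os.read and friends):

```python
import os

def retrying_read(fd, count):
    """Retry a read that fails with EINTR, as every Unix I/O loop must."""
    while True:
        try:
            return os.read(fd, count)
        except InterruptedError:   # errno == EINTR: a signal arrived; try again
            continue
```
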

For example, if you are writing a network server, then most of the time, you want to ignore the SIGPIPE signal which is raised when the client closes its end of a socket. However, this ignored signal can cause some pending I/O in the server to return EINTR.

A commonly-held belief is that setting the SA_RESTART flag with the sigaction() system call means that if that signal is delivered, system calls are restarted for you and EINTR doesn't need to be handled. Unfortunately this is not true. The reason is that certain signals are unmaskable. For instance, on Mac OS X, if your process is blocked reading from standard input, and the user suspends the program by sending it SIGSTOP (usually by pressing ^Z in the terminal), then upon resumption your read() call will immediately fail with EINTR.

Don't believe me? The Mac OS X cat program is not actually interrupt-safe, and has this bug. Run cat with no arguments in a terminal, press ^Z, then type %1, and you'll get an error from cat!

$ cat
^Z
[1]+ Stopped cat
$ %1
cat
cat: stdin: Interrupted system call
$

As far as I'm aware, Factor properly handles interruptible system calls, and has for a while now, thanks to Doug Coleman explaining the issue to me 4 years ago. Not having to deal with crap like this (not to mention being able to write cross-platform code that runs on both Unix and Windows) is one of the advantages of using a high-level language like Factor or Java over C.

Subprocesses inherit semi-random things from the parent process

When you fork() your process, various things are copied from the parent to the child; environment variables, file descriptors, the ignored signal mask, and so on. Less obvious is the fact that exec() doesn't reset everything. If shared file descriptors such as stdin and stdout were set to non-blocking in the parent, the child will start with these descriptors non-blocking also, which will most likely break most programs. I've blogged about this problem before.
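
One defensive measure is to clear O_NONBLOCK on the shared descriptors before calling exec(). A Python sketch of that fix (Unix-only, using the fcntl module):

```python
import fcntl
import os

def clear_nonblocking(fd):
    """Ensure a child spawned with this fd won't inherit O_NONBLOCK."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
```
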

A similar issue is that if you elect to ignore certain signals with the SIG_IGN action using sigaction(), then subprocesses will inherit this behavior. Again, this can break processes. Until yesterday, Factor would ignore SIGPIPE using this mechanism, and child processes spawned with the io.launcher vocabulary that expected to receive SIGPIPE would not work properly. There are various workarounds; you can reset the signal mask before the exec() call, or you can do what I did in Factor, and ignore the signal by giving it an empty signal handler instead of a SIG_IGN action.
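
The Python equivalent of that fix looks like this: install a do-nothing handler rather than SIG_IGN. A handler-based disposition is reset to SIG_DFL across exec(), while SIG_IGN is inherited by the child (Unix-only sketch):

```python
import signal

def ignore_sigpipe():
    """Swallow SIGPIPE here without making exec()'d children inherit SIG_IGN."""
    signal.signal(signal.SIGPIPE, lambda signum, frame: None)
```
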

Wednesday, July 28, 2010

Overhauling Factor's C library interface

For the last while I've been working on an overhaul of Factor's C library interface and associated compiler infrastructure. The goals of the rewrite were three-fold:

  • Improving performance
  • Cleaning up some crusty old code that dates back to my earliest experiments with native code generation
  • Laying the groundwork for future compiler improvements

These changes were all behind-the-scenes; I did not introduce any new functionality or language changes, and I think Factor's FFI is already quite stable and feature-complete.

The previous FFI implementation

I started work on Factor's C library interface (FFI) almost immediately after I bootstrapped the native implementation off of JFactor. I began experimenting with an SDL-based UI early on and immediately decided I wanted to have a real FFI, instead of extending the VM with primitives which wrap C functions by hand.

Over time, both the FFI and compiler have evolved in parallel, but there was little integration between them, other than the fact that both used the same assembler DSL under the hood. The result is that FFI calls were generated in a somewhat inefficient way. Since the optimizing compiler had no knowledge of how they work, it would save all values to the data stack first. Then the FFI code generator would take over; it would pop the input parameters one by one, pass them to a per-type C function in the Factor VM which unboxed the value, then stored the value in the stack frame. Finally, the target C function would be invoked, then another C function in the VM would box the return value, and the return value would be pushed to the stack. The optimizing compiler would then take over, possibly generating code to pop the value from the stack.

The redundant stack traffic was wasteful. In some cases, the optimizing compiler would generate code to box a value and push it to the stack, only to have the FFI then generate code to pop it and unbox it immediately after. To make matters worse, over time the optimizing compiler gained the ability to box and unbox values with open-coded assembly sequences, but the FFI would still make calls into the VM to do it.

All in all, it was about time I rewrote the FFI, modernizing it and integrating it with the rest of the compiler in the process.

Factoring FFI calls into simpler instructions

Most low-level IR instructions are very simple; FFI calls used to be the exception. Now I've split these up into smaller instructions. Parameters and return values are now read and written from the stack with the same ##peek and ##replace instructions that everything else uses, and boxing and unboxing parameters and return values is now done with the representation selection pass. A couple of oddball C types, such as long long on x86-32, still require VM calls to box and unbox, and I added new instructions for those.

One slightly tricky thing that came up was that I had to re-engineer low-level IR to support instructions with multiple output values. This is required for C calls which return structs and long long types by value, since each word-size chunk of the return value is returned in its own register, and these chunks have to be re-assembled later. In the future, I will be able to use this support to add instructions such as the x86 division instruction, which computes x / y and x mod y simultaneously.

I also had to change low-level IR to distinguish between instructions with side effects and those without. Previously, optimization passes would assume any instruction with an output value did not have side effects, and could be deleted if the output value was not used. This is no longer true for C calls; a C function might both have a side effect and return a value.

GC maps

Now that FFI calls no longer force the optimizer to sync all live values to the data and retain stacks, it can happen that SSA values are live across an FFI call. These values get spilled to the call stack by the register allocator. Spilling to the call stack is cheaper than spilling to the data and retain stacks, because floating point values and integers do not need to be boxed, and since the spilling is done later in the optimization process, optimizations can proceed across the call site instead of being stumped by pushes and pops on either side.

However, since FFI calls can invoke callbacks, which in turn run Factor code, which can trigger a garbage collection, the garbage collector must be able to identify spill slots in call frames which contain tagged pointers.

The technique I'm using is to record "GC maps". This idea comes from a paper titled Compiler Support for Garbage Collection in a Statically Typed Language (Factor is dynamically typed, and the paper itself doesn't have anything specific to static typing in it, so I found the title a bit odd). The Java HotSpot VM uses the same technique. The basic idea is that for every call site, you record a bitmap indicating which spill slots contain tagged pointers. This information is then stored in a per-code-block map, where the keys are return addresses and the values are these bitmaps.
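
A toy model of the data structure: per call site, a bitmap recording which spill slots hold tagged pointers, keyed by return address. The names are made up, and Python stands in for the VM's implementation language:

```python
def encode_gc_map(tagged_slots, slot_count):
    """Bitmap with bit i set iff spill slot i holds a tagged pointer."""
    bitmap = 0
    for slot in tagged_slots:
        assert slot < slot_count
        bitmap |= 1 << slot
    return bitmap

def scan_frame(spill_slots, gc_maps, return_address):
    """What the GC does per frame: visit only the tagged spill slots,
    skipping raw floats and untagged integers."""
    bitmap = gc_maps[return_address]
    return [value for i, value in enumerate(spill_slots) if bitmap >> i & 1]
```
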

In the future, I intend on using GC maps at call sites of Factor words as well, instead of spilling temporary values to the retain stack; then I can eliminate the retain stack altogether, freeing up a register. After this is done the data stack will only be used to pass parameters between words, and not to store temporaries within a word. This will allow more values to be unboxed in more situations, and it will improve accuracy of compiler analyses.

In fact, getting GC maps worked out was my primary motivation for this FFI rewrite; the code cleanups and performance improvements were just gravy.

Callback support improvements

Another one of those things that only makes sense when you look at how Factor evolved is that the body of an FFI callback was compiled with the non-optimizing compiler, rather than the optimizing compiler. It used to be that only certain definitions could be optimized, because static stack effects were optional and there were many common idioms which did not have static stack effects. It has been more than a year since I undertook the engineering effort to make the compiler enforce static stack safety, and the implementation of callbacks was the last vestigial remnant from the bad old days.

This design made callbacks harder to debug than they should have been; if a callback consumed too many parameters, or forgot to push a return value, you'd be greeted with a runtime error instead of a compile-time error. Now this has been fixed, and callbacks are fully compiled with the optimizing compiler.

Friday, July 02, 2010

Factor talk in Boston, July 26th

I will be presenting Factor at the Boston Lisp Users' Group on July 26th, 2010. Details in François-René Rideau's announcement. I'm also going to be giving a quick talk at the Emerging Languages Camp in Portland, July 22nd. Unfortunately registration for this camp is already full.

Saturday, May 29, 2010

Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython

Together with Daniel Ehrenberg and Joe Groff, I'm writing a paper about Factor for DLS2010. We would appreciate feedback about the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and Python. The performance comparison consists of some benchmarks from The Computer Language Benchmarks Game. I'm posting the results here first, in case there's something really stupid here.

Language implementations

Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on a 2.4 GHz Intel Core 2 Duo.

Benchmark implementations

Factor implementations of the benchmarks can be found in our source repository:

Implementations for the other languages can be found at the language benchmark game CVS repository:

  • binary-trees: binarytrees.lua-2.lua, binarytrees.sbcl, binarytrees.javascript, binarytrees.python3-6.python3
  • fasta: fasta.lua, fasta.sbcl, fasta.javascript-2.javascript, fasta.python3-2.python3
  • knucleotide: knucleotide.lua-2.lua, knucleotide.sbcl-3.sbcl, knucleotide.javascript-3.javascript, knucleotide.python3-4.python3
  • nbody: nbody.lua-2.lua, nbody.sbcl, nbody.javascript, nbody.python3-4.python3
  • regex-dna: (no Lua version), regexdna.sbcl-3.sbcl, regexdna.javascript, regexdna.python3
  • reverse-complement: revcomp.lua, revcomp.sbcl, revcomp.javascript-2.javascript, revcomp.python3-4.python3
  • spectral-norm: spectralnorm.lua, spectralnorm.sbcl-3.sbcl, spectralnorm.javascript, spectralnorm.python3-5.python3

In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:

--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@

(defun main ()
(declare (optimize (speed 3) (safety 0)))
- (with-open-file (in "/dev/stdin" :element-type +ub+)
- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
(let ((i-buf (make-array +buffer-size+ :element-type +ub+))
(o-buf (make-array +buffer-size+ :element-type +ub+))
(chunks nil))
@@ -72,4 +71,4 @@
(setf start 0)
(go read-chunk))))
end-of-input
- (flush-chunks)))))))
+ (flush-chunks))))))

Running the benchmarks

I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:

./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd

For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:

time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000

For SBCL, I did what the shootout does, and compiled each file into a new core:

ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .

cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
(load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF

sbcl --userinit /dev/null --load nbody.sbcl_compile

cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF

time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000

For CPython, I precompiled each script into bytecode first:

python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"

Benchmark results

All running times are wall clock time from the Unix time command. I ran each benchmark 5 times and used the best result.

Benchmark           Factor  LuaJIT  SBCL    V8       CPython
fasta               2.597s  1.689s  2.105s  3.948s   35.234s
reverse-complement  2.377s  1.764s  2.955s  3.884s   1.669s
nbody               0.393s  0.604s  0.402s  4.569s   37.086s
binary-trees        1.764s  6.295s  1.349s  2.119s   19.886s
spectral-norm       1.377s  1.358s  2.229s  12.227s  1m44.675s
regex-dna           0.990s  N/A     0.973s  0.166s   0.874s
knucleotide         1.820s  0.573s  0.766s  1.876s   1.805s

Benchmark analysis

Some notes on the results:

  • There is no Lua implementation of the regex-dna benchmark.
  • Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.
  • Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
  • The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.
  • The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.
  • Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
  • For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.
  • The benchmarks exercise too few language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout consist mostly of scalar floating point code and text manipulation.

Conclusions

Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.

Monday, May 10, 2010

A collection of small compiler improvements

One of the big optimizations I'm planning on implementing in Factor is support for computations on machine-word-sized integers. The motivation is as follows. While code that operates on small integers (fixnums) does not currently allocate memory for intermediate values, and as a result can compile very efficiently if type checks are eliminated, sometimes fixnum precision is not quite enough. Using bignums in algorithms such as SHA1 that require 32-bit or 64-bit arithmetic incurs a big performance hit over writing the code in a systems language. My plan is to have the compiler support machine integers as an intermediate step between fixnums and bignums. Machine integers would be boxed and unboxed at call boundaries, but tight loops operating on them will run in registers.

While I haven't yet implemented the above optimization, I've laid the groundwork and made it possible to represent operations on untagged integers at the level of compiler IR at least. While unboxed floats do not cause any complications for the GC, unboxed integers do, because they share a register file with tagged pointers. To make Factor's precise GC work, the compiler can now distinguish registers containing tagged pointers from those containing arbitrary integer data, by breaking down the register class of integer registers into two "representations", tagged and untagged.

Since the above reworking touched many different parts of the compiler, I also took the time to implement some minor optimizations and clean up some older code. These improvements will be the topic of this post.

To understand this post better, you should probably skim the list of low-level IR instructions defined in compiler.cfg.instructions first. Also, feel free to post comments here or e-mail me questions about the design of the Factor compiler.

Improvements to value numbering

In the x86 architecture, almost any instruction can take a memory operand, and memory operands take the form (base + index * scale + displacement), where base and index are registers, displacement is a 32-bit signed integer, and scale is 1, 2, 4 or 8. While using a complex addressing mode is not any faster than writing out the arithmetic by hand with a temporary register used to store the final address, it can reduce register pressure and code size.
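As a concrete illustration (my own, with made-up addresses), the effective address a complex addressing mode computes is just this arithmetic, folded into a single operand:

```python
# Sketch of an x86 effective-address computation (illustration only).
# scale must be 1, 2, 4 or 8; the displacement is a signed 32-bit constant.

def effective_address(base, index, scale, displacement):
    assert scale in (1, 2, 4, 8)
    assert -2**31 <= displacement < 2**31
    return base + index * scale + displacement

# Indexing element 3 of a float array: base + 3*4 plus a header offset of 8.
print(hex(effective_address(0x1000, 3, 4, 8)))  # 0x1014
```

Folding the multiply and adds into the memory operand is what frees the temporary register that would otherwise hold the computed address.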

The compiler now makes better use of complex addressing modes. Prior to optimization, the low level IR builder lowers specialized array access to the ##load-memory-imm and ##store-memory-imm instructions. For example, consider the following code:

USING: alien.c-types specialized-arrays ;
SPECIALIZED-ARRAY: float

: make-data ( -- seq ) 10 iota [ 3.5 + sqrt ] float-array{ } map-as ;

The inner loop consists of the following low level IR:

##integer>float 23 14
##sqrt 24 23
##load-integer 25 2
##slot-imm 26 15 2 7
##load-integer 27 4
##mul 28 14 27
##tagged>integer 29 26
##add-imm 30 29 7
##add 31 30 28
##store-memory-imm 24 31 0 float-rep f
##load-integer 32 1
##add 33 14 32

The ##mul, ##add-imm and ##add instructions compute the address input to ##store-memory-imm. After value numbering, these three instructions have been fused with ##store-memory-imm to form ##store-memory:

##integer>float 62 57
##sqrt 63 62
##slot-imm 65 48 2 7
##tagged>integer 68 65
##store-memory 63 68 57 2 7 float-rep f
##add-imm 72 57 1

Optimistic copy propagation

While the low-level optimizer's value numbering and alias analysis passes only operate on individual basic blocks (local optimizations), the copy propagation pass is global, meaning it takes the entire procedure being compiled into account.

Eventually I will merge the local alias analysis and value numbering passes with the copy propagation pass to get an optimization known as "global value numbering with redundant load elimination". One step along this road is a rewrite that makes the copy propagation pass optimistic, in the following sense.

Copy propagation concerns itself with eliminating ##copy instructions, which correspond to assignments from one SSA value to another (new) value:

y = x

When such an instruction is processed, all usages of the value y can be replaced with the value x, in every basic block, and the definition of y deleted.

A simple extension of the above treats a ##phi instruction whose inputs are all equal as a copy. This optimizes cases such as:

a = ...
if(...)
{
b = a
}
c = phi(b,a)

The case of ##phi instructions where one of the inputs is carried by a back edge in the control flow graph (ie, a loop) is where the optimistic assumption comes in. Consider the following code, where the definition of x'' is carried by a back edge:

x' = 0

while(...)
{
x = phi(x',x'')
x'' = ...
}

Optimistic copy propagation is able to eliminate the phi node entirely. Switching from pessimistic to optimistic copy propagation eliminated about 10% of copy instructions. While this is not a great amount, it did reduce spilling in a few subroutines I looked at, and there is no increase in code complexity from this new approach.
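The pass can be sketched as a fixed-point iteration. The following Python is my own illustration, not Factor's implementation: copies map a value to its source, and a phi becomes a copy once all of its inputs (ignoring the phi itself, which covers loop back edges) resolve to a single value.

```python
# Sketch of optimistic copy elimination over phi nodes (hypothetical IR,
# not Factor's actual data structures). 'copies' maps a value to its copy
# source; 'phis' maps a phi's output to its list of inputs.

def propagate(copies, phis):
    copy_of = dict(copies)

    def resolve(v):
        # Follow copy chains to the canonical representative.
        while v in copy_of:
            v = copy_of[v]
        return v

    changed = True
    while changed:
        changed = False
        for p, inputs in phis.items():
            # Inputs that resolve back to the phi itself (loop-carried
            # values) are ignored -- this is the optimistic assumption.
            sources = {resolve(x) for x in inputs} - {p}
            if len(sources) == 1:
                src = next(iter(sources))
                if copy_of.get(p) != src:
                    copy_of[p] = src
                    changed = True
    return copy_of

# x = phi(x0, x2) in a loop whose body copies x into x2: the phi collapses.
print(propagate({"x2": "x"}, {"x": ["x0", "x2"]}))  # {'x2': 'x', 'x': 'x0'}
```

A pessimistic pass, by contrast, would refuse to touch the phi because the back-edge input x2 is not yet known to be a copy when the phi is first visited.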

Representation selection improvements

The low-level IR now includes three new instructions, ##load-double, ##load-float and ##load-vector. These are intended for loading unboxed floating point values and SIMD vectors into vector registers without using an integer register to temporarily hold a pointer to the boxed value.

Uses of these instructions are inserted by the representation selection pass. If the destination of a ##load-reference is used in an unboxed representation, then instead of inserting a memory load following the ##load-reference, the instruction is replaced with one of the new variants.

Loads from constant addresses directly into floating point and vector registers are only possible on x86, and not PowerPC, so the optimization is not performed on the latter architecture. On x86-32, an immediate displacement can encode any 32-bit address. On x86-64, loading from an absolute 64-bit address requires an integer register, however instruction pointer-relative addressing is supported with a 32-bit displacement. To make use of this capability, the compiler now supports a new "binary literal area" for unboxed data that compiled code can reference. The unboxed data is placed immediately following a word's machine code, allowing RIP-relative addressing to be used on x86-64.

This optimization, together with value numbering improvements, helps reduce pressure on integer registers in floating point loops.
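To see why placing the literal right after the machine code makes RIP-relative addressing work, here is a sketch (my own, with made-up addresses) of the displacement computation: the encoding stores a signed 32-bit offset from the address of the *next* instruction, so nearby data is always in range.

```python
# Sketch of computing a RIP-relative displacement (illustration only).

def rip_displacement(next_insn_address, data_address):
    disp = data_address - next_insn_address
    if not (-2**31 <= disp < 2**31):
        raise ValueError("data not addressable RIP-relative")
    return disp

# Unboxed data placed 0x40 bytes past the end of the instruction:
print(hex(rip_displacement(0x7f0000001000, 0x7f0000001040)))  # 0x40
```

An absolute 64-bit address, by contrast, would overflow the 32-bit field and force the load through an integer register.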

Register allocator improvements

Once representation selection has run, each SSA value has an associated representation and register class. Each representation always belongs to one specific register class; a register class is a finite set of registers from which the register allocator is allowed to choose a register for this type of value.

Before register allocation, the SSA destruction pass transforms ##phi instructions into ##copy instructions, and then performs live range coalescing to eliminate as many of the copies as possible.

Coalescing two SSA values from different register classes does not make sense; the compiler will not be able to generate valid machine code if a single SSA value is used as both a float and an integer, since on most CPUs the integer and floating point register files are distinct.

However, coalescing values with different representations, but the same register class, is okay. Consider the case where a double-precision float is computed, and then converted into a single precision float, and subsequently stored into memory. If the double-precision value is not used after the conversion, then the same register that held the input value can be reused to store the result of the single-precision conversion.

Previously this pass only coalesced values with identical representations; now I've generalized it to coalescing values with the same register class but possibly different representations. This reduces register pressure and spilling.

The tricky part about making it work is that the register allocator needs to know the value's representation, not just its register class, for generating correct spills and reloads. For example, a spilled value with register class float-regs can use anywhere from 4 to 16 bytes in the stack frame, depending on the specific representation: single precision float, double precision float, or SIMD vector. Clearly, if coalescing simply lost this fine-grained representation information and only retained register classes, the register allocator would not have enough information to generate valid code.

The solution is to have live range coalescing compute the equivalence classes of SSA values without actually renaming any usages to point to the canonical representative. The renaming map is made available to the register allocator. Most places in the register allocator where instruction operands are examined make use of the renaming map.

Until the splitting phase, these equivalence classes are in one-to-one correspondence with live intervals, and each live interval has a single machine register and spill slot. However, when spill and reload decisions are made, the register allocator uses the original SSA values to look up representations.

If Factor's compiler did not use SSA form at all, there would still be a copy coalescing pass, and the register allocator could also support coalescing values with different representations, by first performing a dataflow analysis known as reaching definitions. This would propagate representation information to use sites.

Retaining SSA structure all the way until register allocation gives you the results of this reaching definitions analysis "for free". In both the SSA and non-SSA case, you still don't want definitions with multiple different representations reaching the same call site. The SSA equivalent of this would be a phi instruction whose inputs had different representations. The compiler ensures this doesn't happen by first splitting up the set of SSA values into strongly-connected components (SCCs) whose edges are phi nodes. The same representation is then assigned to each member of every SCC; if an agreeable representation is not found then it falls back on boxing and tagging all members.
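A sketch of the last step, in Python (my own illustration; it simplifies by treating phi edges as undirected union-find links rather than computing true SCCs): values linked by phi nodes are grouped together, each group gets the one representation its members agree on, and a disagreement forces the whole group back to boxed, tagged values.

```python
# Sketch of assigning one representation per phi-connected group of SSA
# values (hypothetical data structures, not Factor's IR).

def assign_representations(preferred, phi_edges):
    parent = {v: v for v in preferred}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in phi_edges:       # each phi links its output to each input
        union(a, b)

    # Collect the preferred representations within each group.
    by_root = {}
    for v, rep in preferred.items():
        by_root.setdefault(find(v), set()).add(rep)

    result = {}
    for v in preferred:
        reps = by_root[find(v)]
        # Agreement -> use it; conflict -> fall back to boxing and tagging.
        result[v] = next(iter(reps)) if len(reps) == 1 else "tagged"
    return result

prefs = {"a": "double", "b": "double", "c": "double", "d": "int"}
print(assign_representations(prefs, [("c", "a"), ("c", "b")]))
```

Here c = phi(a, b) links a, b and c into one group that agrees on double, while d stays in its own group; if one member had preferred a different representation, the entire group would have been demoted to tagged.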

Notice that while both representation selection and SSA destruction group values into equivalence classes, the groupings do not correspond to each other and neither one is a refinement of the other. Not all copies resulting from phi nodes get coalesced away, so a single SCC may intersect multiple coalescing classes. This can happen if the first input to a phi node is live at the definition point of the second:

b = ...

if(...)
{
a = ...
foo(b)
}
/* 'c' can be coalesced with 'a' or 'b' -- but not both.
   All three values belong to the same SCC and must have the same
   representation */
c = phi(a,b)

On the other hand, two values from different SCCs might also get coalesced together:

a = some-double-float
/* if 'a' is not live after this instruction, then 'a' and 'b'
   can share a register. They belong to different SCCs and have
   different representations */
b = double-to-single-precision-float(a)

Compiler code cleanups

In addition to just adding new optimizations I've also simplified and refactored the internals of some of the passes I worked on.

One big change is that the "machine representation" used at the very end of the compilation process is gone. Previously, after register allocation had run, the control flow graph would be flattened into a list of instructions with CFG edges replaced by labels and jumps. The code generator would then iterate over this flat representation and emit machine code for each virtual instruction. However since no optimization passes ran on the flat representation, it was generated and then immediately consumed. Combining the linearization pass with the code generation pass let me eliminate about a dozen IR instructions that were specific to the flat representation (anything dealing with labels and jumps to labels). It also sped up compilation time slightly.

Value numbering's common subexpression elimination now uses a simpler and more efficient scheme for hashed lookup of expressions. New algebraic simplifications are also easier to add now, because the def-use graph is easier to traverse.

Some instructions which could be expressed in terms of others were removed, namely the ##string-nth and ##set-string-nth-fast instructions. Reading and writing characters from strings can be done by combining simpler memory operations already present in the IR.

A number of instructions were combined. All of the instructions for reading from memory in different formats (##alien-unsigned-1, ##alien-signed-4, ##alien-double, and so on) have been merged into a single ##load-memory instruction, which has constant operands storing the C type and representation of the data being loaded. Similarly, ##set-alien-signed-8 and friends have all been merged into ##store-memory. This restructuring allows optimization passes to treat all memory accesses in a uniform way.

A bug fix for alias analysis

While working on the above I found an interesting bug in alias analysis: it would incorrectly eliminate loads and stores in certain cases. The optimizer assumed that a pointer to a freshly-allocated object could never alias a pointer read from the heap. The problem, of course, is that such a pointer can be stored into another object and then read back out.

Here is a Factor example that demonstrates the problem. First, a couple of tuple definitions; the type declarations and initial value are important further on.

USING: typed locals ;

TUPLE: inner { value initial: 5 } ;
TUPLE: outer { value inner } ;

Now, a very contrived word which takes two instances of outer, mutates one, and reads a value out of the other:

TYPED: testcase ( x: outer y: outer -- )
inner new :> z ! Create a new instance of 'inner'
z x value<< ! Store 'z' in 'x'
y value>> :> new-z ! Load 'new-z' from 'y'
10 new-z value<< ! Store a value inside 'new-z'
z value>> ! Read a value out of 'z'
;

Note that the initial value of the value slot in the inner tuple is 5, so inner new creates a new instance holding the value 5. In the case where x and y point at the same object, new-z and z will point to the same object, and the newly-allocated object's value is set to 10. However, the compiler did not realize this, and erroneously constant-folded z value>> down to 5!

This bug never came up in actual code; the facepalm moment came while I was doing some code refactoring and noticed that the above case was not being handled.

Instruction scheduling to reduce register pressure

I'm not the only one who's been busy improving the Factor compiler. One of the co-inventors of Factor has cooked up an instruction scheduler and improved method inlining. The scheduler has been merged, and it looks like the method inlining improvements should be ready in a few days.

Some benchmarks

Here is a comparison between the April 17th build of Factor and the latest code from the Git repository. I ran a few of the language shootout benchmarks on a 2.4 GHz MacBook Pro. Factor was built in 32-bit mode, because the new optimizations are most effective on the register-starved x86.

Benchmark      Before (ms)  After (ms)
nbody-simd     415          349
fasta          2667         2485
knucleotide    182          177
regex-dna      109          86
spectral-norm  1641         1347

Friday, April 16, 2010

Factor 0.93 now available

After two months of development, Factor 0.93 is now available for download from the Factor website. A big thanks to all the contributors, testers and users. This release would not be possible without your help. A summary of the most important changes follows:

Incompatible changes:

  • Factor no longer supports NetBSD, due to limitations in that operating system. (Slava Pestov)
  • sets, hash-sets, bit-sets: these vocabularies have been redesigned around a new generic protocol for sets (Daniel Ehrenberg)
  • Strings can no longer be used to refer to C types in FFI calls; you must use C type words instead. Also, ABI specifiers are now symbols rather than strings. So for example, the following code will no longer work:
    : my-callback ( -- callback ) "int" { "int" "int" } "cdecl" [ + ] alien-callback ;
    you must now write:
    : my-callback ( -- callback ) int { int int } cdecl [ + ] alien-callback ;
    (Joe Groff)
  • The behavior of string types has changed. The char* C type is now just a bare pointer; to get the automatic conversion to and from Factor strings, use the c-string type. See the documentation for details. (Joe Groff)
  • FFI function return values which are pointers to structs are now boxed in a struct class, instead of returning a bare alien. This means that many memory>struct calls can be removed. If the return value is actually a pointer to an array of structs, use specialized arrays as before. (Joe Groff)
  • C-ENUM: now takes an additional parameter which is either the name of a new C type to define, or f. This type is aliased to int. (Erik Charlebois)
  • The stack checker now supports row polymorphic stack effects, for improved error checking. See the documentation for details. (Joe Groff)

New features:

  • Co-operative threads now use efficient context-switching primitives instead of copying stacks with continuations (Slava Pestov)
  • Added final class declarations, to prohibit classes from being subclassed. This enables a few compiler optimizations (Slava Pestov)
  • Added platform support metadata for vocabularies. Vocabulary directories can now contain a platforms.txt file listing operating system names which they can be loaded under. (Slava Pestov)
  • The deploy tool can now deploy native libraries and resource files, and supports custom icons. (Joe Groff)
  • Joint behavior of vocabularies - the new require-when word can be used to express that a vocabulary should be loaded if two existing vocabularies have been loaded (Daniel Ehrenberg)
  • Callstack overflow is now caught and reported on all platforms except for Windows and OpenBSD. (Slava Pestov)
  • The prettyprinter now has some limits set by default. This prevents out of memory errors when printing large structures, such as the global namespace. Use the without-limits combinator to disable limits. (Slava Pestov)
  • Added fastcall and thiscall ABIs on x86-32 (Joe Groff)
  • The build farm's version release process is now more automated (Slava Pestov)
Improved libraries:
  • delegate: add BROADCAST: syntax, which delegates a generic word with no outputs to an array of multiple delegates. (Joe Groff)
  • game.input: X11 support. (William Schlieper)
  • gpu: geometry shader support. (Joe Groff)
  • opengl: OpenGL 4.0 support. (Joe Groff)
New libraries:
  • bit.ly: Factor interface to the bit.ly URL shortening service. (Slava Pestov)
  • chipmunk: binding for Chipmunk 2D physics library. (Erik Charlebois)
  • cuda: binding to NVidia's CUDA API for GPU computing. (Doug Coleman, Joe Groff)
  • cursors: experimental library for iterating over collections, inspired by STL iterators. (Joe Groff)
  • elf: parsing ELF binaries. The elf.nm vocabulary is a demo which prints all symbols in an ELF binary. (Erik Charlebois)
  • images.pbm, images.pgm, images.ppm: libraries for loading PBM, PGM and PPM images (Erik Charlebois)
  • macho: parsing Mach-O binaries. (Erik Charlebois)
  • opencl: binding for the OpenCL standard interface for GPU computing. (Erik Charlebois)
  • path-finding: implementation of A* and BFS path finding algorithms. (Samuel Tardieu)
  • windows.ddk: binding for Windows hardware abstraction layer. (Erik Charlebois)

Thursday, April 08, 2010

Frame-based structured exception handling on Windows x86-64

Factor used to use vectored exception handlers, registered with AddVectoredExceptionHandler; however, vectored handlers are somewhat problematic. A vectored handler is always called prior to any frame-based handlers, so Factor could end up reporting bogus exceptions if the FFI is used to call a library that uses SEH internally. This prompted me to switch to frame-based exception handlers. Unfortunately, these are considerably more complex to use, and the implementation differs between 32-bit and 64-bit Windows.

I briefly discussed frame-based SEH on 32-bit Windows in my previous post. During the switch to 64 bits, Microsoft got rid of the old frame-based SEH implementation and introduced a new, lower-overhead approach. Instead of pushing and popping exception handlers onto a linked list at runtime, the system maintains a set of function tables, where each function table stores exception handling and stack frame unwinding information.

Normally, the 64-bit Windows function tables are written into the executable by the native compiler. However, language implementations which generate code at runtime need to be able to define new function tables dynamically. This is done with the RtlAddFunctionTable() function.

It took me a while to figure out the correct way to call this function. I found the os_windows_x86.cpp source file from Sun's HotSpot Java implementation was very helpful, and I based my code on the register_code_area() function from this file.

Factor and HotSpot only use function tables in a very simple manner, to set up an exception handler. Function tables can also be used to define stack unwinding behavior; this allows debuggers to generate backtraces, and so on. Doing that is more complicated and I don't understand how it works, so I won't attempt to discuss it here.

The RtlAddFunctionTable() function takes an array of RUNTIME_FUNCTION structures and a base address. For some unknown reason, all pointers in the structures passed to this function are 32-bit integers relative to the base address.

For a runtime compiler that does not need to perform unwinding, it suffices to map the entire code heap to one RUNTIME_FUNCTION. A RUNTIME_FUNCTION has three fields:

  • BeginAddress - the start of the function
  • EndAddress - the end of the function
  • UnwindData - a pointer to unwind data

All pointers are relative to the base address passed into RtlAddFunctionTable(). The unwind data can take various forms. For the simple case of no unwind codes and an exception handler, the following structure is used:

struct UNWIND_INFO {
    UBYTE Version:3;
    UBYTE Flags:5;
    UBYTE SizeOfProlog;
    UBYTE CountOfCodes;
    UBYTE FrameRegister:4;
    UBYTE FrameOffset:4;
    ULONG ExceptionHandler;
    ULONG ExceptionData[1];
};

The Version and Flags fields should be set to 1, the ExceptionHandler field set to a function pointer, and the rest of the fields set to 0. The exception handler pointer is relative to the base address, and it must also be within the memory range specified by the BeginAddress and EndAddress fields of the RUNTIME_FUNCTION structure. The exception handler has the same function signature as in the 32-bit SEH case:

LONG exception_handler(PEXCEPTION_RECORD e, void *frame, PCONTEXT c, void *dispatch)

In both Factor and HotSpot, the exception handler is a C function, however the RtlAddFunctionTable() API requires that it be within the bounds of the runtime code heap. To get around this restriction, both VMs allocate a small trampoline in the code heap which simply jumps to the exception handler, and use a pointer to the trampoline instead. Similarly, because the "pointers" in these structures are actually 32-bit integers, it helps to allocate the RUNTIME_FUNCTION and UNWIND_INFO structures in the code heap as well, to ensure that everything is within the same 4 GB memory range.

The above explanation probably didn't make much sense, so I invite you to check out the source code instead: os-windows-nt-x86.64.cpp.

Sunday, April 04, 2010

Switching call stacks on different platforms

User-space implementations of coroutines and green threads need to be able to switch the CPU's call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I'll limit my discussion to CPUs and platforms that Factor supports.

System APIs for switching call stacks

  • Windows has an API for creating and switching between contexts, which it calls "fibers". The main functions to look at are CreateFiber() and SwitchToFiber(). On Windows, by far the easiest way to switch call stacks is to just use fibers.
  • Most Unix systems have a set of functions operating on ucontexts, such as makecontext(), swapcontext() and so on. On some systems, these APIs are poorly implemented and documented.
  • The C standard library functions setjmp() and longjmp() can be (ab)used to switch contexts. The jmp_buf structure written by setjmp() and read by longjmp() contains a snapshot of all registers, including the call stack pointer. Once you've allocated your own call stack, you can capture a jmp_buf with setjmp(), change the call stack pointer, and then resume execution with longjmp(). As far as I know, this is completely undocumented and unsupported with every major C compiler.
  • The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the ESP register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.

High-level libraries for switching call stacks

A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.

  • libcoroutine uses fibers on Windows, and ucontext and longjmp on Unix.
  • libcoro uses a handful of Unix-specific APIs if they're available, falling back on hand-coded assembly routines; the assembly fallback also works on Windows.

NetBSD/x86 limitation

There is no reliable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of libpthread. Even if your program doesn't use pthreads, it might link with a library that does, such as Glib. The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as malloc(), almost nothing will work as a result.

The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread's unique ID can just be the result of masking the call stack pointer by the maximum call stack size.

Windows-specific concerns

So far, I've only worked out the details of switching call stacks on 32-bit Windows. I'll talk about 64-bit Windows in another post.

Windows has a mechanism known as Structured Exception Handling, which is a sort of scoped exception handling mechanism with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.

On Windows, simply changing the ESP register to point at another memory region is insufficient because structured exception handling must know the current call stack's beginning and end, as well as a pointer to the innermost exception handler record.

These three pieces of information are stored in a per-thread information block, known as the TIB. The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.

The TIB

The TIB is defined by the NT_TIB struct in winnt.h:


typedef struct _NT_TIB {
    struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;
    PVOID StackBase;
    PVOID StackLimit;
    PVOID SubSystemTib;
    _ANONYMOUS_UNION union {
        PVOID FiberData;
        DWORD Version;
    } DUMMYUNIONNAME;
    PVOID ArbitraryUserPointer;
    struct _NT_TIB *Self;
} NT_TIB, *PNT_TIB;

The relevant fields that must be updated when switching call stacks are ExceptionList, StackBase and StackLimit.

Note that since the x86 call stack grows down, StackLimit is the start of the call stack's memory region and StackBase is the end.

Structured exception handlers

Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the ExceptionList field of the TIB:

typedef struct _EXCEPTION_REGISTRATION_RECORD
{
    PEXCEPTION_REGISTRATION_RECORD Next;
    PEXCEPTION_DISPOSITION Handler;
} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;

The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (ExceptionList set to NULL).

If the ExceptionList field of the TIB or the Next field of an exception record point outside the call stack, then the handler in question will not be called at all.

Accessing the TIB

On x86-32, the current thread's TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The Self field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:

fetch_tib:
    mov eax,[fs:24]
    ret

This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.

Sample code for updating Windows-specific structures

The basis/cpu/x86/32/winnt/bootstrap.factor file defines assembly subroutines for updating the TIB when switching call stacks. These routines are invoked by the x86-32 non-optimizing compiler backend.

Tuesday, March 23, 2010

Incompatible change in Mach exception handler behavior on Mac OS X 10.6

On Mac OS X, Factor uses Mach exceptions rather than Unix signals to receive notifications of illegal memory accesses and arithmetic exceptions from the operating system. This is used to catch datastack underflows, division by zero, and the like. The code implementing this can be found in vm/mach_signal.c in the Factor repository. This file is based on code from the GNU libsigsegv project, with special permission to use it under a BSD license instead of the restrictive GPL.

It seems that as of Mac OS X 10.6, exceptions raised by child processes are now reported to the parent if the parent has an exception handler thread. This caused a problem in Factor if a child process crashed with an access violation; Factor would think the access violation occurred in Factor code, and die with an assertion failure when attempting to map the thread ID back into a Factor VM object. The simplest fix seems to be adding the following clause to the catch_exception_raise() function:

if(task != mach_task_self()) return KERN_FAILURE;

This fix should be applicable to the original libsigsegv code as well, along with other projects, such as CLisp, that use this library.

Thursday, March 11, 2010

Adding a recaptcha to the Factor pastebin

Lately the Factor pastebin has been flooded with penis enlargement spam. The pastebin had its own very weak captcha that worked well enough for a while: there was a form field that was required to be left blank, and if it was not blank, validation would fail. However, spammers had started working around it, so we needed something stronger. I decided to solve the problem once and for all by integrating Doug Coleman's furnace.recaptcha vocabulary into the pastebin. Doing this was surprisingly easy.

First, I changed the USING: form in webapps.pastebin to reference the furnace.recaptcha vocabulary:

USING: namespaces assocs sorting sequences kernel accessors
hashtables db.types db.tuples db combinators
calendar calendar.format math.parser math.order syndication urls
xml.writer xmode.catalog validators
html.forms
html.components
html.templates.chloe
http.server
http.server.dispatchers
http.server.redirection
http.server.responses
furnace
furnace.actions
furnace.redirection
furnace.auth
furnace.auth.login
furnace.boilerplate
furnace.recaptcha
furnace.syndication
furnace.conversations ;

Next, I changed the validate-entity word. This word is used to validate a form submission when adding a new paste or a new annotation. Instead of validating a captcha using the old simple scheme, it now calls validate-recaptcha:

: validate-entity ( -- )
    {
        { "summary" [ v-one-line ] }
        { "author" [ v-one-line ] }
        { "mode" [ v-mode ] }
        { "contents" [ v-required ] }
    } validate-params
    validate-recaptcha ;

Finally, I edited the new-paste.xml and new-annotation.xml templates to add a recaptcha component inside the t:form tag:

<tr><td colspan="2"><t:recaptcha /></td></tr>

The implementation of the furnace.recaptcha vocabulary is very straightforward and takes advantage of several features of Factor's web framework. It uses http.client to communicate with the recaptcha server, and furnace.conversations to pass the validation error between requests. Finally, it uses the CHLOE: parsing word to define the t:recaptcha tag for use in templates:

CHLOE: recaptcha drop [ render-recaptcha ] [xml-code] ;