Factor: a practical stack language: September 2010

Saturday, September 18, 2010

Factor 0.94 now available

Factor 0.94 is now available from the Factor website, five months after the previous release, Factor 0.93. Binaries are provided for 10 platforms.

As usual, contributors did most of the work. Thanks to Daniel Ehrenberg, Dmitry Shubin, Doug Coleman, Erik Charlebois, Joe Groff, John Benediktsson, Jose A. Ortega Ruiz, Niklas Waern, Samuel Tardieu, Sascha Matzke and everyone else who helped out this time around!

Incompatible changes:

The PowerPC architecture is no longer supported. (Slava Pestov)
The require-when word now supports dependencies on multiple vocabularies. (Daniel Ehrenberg)
The C-ENUM: word in the C library interface has been replaced with ENUM:, a much improved word for defining type-safe enumerations. (Erik Charlebois, Joe Groff)
Tuple slot setter words with stack effect ( value object -- ) are now named foo<< instead of (>>foo). Most code is unaffected since it uses the >>foo form. (Slava Pestov)
The older system-micros word, which returned microseconds since the Unix epoch as an integer, has been removed. For a while, the recommended way to get the current time has been to call the now word from the calendar vocabulary, which returned a timestamp instance. (Doug Coleman)
A few sequence-related words were moved from the generalizations vocabulary to sequences.generalizations. (Slava Pestov)
The alarms vocabulary has been renamed to timers to better explain its true purpose, with improved timing accuracy and robustness. (Doug Coleman)
cocoa.subclassing: the syntax for defining new Objective-C classes has been changed to improve readability. (Slava Pestov)
io.streams.limited: the ability to throw errors on EOF was extracted from limited streams, and limited streams simplified as a result. Throwing on EOF is now implemented by the io.streams.throwing vocabulary. (Doug Coleman)

New libraries:

boyer-moore: efficient text search algorithm (Dmitry Shubin)
checksums.internet: implementation of checksum algorithm used by ICMP for the checksums framework (John Benediktsson)
gdbm: disk-based hash library binding (Dmitry Shubin)
io.encodings.detect: binary file/text encoding detection heuristics from jEdit (Joe Groff)
javascriptcore: FFI to the WebKit JavaScript engine (Doug Coleman)
lua: FFI to the Lua scripting language (Erik Charlebois)
oauth: minimal implementation of client-side OAuth (Slava Pestov)
sequences.unrolled: efficient unrolled loops with constant iteration count (Joe Groff)

Improved libraries:

cuda: various improvements (Joe Groff, Doug Coleman)
game.input: now uses XInput2 on X11 (Niklas Waern)
gpu.render, gpu.buffers: various improvements (Joe Groff)
math.combinatorics: various improvements (John Benediktsson)
math.primes: various improvements (Samuel Tardieu)
math.vectors.simd.cords: compound "256-bit" SIMD types now support the full set of SIMD operations (Joe Groff)
mongodb: various improvements (Sascha Matzke)
twitter: now uses OAuth as required by Twitter (Slava Pestov)
unix.users, unix.groups: various improvements (Doug Coleman)

Compiler improvements:

Improved instruction selection, copy propagation, representation selection, and register allocation; details in a blog post. (Slava Pestov)
An instruction scheduling pass now runs prior to register allocation, intended to reduce register pressure by moving uses closer to definitions; details in a blog post. (Daniel Ehrenberg)
The code generation for the C library interface has been revamped; details in a blog post. (Slava Pestov)
Something similar to what C++ and C# programmers refer to as "value types"; binary data can now be allocated on the call stack, and passed to C functions like any other pointer. The with-out-parameters combinator replaces tricky code for allocating and juggling multiple temporary byte arrays used as out parameters for C function calls, making this idiom easier to read and more efficient. The with-scoped-allocation combinator presents a more general, lower-level interface. (Slava Pestov)
The compiler can now use the x87 floating point unit on older CPUs where SSE2 is not available. However, this is not recommended, because the build farm does not test Factor (or build any binaries) in x87 mode, so this support can break at any time. To use x87 code generation, you must download the source code and bootstrap Factor yourself, on a CPU without SSE2. (Slava Pestov)

Miscellaneous improvements:

fuel: Factor's Ultimate Emacs Library has seen many improvements, and also some keyboard shortcuts have changed; see the README. (Erik Charlebois, Dmitry Shubin, Jose A. Ortega Ruiz)
A new factor.cmd script is now included in the build-support directory, to automate the update/build/bootstrap cycle for those who build from source. Its functionality is a subset of the factor.sh script for Unix. (Joe Groff)
The default set of icons shipped in misc has been tweaked, with better contrast and improved appearance when scaled down. (Joe Groff)

Saturday, September 11, 2010

An overview of Factor's I/O library

Factor has grown a very powerful and high-level I/O library over the years, however it is easy to get lost in the forest of reference documentation surrounding the io vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for futher reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.

Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular Doug Coleman did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.

Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX we try to come up with clear and conherent APIs that make sense on both Windows and Unix.

First example: converting a text file from MacRoman to UTF8

The io.files vocabulary defines words for reading and writing files. It supports two modes of operation in a pretty standard fashion:

Stream-based: with-file-reader, with-file-writer
Entire file: file-contents, set-file-contents, file-lines, set-file-lines

What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.

Let's convert foo.txt from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:

USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;

"foo.txt" mac-roman file-contents
"foo.txt" utf8 set-file-contents

This is a very simple and concise implementation but it has the downside that the entire file is read into memory. For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:

USING: io io.encodings.8-bit.mac-roman io.encodings.utf8
io.files ;

"out.txt" utf8 [
    "in.txt" mac-roman [
        [ print ] each-line
    ] with-file-reader
] with-file-writer

Converting a directory full of files from MacRoman to UTF8

The io.files vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:

USING: io.directories io.encodings.8-bit.mac-roman
io.encodings.utf8 io.files ;

: convert-directory ( path -- )
    [
        [
            [ mac-roman file-contents ] keep
            utf8 set-file-contents
        ] each
    ] with-directory-files ;

An aside: generalizing the "current working directory"

If you run the following, you will see that with-directory-files returns relative, and not absolute, file names:

"/path/to/some/directory"
[ [ print ] each ] with-directory-files

So the question is, how did file-contents above know what directory to look for files in? The answer is that in addition to calling the quotation with the directory's contents, the with-directory-files word also rebinds the current-directory dynamic variable.

This directory is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically-scoped; within the quotation passed to the with-directory combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, or code after the quotation, is unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the normalize-pathname word before being handed off to the operating system.

When calling a shell command with io.launcher, the child process is run from the Factor current-directory so relative pathnames passed on the command line will just work. However, when making C FFI calls which take pathnames, you pass in absolute paths only, or normalize the path with normalize-path first, otherwise the C code wlll search for it in the wrong place.

Checking free disk space

The io.files.info vocabulary defines two words which return tuples containing information about a file, and the file system containing the file, respectively:

Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:

ERROR: buy-a-new-disk ;

: gb ( m -- n ) 30 2^ * ;

: check-space ( -- )
    home file-system-info free-space>> 10 gb <
    [ buy-a-new-disk ] when ;

Now if there is less than 10gb available, the check-space word will throw a buy-a-new-disk error.

The file-system-info word reports a bunch of other info. There is a Factor implementation of the Unix df command in the tools.files vocabulary:

( scratchpad ) file-systems.
+device-name+ +available-space+ +free-space+ +used-space+ +total-space+ +percent-used+ +mount-point+
/dev/disk0s2  15955816448       16217960448  183487713280 199705673728  91             /
fdesc         0                 0            1024         1024          100            /dev
fdesc         0                 0            1024         1024          100            /dev
map -hosts    0                 0            0            0             0              /net
map auto_home 0                 0            0            0             0              /home
/dev/disk1s2  15922262016       15922262016  383489052672 399411314688  96             /Users/slava

Doug has two blog posts about these features, part 1 and part 2.

Unix only: symbolic links

Factor knows about symbolic links on Unix. The io.files.links vocabulary defines a pair of words, make-link and make-hard-link. The link-info word is like file-info except it doesn't follow symbolic links. Finally, the directory hierarchy traversal words don't follow links, so a link cycle or bogus link to / somewhere won't break everything.

File system monitoring

The io.monitors vocabulary implements real-time file and directory change monitoring. Unfortunately at this point in time, it is only supported on Windows, Linux and Mac. Neither one of FreeBSD and OpenBSD exposes the necessary information to user-space.

Here is an example for watching a directory for changes, and logging them:

USE: io.monitors

: watch-loop ( monitor -- )
    dup next-change path>> print flush watch-loop ;

: watch-directory ( path -- )
    [ t [ watch-loop ] with-monitor ] with-monitors ;

Try pasting the above code into a Factor listener window, and then run home watch-directory. Every time a file in your home directory is modified, its full pathname will be printed in the listener.

Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.

Memory mapped files

The io.mmap vocabulary defines support for working with memory-mapped files. The highest-level and easiest to use interface is the with-mapped-array combinator. It takes a file name, a C type, and a quotation. The quotation can perform generic sequence operations on the mapped file.

Here is an example which reverses each group of 4 bytes:

USING: alien.c-types grouping io.mmap sequences
specialized-arrays ;
SPECIALIZED-ARRAY: char

"mydata.dat" char [
    4 <sliced-groups>
    [ reverse! drop ] each
] with-mapped-array

The <sliced-groups> word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.

A more efficient implementation of the above is also possible, by mapping in the file as an int array and then performing bitwise arithmetic on the elements.

Launching processes

Factor's io.launcher vocabulary was originally developed for use by the build farm. The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.

The central concept in the library is the process, tuple, constructed by calling <process>. Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. Then the process can be run in various ways, running in the foreground, in the background, or with input and output attached to Factor streams.

The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read or write from, a file name to append to, or even supply a pipe object, constructed from the io.pipes vocabulary.

<process>
    "rotate-logs" >>command
    +closed+ >>stdin
    "out.txt" >>stdout
    "error.log" <appender> >>stderr

It is possible to specify a timeout when running a process:

<process>
    { "ssh" "myhost" "-l" "jane" "do-calculation" } >>command
    15 minutes >>timeout
    "results.txt" >>stdout
run-process

The process will be killed if it runs for longer than the timeout period. Many other features are supported; setting environment variables, setting process priority, and so on. The io.launcher vocabulary has all the details.

Support for timeouts is a cross-cutting concern that touches many ports of the I/O API. This support is consolidated in the io.timeouts vocabulary. The set-timeout generic word is supported by all external resources which provide interruptible blocking operations.

Timeouts are implemented on top of our monotonic timer support; changing your system clock while Factor is running won't screw with active network connections.

Unix only: file ownership and permissions

The io.files.unix vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. An essential component of any multi-language quine:

USING: io.encodings.ascii io.files io.files.info.unix
io.launcher ;

"""
#!/bin/sh
echo "Hello, polyglot"
""" "script.sh" ascii set-file-contents
OCT: 755 "script.sh" set-file-permissions
"./script.sh" run-process

There are even more Unix-specific words in the unix.users and unix.groups vocabularies. Using these words enables listing all users on the system, converting user names to UIDs and back, and even setuid and setgid.

Networking

Factor's io.sockets vocabulary supports stream and packet-based networking.

Stream-based: with-client, <server>, accept
Packet-based: <datagram>, send, receive

Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. When a network socket is constructed, that endpoint is bound to a given address specifier.

Connecting to http://www.apple.com, sending a GET request, and reading the result:

USING: io io.encodings.utf8 io.sockets ;

"www.apple.com" 80 <inet> utf8 [
    """GET / HTTP/1.1\r
host: www.apple.com\r
connection: close\r
\r
""" write flush
    contents
] with-client
print

SSL support is almost transparent; the only difference is that the address specifier is wrapped in <secure>:

USING: io io.encodings.utf8 io.sockets
io.sockets.secure ;

"www.cia.gov" 443 <inet> <secure> utf8 [
    """GET / HTTP/1.1\r
host: www.cia.gov\r
connection: close\r
\r
""" write flush
    contents
] with-client
print

For details, see the io.sockets.secure documentation, and my blog post about SSL in Factor..

Of course you'd never send HTTP requests directly using sockets; instead you'd use the http.client vocabulary.

Network servers

Factor's io.servers.connection vocabulary is so cool, that a couple of years back I made a screencast about it. Nowadays the sample application developed in that screencast is in the extra/time-server; the implementation is very concise and elegant.

Under the hood

All of this functionality is implemented in pure Factor code on top of our excellent C library interface and extensive bindings to POSIX and Win32 in the unix and windows vocabulary hierarchies, respectively.

As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently rewrote the coroutines implementation to use direct context switching rather than continuations.

Co-ordination and scheduling of coroutines is handled with a series of simple concurrency abstractions.

Sunday, September 05, 2010

Making Factor's continuous build system more robust

I've done some work on Factor's continuous build system over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from factorcode.org was built, tested and benchmarked by mason.

Checking for disk space

Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. If less than 1 Gb is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the latest Mac OS X report. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.

I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform file-system-info library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.

Git repository corruption

Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error:

From git://factorcode.org/git/factor
 * branch            master     -> FETCH_HEAD
Updating d386ea7..eece1e3

error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.

Of course nobody actually edits files in the repository in question, its a clone of the official repo that gets updated every 5 minutes. Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.

If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.

Error e-mail throttling

Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to Factor-builds. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the Factor-builds list would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.

More robust new code check

Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).

Build system dashboard

I've put together a simple dashboard page showing build system status. Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days after, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.

The dashboard nicely complements my earlier work on live status display for the build farm.

Conclusion

I think mason is one of the most advanced continuous integration systems among open source language implementations, nevermind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. Here is a selection of the functionality from Factor's standard library used by mason:

io.files.info - checking disk usage
io.launcher - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested
io.timeouts - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy
http.client - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature
smtp - sending build report e-mails
twitter - Tweeting binary upload notifications
oauth - yes, Factor has a library to support the feared OAuth. Everyone complains about how hard OAuth is, but if you have easy to use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.
xml.syntax - constructing HTML-formatted build reports using XML literal syntax

Friday, September 03, 2010

Two things every Unix developer should know

Unix programming can be tricky. There are many subtleties many developers are not aware of. In this post, I will describe just two of them... my favorite Unix quirks, if you will.

Interruptible system calls

On Unix, any system call which blocks can potentially fail with an errno of EINTR, which indicates that the caller must retry the system call. The EINTR error can be raised at any time for any reason, so essentially every I/O operation on a Unix system must be prepared to handle this error properly. Surprisingly to some, this includes the C standard library functions such as fread(), fwrite(), and so on.

For example, if you are writing a network server, then most of the time, you want to ignore the SIGPIPE signal which is raised when the client closes its end of a socket. However, this ignored signal can cause some pending I/O in the server to return EINTR.

A commonly-held belief is that setting the SA_RESTART flag with the sigaction() system call means that if that signal is delivered, system calls are restarted for you and EINTR doesn't need to be handled. Unfortunately this is not true. The reason is that certain signals are unmaskable. For instance, on Mac OS X, if your process is blocking reading on standard input, and the user suspends the program by sending it SIGSTOP (usually by pressing ^Z in the terminal), then upon resumption, your read() call will immediately fail with EINTR.

Don't believe me? The Mac OS X cat program is not actually interrupt-safe, and has this bug. Run cat with no arguments in a terminal, press ^Z, then type %1, and you'll get an error from cat!

$ cat
^Z
[1]+  Stopped                 cat
$ %1
cat
cat: stdin: Interrupted system call
$

As far as I'm aware, Factor properly handles interruptible system calls, and has for a while now, thanks to Doug Coleman explaining the issue to me 4 years ago. Not having to deal with crap like this (not to mention being able to write cross-platform code that runs on both Unix and Windows) is one of the advantages of using a high-level language like Factor or Java over C.

Subprocesses inherit semi-random things from the parent process

When you fork() your process, various things are copied from the parent to the child; environment variables, file descriptors, the ignored signal mask, and so on. Less obvious is the fact that exec() doesn't reset everything. If shared file descriptors such as stdin and stdout were set to non-blocking in the parent, the child will start with these descriptors non-blocking also, which will most likely break most programs. I've blogged about this problem before.

A similar issue is that if you elect to ignore certain signals with the SIG_IGN action using sigaction(), then subprocesses will inherit this behavior. Again, this can break processes. Until yesterday, Factor would ignore SIGPIPE using this mechanism, and child processes spawned with the io.launcher vocabulary that expected to receive SIGPIPE would not work properly. There are various workarounds; you can reset the signal mask before the exec() call, or you can do what I did in Factor, and ignore the signal by giving it an empty signal handler instead of a SIG_IGN action.

Factor: a practical stack language