Saturday, September 11, 2010

An overview of Factor's I/O library

Factor has grown a very powerful and high-level I/O library over the years; however, it is easy to get lost in the forest of reference documentation surrounding the io vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for further reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.

Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular Doug Coleman did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.

Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX we try to come up with clear and coherent APIs that make sense on both Windows and Unix.

First example: converting a text file from MacRoman to UTF8

The io.files vocabulary defines words for reading and writing files. It supports two modes of operation in a pretty standard fashion: reading or writing a whole file at once, and streaming through a file with a reader or writer.

What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.
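
To make the distinction between strings and bytes concrete, here is a minimal sketch using the encode and decode words from the io.encodings.string vocabulary. That vocabulary is not otherwise used in this post, and the sample string is arbitrary:

```factor
USING: io.encodings.string io.encodings.utf8 prettyprint
sequences ;

! A string is a sequence of code points: four of them here.
"café" length .

! Encoding to UTF8 yields a byte array; the é needs two bytes,
! so the byte array is one element longer than the string.
"café" utf8 encode length .

! Decoding the bytes with the same encoding gives the string back.
"café" utf8 encode utf8 decode .
```

The same encoding objects (utf8, mac-roman, ascii and so on) are what the file and socket words take as their encoding parameter.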

Let's convert foo.txt from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:

USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;

"foo.txt" mac-roman file-contents
"foo.txt" utf8 set-file-contents

This is a very simple and concise implementation but it has the downside that the entire file is read into memory. For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:

USING: io io.encodings.8-bit.mac-roman io.encodings.utf8
io.files ;

"out.txt" utf8 [
    "in.txt" mac-roman [
        [ print ] each-line
    ] with-file-reader
] with-file-writer

Converting a directory full of files from MacRoman to UTF8

The io.directories vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:

USING: io.directories io.encodings.8-bit.mac-roman
io.encodings.utf8 io.files kernel sequences ;

: convert-directory ( path -- )
    [
        [
            [ mac-roman file-contents ] keep
            utf8 set-file-contents
        ] each
    ] with-directory-files ;

An aside: generalizing the "current working directory"

If you run the following with a directory path on the stack, you will see that with-directory-files returns relative, and not absolute, file names:

[ [ print ] each ] with-directory-files

So the question is, how did file-contents above know what directory to look for files in? The answer is that in addition to calling the quotation with the directory's contents, the with-directory-files word also rebinds the current-directory dynamic variable.

This variable is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically scoped: within the quotation passed to the with-directory combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, and code after the quotation, are unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the normalize-path word before being handed off to the operating system.

When calling a shell command with io.launcher, the child process is run from the Factor current-directory, so relative pathnames passed on the command line will just work. However, when making C FFI calls which take pathnames, you must either pass in absolute paths or normalize the path with normalize-path first; otherwise the C code will search for it in the wrong place.
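
A small sketch of the dynamic scoping described above. The directory /tmp and the file name foo.txt are placeholders, and the expected path assumes a Unix system:

```factor
USING: io io.directories io.pathnames ;

"/tmp" [
    ! Inside the quotation, relative paths resolve against /tmp,
    ! so on Unix this should print /tmp/foo.txt.
    "foo.txt" normalize-path print
] with-directory

! After the quotation returns, the previous current-directory
! is back in effect, so the same relative path resolves elsewhere.
"foo.txt" normalize-path print
```

This is the same normalization you would apply by hand before passing a relative path to a C function.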

Checking free disk space

The io.files.info vocabulary defines two words, file-info and file-system-info, which return tuples containing information about a file and about the file system containing the file, respectively.

Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:

ERROR: buy-a-new-disk ;

: gb ( m -- n ) 30 2^ * ;

: check-space ( -- )
    home file-system-info free-space>> 10 gb <
    [ buy-a-new-disk ] when ;

Now if there is less than 10 GB available, the check-space word will throw a buy-a-new-disk error.

The file-system-info word reports a bunch of other info. There is a Factor implementation of the Unix df command in the tools.files vocabulary:

( scratchpad ) file-systems.
+device-name+  +available-space+  +free-space+  +used-space+  +total-space+  +percent-used+  +mount-point+
/dev/disk0s2   15955816448        16217960448   183487713280  199705673728   91              /
fdesc          0                  0             1024          1024           100             /dev
fdesc          0                  0             1024          1024           100             /dev
map -hosts     0                  0             0             0              0               /net
map auto_home  0                  0             0             0              0               /home
/dev/disk1s2   15922262016        15922262016   383489052672  399411314688   96              /Users/slava

Doug has two blog posts about these features, part 1 and part 2.

Unix only: symbolic links

Factor knows about symbolic links on Unix. The io.files.links vocabulary defines a pair of words, make-link and make-hard-link. The link-info word is like file-info except it doesn't follow symbolic links. Finally, the directory hierarchy traversal words don't follow links, so a link cycle or bogus link to / somewhere won't break everything.
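
Here is a rough sketch of the difference between link-info and file-info on Unix. The file names target.txt and alias.txt are placeholders, and I am assuming make-link takes the target first and the link name second:

```factor
USING: accessors io.encodings.utf8 io.files io.files.info
io.files.links prettyprint ;

! Create a regular file and a symbolic link pointing at it.
"hello" "target.txt" utf8 set-file-contents
"target.txt" "alias.txt" make-link

! link-info describes the link itself...
"alias.txt" link-info type>> .

! ...while file-info follows the link and describes the target.
"alias.txt" file-info type>> .
```

The first line printed should be the symbolic link type and the second the regular file type.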

File system monitoring

The io.monitors vocabulary implements real-time file and directory change monitoring. Unfortunately, at this point in time it is only supported on Windows, Linux and Mac; neither FreeBSD nor OpenBSD exposes the necessary information to user space.

Here is an example for watching a directory for changes, and logging them:

USING: accessors io io.monitors kernel ;

: watch-loop ( monitor -- )
    dup next-change path>> print flush watch-loop ;

: watch-directory ( path -- )
    [ t [ watch-loop ] with-monitor ] with-monitors ;

Try pasting the above code into a Factor listener window, and then run home watch-directory. Every time a file in your home directory is modified, its full pathname will be printed in the listener.

Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.

Memory mapped files

The io.mmap vocabulary defines support for working with memory-mapped files. The highest-level and easiest to use interface is the with-mapped-array combinator. It takes a file name, a C type, and a quotation. The quotation can perform generic sequence operations on the mapped file.

Here is an example which reverses each group of 4 bytes:

USING: alien.c-types grouping io.mmap sequences
specialized-arrays ;

"mydata.dat" char [
    4 <sliced-groups>
    [ reverse! drop ] each
] with-mapped-array

The <sliced-groups> word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.

A more efficient implementation of the above is also possible, by mapping in the file as an int array and then performing bitwise arithmetic on the elements.

Launching processes

Factor's io.launcher vocabulary was originally developed for use by the build farm. The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.

The central concept in the library is the process tuple, constructed by calling <process>. Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. Then the process can be run in various ways: in the foreground, in the background, or with input and output attached to Factor streams.
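
As a hedged sketch of those three ways of running a process, here are the io.launcher words I would reach for. The commands are arbitrary Unix examples, and a plain command string is used in place of a filled-in process tuple:

```factor
USING: io io.encodings.utf8 io.launcher kernel ;

! Foreground: wait for completion, and throw an error if the
! exit code is non-zero.
"ls -l" try-process

! Background: returns a process object right away.
"sleep 5" run-detached drop

! Streamed: read the child's standard output line by line.
"ls" utf8 [ [ print ] each-line ] with-process-reader
```

Each of these words should also accept a process tuple, which is where the redirection and timeout slots come in.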

The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read from or write to, a file name to append to, or even supply a pipe object, constructed from the io.pipes vocabulary.

<process>
    "rotate-logs" >>command
    +closed+ >>stdin
    "out.txt" >>stdout
    "error.log" <appender> >>stderr

It is possible to specify a timeout when running a process:

<process>
    { "ssh" "myhost" "-l" "jane" "do-calculation" } >>command
    15 minutes >>timeout
    "results.txt" >>stdout

The process will be killed if it runs for longer than the timeout period. Many other features are supported: setting environment variables, setting process priority, and so on. The io.launcher vocabulary has all the details.

Support for timeouts is a cross-cutting concern that touches many parts of the I/O API. This support is consolidated in the io.timeouts vocabulary. The set-timeout generic word is supported by all external resources which provide interruptible blocking operations.
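
For instance, a timeout can be set on a socket stream. This is only a sketch: example.com is a placeholder host of my choosing, and I am assuming that inside with-client the input-stream variable holds the connection's stream:

```factor
USING: calendar io io.encodings.utf8 io.sockets io.timeouts
namespaces ;

! "example.com" is a placeholder, not a host used in this post.
"example.com" 80 <inet> utf8 [
    ! Any read or write that blocks for longer than 30 seconds
    ! should fail with an error instead of hanging.
    30 seconds input-stream get set-timeout
    "GET / HTTP/1.0\r\n\r\n" write flush
    [ print ] each-line
] with-client
```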

Timeouts are implemented on top of our monotonic timer support; changing your system clock while Factor is running won't screw with active network connections.

Unix only: file ownership and permissions

The io.files.unix vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. An essential component of any multi-language quine:

USING: io.encodings.ascii io.files
io.launcher ;

"""echo "Hello, polyglot"
""" "" ascii set-file-contents
OCT: 755 "" set-file-permissions
"./" run-process

There are even more Unix-specific words in the unix.users and unix.groups vocabularies. Using these words enables listing all users on the system, converting user names to UIDs and back, and even setuid and setgid.


Factor's io.sockets vocabulary supports stream and packet-based networking.

Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. When a network socket is constructed, that endpoint is bound to a given address specifier.
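
A quick sketch of constructing each kind of address specifier; the addresses, port and socket path are placeholders:

```factor
USING: io.sockets prettyprint ;

! IPv4 address and port.
"127.0.0.1" 8080 <inet4> .

! IPv6 address and port.
"::1" 8080 <inet6> .

! Unix domain socket, identified by a file system path.
"/tmp/demo.sock" <local> .

! A host name and port, resolved to IPv4 or IPv6 when used.
"localhost" 8080 <inet> .
```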

Connecting to a server, sending a GET request, and reading the result:

USING: io io.encodings.utf8 io.sockets ;

"" 80 <inet> utf8 [
"""GET / HTTP/1.1\r
connection: close\r
""" write flush
] with-client

SSL support is almost transparent; the only difference is that the address specifier is wrapped in <secure>:

USING: io io.encodings.utf8 io.sockets ;

"" 443 <inet> <secure> utf8 [
"""GET / HTTP/1.1\r
connection: close\r
""" write flush
] with-client

For details, see the documentation, and my blog post about SSL in Factor.

Of course you'd never send HTTP requests directly using sockets; instead you'd use the http.client vocabulary.
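
A minimal sketch of the http.client route, assuming http-get leaves a response tuple and the body on the stack. The URL is a placeholder:

```factor
USING: accessors http.client kernel prettyprint ;

! Fetch the page, discard the body, and print the status code.
"http://example.com/" http-get drop code>> .
```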

Network servers

Factor's io.servers.connection vocabulary is so cool that a couple of years back I made a screencast about it. Nowadays the sample application developed in that screencast lives in extra/time-server; the implementation is very concise and elegant.

Under the hood

All of this functionality is implemented in pure Factor code on top of our excellent C library interface and extensive bindings to POSIX and Win32 in the unix and windows vocabulary hierarchies, respectively.

As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently rewrote the coroutines implementation to use direct context switching rather than continuations.
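
To see this from the listener, here is a small sketch using in-thread and sleep from the threads vocabulary; the messages are arbitrary:

```factor
USING: calendar io threads ;

[
    ! sleep suspends only this coroutine, not the whole VM.
    1 seconds sleep
    "background coroutine woke up" print flush
] in-thread

! This runs while the other coroutine is still sleeping.
"main coroutine keeps running" print flush
```

The second message should appear first, with the background one following about a second later.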

Co-ordination and scheduling of coroutines is handled with a series of simple concurrency abstractions.


Unknown said...

an overview of the functinality available

-> functionality

Wolf550e said...

"A more efficient implementation of the above is also possible, by mapping in the file as an int array and then performing bitwise arithmetic on the elements."

What prevents the code in your example from being compiled into the same efficient machine code as bitwise arithmetic on ints?

Anonymous said...

The networking part might need a few minor corrections. You say you're connecting to "", but you actually query And connecting securely to CIA and reporting host Did I miss a joke or something?

JinTheTin said...

Having been a Forth developer years ago, I was interested in using it for a possible project, and came across Factor. Very interesting looking language, however, documentation is obtuse at best, unless I am missing something. I have also not found any third party documentation, which suggests this is more of a personal project. I don't intend to be critical, I am interested in evaluating Factor, just can't seem to find what I need.