Factor: a practical stack language: September 2008

Tuesday, September 30, 2008

New and improved Factor web applications

I've been working on Factor's "Furnace" web framework lately, fixing bugs, adding missing features and writing docs. I've also put up some web applications to demonstrate the framework. There is one new web app:

concatenative.org - the concatenative languages wiki. This wiki is the most elaborate Factor web application at this point in time, and it would not have been possible without some of the nice libraries that we've been working on lately. It showcases Doug Coleman's RDBMS library and his "Farkup" markup language, as well as the Furnace authentication code which supports login over SSL. Despite all of the functionality, the source code is pretty compact. I'm interested in using the wiki to collect information about concatenative languages in general, as well as making Factor's web presence more comprehensive. For example, there is now a public to do list.

I also worked on two web apps which we had before, but have been out of commission for a while; they're back and better than ever:

docs.factorcode.org - this is a new web app for browsing the Factor documentation. We've had this feature before, but it was implemented differently; it would render the docs from the HTTP server's image. While this was a neat demo of Factor's reflection features, it required loading the help system on the server, and it would not reflect the latest documentation. With the new setup, I generate the docs offline from an image where everything is loaded, then upload the HTML files (150Mb worth!) to the web site. A small web app takes care of the search feature but most of the content is static. I'm going to hook this into the build farm so that one of the machines is tasked with uploading the very latest docs on a regular basis.
gitweb.factorcode.org - this is a gitweb instance for browsing the Factor repository online. There's nothing new here, other than the fact that it's using the CGI support from the latest Factor web server.

Finally, two existing web apps which have been running for about a year now have been ported to the new web framework and have received minor facelifts as a result:

paste.factorcode.org - the new pastebin. The pastebin is intended primarily for IRC users chatting and sharing code in #concatenative. Unlike the old version, which stored pastes in memory, pastes are now persisted in a database and we have backups going; there is also an admin interface I can use to delete spam. Otherwise, there is no change in functionality.
planet.factorcode.org - the new blog aggregator. The main improvement is that there is an web-based admin interface, so I can add and remove blog feeds without having to log into the server, hack the source, and reload.

The main Factor web site is still running on Factor 0.91 from a year ago; I will be overhauling this and migrating it to the new server soon.

Factor's libraries for writing web applications have matured significantly over the last year, but there is still a lot of work that remains. Still, I'm pretty excited about the possibilities here. Few other languages have a "full stack" web framework like Factor does.

Saturday, September 20, 2008

Mac OS X x86-64 binary package available; Factor now supports 12 platforms!

About a year ago, Daniel Ehrenberg did the initial port of Factor to 64-bit Mac OS X, however the UI did not work. This is because Factor's Cocoa binding used the Objective C 1.0 API, and when Apple introduced Objective C 2.0 in Mac OS X 10.5, they only implemented backwards compatibility for the old API on 32-bit platforms. However, just recently Joe Groff updated our Cocoa binding for the new Objective C 2.0 API, and after I fixed a few bugs in the FFI on x86-64, the Factor UI now works!

I set up one of our build machines to build Mac OS X x86-64 binaries. This brings the total number of fully supported platforms to 12 -- where "fully supported" means that we have a machine in the build farm which loads all libraries, runs unit tests and benchmarks, and uploads binaries if everything passes.

It also means the end of the road as far as OS X 10.4 support is concerned. The Cocoa bridge no longer supports Objective C 1.0. However, due to popular demand we may set up a build machine with OS X 10.4 which will build Factor binaries with the X11 UI.

If you haven't yet, go download Factor and give it a spin. It's fun!

Saturday, September 13, 2008

Improved performance on the raytracer benchmark

The new optimizer was missing two optimizations performed by the old one; arithmetic identity optimizations, and modular arithmetic optimizations. This brings the total number of optimization passes to 11 (def-use analysis runs twice).

The first optimization involves removing such redundant code sequences as 0 +, 1 * and 0 bitor. The old optimizer performed this unconditionally, which was incorrect since most of these identities fail to hold for floats. For example, -0.0 0 + is 0.0, so in this case 0 + is not a no-op, and 1.0 0.0 / 0.0 * is NaN, so 0.0 * does not always yield zero. However, they are valid for integers, so the new optimizer only applies these optimizations to integer-valued expressions.

The second optimization is described in the blog post I linked to.

I also found and fixed a code generator regression which has been there for ages, and a type inference regression which appeared with the new code generator. Fixing these two regressions and implementing the two missing optimizations above improved performance of benchmark.raytracer by almost 2x; the run time went from 12 seconds to 6.5 seconds on my 2.4 GHz Core 2 Duo MacBook Pro.

The Java version runs in 1.1 seconds with the server VM (I modified the code to run the benchmark 10 times in a loop to give the JIT a chance to do some runtime profiling, and I measured the run time internally so as to avoid JVM startup overhead). The Common Lisp version runs in 3.5 seconds under SBCL. Note that I'm running the benchmark with n=200 and level=3.

Factor is slowly inching towards the performance offered by other compilers. Hopefully one day I'll be able to close the gap.

Friday, September 12, 2008

Sorting strings with embedded numbers

Suppose you have a list of file names:

picture1.txt
picture2.txt
picture3.txt
picture4.txt
picture5.txt
picture6.txt
picture7.txt
picture8.txt
picture9.txt
picture10.txt

Lexicographic order will sort picture10.txt before picture2.txt, which is not what you want. The correct solution is to recognize that the strings contain numbers, and special-case their comparison.

Doug Coleman wrote a Factor solution last December.

Today I was revisiting his code and noticed that it could be simplified significantly using Chris Double's amazing parser expression grammars library:

USING: peg.ebnf math.parser kernel assocs sorting ;
IN: sorting.human

: find-numbers ( string -- seq )
    [EBNF Result = ([0-9]+ => [[ string>number ]] | (!([0-9]) .)+)* EBNF] ;

: human-sort ( seq -- seq' )
    [ dup find-numbers ] { } map>assoc sort-values keys ;

I checked the code into the repository; you can USE: sorting.human to use it.

A bit of explanation is in order.

The PEG [0-9]+ => [[ string>number ]] matches one or more digits (in the range 0 to 9), and applies the action string>number (a Factor word which does what the name says).

The PEG (!([0-9]) .)+ matches one or more characters which are not digits.

The PEG idiom (A | B)* matches a sequence of zero or more A's and B's, which is exactly what we do here with the above two PEGs.

The PEG parser is wrapped in [EBNF ... EBNF], which is "inline PEG notation"; it expands into code which takes a string as input and returns a parse result as output. So calling find-numbers splits out any embedded numbers in the string.

The phrase [ dup find-numbers ] { } map>assoc builds an association list from a sequence where each sequence element is associated with a tokenized string. We use sort-values to sort the association list by comparing values, then use keys to extract the sequence elements, now in sorted order. Beautiful.

I solved the same problem in Java several years ago. The result was not pretty at all:

public static int compareStrings(String str1, String str2, boolean ignoreCase)
{
    char[] char1 = str1.toCharArray();
    char[] char2 = str2.toCharArray();

    int len = Math.min(char1.length,char2.length);

    for(int i = 0, j = 0; i < len && j < len; i++, j++)
    {
        char ch1 = char1[i];
        char ch2 = char2[j];
        if(Character.isDigit(ch1) && Character.isDigit(ch2)
            && ch1 != '0' && ch2 != '0')
        {
            int _i = i + 1;
            int _j = j + 1;

            for(; _i < char1.length; _i++)
            {
                if(!Character.isDigit(char1[_i]))
                    break;
            }

            for(; _j < char2.length; _j++)
            {
                if(!Character.isDigit(char2[_j]))
                    break;
            }

            int len1 = _i - i;
            int len2 = _j - j;
            if(len1 > len2)
                return 1;
            else if(len1 < len2)
                return -1;
            else
            {
                for(int k = 0; k < len1; k++)
                {
                    ch1 = char1[i + k];
                    ch2 = char2[j + k];
                    if(ch1 != ch2)
                        return ch1 - ch2;
                }
            }

            i = _i - 1;
            j = _j - 1;
        }
        else
        {
            if(ignoreCase)
            {
                ch1 = Character.toLowerCase(ch1);
                ch2 = Character.toLowerCase(ch2);
            }

            if(ch1 != ch2)
                return ch1 - ch2;
        }
    }

    return char1.length - char2.length;
}

This code really sucks. The solution is very imperative and low-level; to understand what it does, you have to execute the code in your head and keep track of the various loops, early exits, and mutable local variables. I no longer try to solve problems in these low-level terms. With Factor I can solve the problem without "playing CPU". A lot of people, when they're first starting out with Factor, try to write code much like the above using a stack. Predictably, the stack shuffling gets out of hand. While the above algorithm could be implemented quite easily in Factor using locals, it would still be 20 times longer and more complicated than the high-level solution using PEGs.

Understanding the various abstractions and libraries which are available is key to grasping Factor: collections, generic words, fry, locals, macros, memoization, PEGs, the prettyprinter, and so on. Making effective use of these tools can reduce the amount of work required to solve a problem by an order of magnitude. This is why I invest a lot of effort into writing documentation, as well as the help system itself. Making Factor easy to learn and explore is a high priority for me.

In the Java community, there is a certain derision against "API monkeys": programmers who always reach for a library without being able to solve a problem using just the language itself. Factor programmers on the other hand should be API monkeys and proud of it!

Monday, September 08, 2008

OpenBSD cannot allocate more than 1Gb of RAM per process

It's 2008, and OpenBSD has a limit of 1Gb of RAM per process, even on 64-bit architectures.

This has been discussed many times over the years, and apparently the OpenBSD developers don't consider fixing this to be a priority, it has gone neglected for 4 years.

Pretty sad, really. I guess OpenBSD is great as a router on a 486, but for workstation or server use I recommend looking into NetBSD, FreeBSD or Linux instead.

New location for Factor downloads

The Linode virtual server hosting http://factorcode.org only has 10Gb of disk space. Over the last 7 months, almost daily uploads of binary packages from 11 platforms has done a pretty good job of filling it up.

To rectify this, we have moved the download files over to downloads.factorcode.org, which is a hosted by DreamHost shared hosting. This account has more than 300Gb available and we're not likely to exhaust it any time soon. The binaries page has not moved but the links on it now point to the new location.

The Factor web site continues to run on a Linode; having root access is very useful, because for one, we run the Factor HTTP server on there. Other than a few glitches caused by the CGI support, the Factor HTTP server handles the load very well; it could handle serving all the downloads just fine too, if it weren't for the disk space issue.

The binaries page is an .fhtml script, meaning it is an HTML file with embedded Factor. Formerly, it would look at the download directories, find the latest package for each platform, and compute the table of links on every request. Now that downloads are not hosted on the same server as the web page, I had to replace this with a more complex setup. A Factor script runs as an hourly cron job under our DreamHost account; it generates this table and spits it out at "http://downloads.factorcode.org/downloads.html. The binaries page then uses the http.client vocabulary to download this file and include it. This means that for every HTTP request to the binaries page, our Linode server makes an HTTP request to the DreamHost box; if this becomes a problem, I'll investigate some kind of caching, but I don't think it will.

I put the code for this script up in our pastebin; it is not very pretty, but note the "shebang" line at the top of the file, making executable from the shell.

This is all part of a pretty complicated distributed system written in Factor. There are 11 machines (some virtual) uploading binary packages; a Factor script on one machine tabulates them, and a Factor HTTP server on another machine receives the results of this tabulation and serves up the dynamically-generated downloads page where users can download the binaries. The developers just push patches, without worrying about making releases, updating the web site, and now, disk space filling up.

There is one final improvement to the build infrastructure. When a binary package is uploaded, it is uploaded with a name ending with .incomplete; only once the upload succeeds is it renamed to its final name. This prevents the binaries page from being generated with links to incomplete packages. This was a problem before; downloading a binary package at the wrong time could give you a truncated tarball. This won't happen again.

Saturday, September 06, 2008

Solution for slow localhost lookups with getaddrinfo on OS X

I noticed that sometimes, running \ foo edit would take several seconds to execute; sometimes as long as a minute. On my laptop, the editors.jedit vocabulary is loaded, meaning the edit hook connects to the running jEdit instance over a socket on localhost, transmitting the file name and line number which defines foo.

After fiddling around for a while with watch and making it print messages when a word was called and returned, I noticed my observing the messages that it was blocking in name resolution. And indeed, resolving localhost seemed to take a while:

( scratchpad ) [ "localhost" 1234  resolve-host drop ] time
==== RUNNING TIME

194 ms

One would expect it to return instantly, but obviously, there's a DNS query happening here, which should not be the case since localhost is defined in /etc/hosts.

Factor uses getaddrinfo() for name resolution. I tried the older, deprecated gethostbyname() function:

( scratchpad ) FUNCTION: void* gethostbyname ( char* name ) ;
( scratchpad ) [ "localhost" gethostbyname drop ] time
==== RUNNING TIME

0 ms

No performance problem there.

The next step was to look at the tcpdump output:

05:50:05.340667 IP 192.168.1.105.62560 > cns.westlandrdc.mi.michigan.comcast.net.domain: 27262+ SRV? _http._tcp.localhost. (38)
05:50:05.374190 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62560: 27262 NXDomain 0/1/0 (113)
05:50:05.374492 IP 192.168.1.105.62561 > cns.westlandrdc.mi.michigan.comcast.net.domain: 518+[|domain]
05:50:05.424012 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62561: 518 NXDomain 0/1/0 (138)
05:50:05.424568 IP 192.168.1.105.62562 > cns.westlandrdc.mi.michigan.comcast.net.domain: 14115+ SRV? _http._tcp.localhost.mn.comcast.net. (53)
05:50:05.474268 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62562: 14115 NXDomain 0/1/0 (105)
05:50:05.474767 IP 192.168.1.105.62563 > cns.westlandrdc.mi.michigan.comcast.net.domain: 56054+ SRV? _http._tcp.localhost.comcast.net. (50)
05:50:05.530655 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62563: 56054 NXDomain 0/1/0 (102)

As you can see, Factor is making some SRV DNS queries here. This made me think of the servname parameter to getaddrinfo(). We always set this to "http", since I found that on FreeBSD, setting it to a string representation of an integer does not work. However, on Mac OS X 10.5, it appears that DNS queries are always made if this field is set, even if the host name is listed in /etc/hosts. Changing the resolver code to pass a NULL pointer as servname seems to fix the problem, and I verified it to work on Windows, FreeBSD, OpenBSD, NetBSD and Linux.

A minor refactoring later, and its as fast as it should be, with no network traffic generated:

( scratchpad ) [ "localhost" 1234  resolve-host drop ] time
==== RUNNING TIME

0 ms

So perhaps this is completely obvious to some people, but I never realized that setting servname to a non-NULL value would cause DNS queries to be performed when resolving localhost. Certainly this isn't documented anywhere, at least not that I can see.

In conclusion, if you are developing a program which communicates with another service running on the same machine via TCP/IP or UDP/IP, and you notice that getaddrinfo() blocks for a long time, try changing the code to pass NULL for servname.

Improved tuple literal syntax

Suppose we define a tuple class:

TUPLE: color red green blue ;

We can make a new instance as follows:

1 2 3 color boa

Or like this:

color new
    1 >>red
    2 >>green
    3 >>blue

You can also write a tuple literal. This creates the tuple at parse time, not at compile time, so every time the surrounding quotation evaluates, the same tuple is pushed on the stack:

T{ color f 1 2 3 }

Notice how when constructing a tuple, we have a choice between either positional slot values (boa, which means by order of arguments) or named slot values (using new, which makes an empty tuple). However no such possibility exists with the literal syntax, which only supports positional slot values.

Formerly, the first slot in the tuple was the delegate, and this was almost always f. After removing delegation, I kept the f as part of the syntax has been removed, because I didn't want to go through and update all usages of literal tuples in the source tree to not contain the f. That's a lot of pain for little gain.

What I did instead was turn the tables and change the meaning of the f. Now, if the first token after the class name is f, the remaining tokens are the slot values in positional order. If it is any other token, then the slot values are a list of name/value pairs, like so:

T{ color { red 1 } { green 2 } { blue 3 } }

A syntax for literal tuples where the slots are named was originally proposed by Daniel Ehrenberg; however his syntax was different:

T{ color red: 1 green: 2 blue: 3 }

His syntax is more concise, but it was inconsistent with hashtable syntax. However he got the ball rolling and made us all think about possible improvements to tuple literal syntax.

Originally I was going to introduce the new syntax and remove the old one, but Eduardo mentioned on the mailing list that he likes the "BOA" literal syntax sometimes too, for shorter tuples with fewer slots (such as colors above). So at that time, we came up with this trick of changing the meaning of the f.

I think this is very clever, because it creates less work for me than removing all occurrences of the initial f from tuple literals, and it creates a new way to describe literal tuples which have many slots, or where only a few slots need to be set, which is more readable than the BOA syntax. And of course the BOA syntax is still there in the cases where it makes sense too.

Six months ago, I set out a roadmap for language evolution. Multi-methods still remain, and they're coming up soon. I proposed a syntax for multimethods a few months ago and since then I've thought about it some more and the final syntax will likely be different. I will blog about this in detail when my thoughts on the matter are more concrete.

Tuesday, September 02, 2008

Clearing out the dead wood

Old accessors removed

Back in the day, when you defined

TUPLE: person name age ;

You'd get a set of words person-name, person-age, set-person-name, set-person-age. The "new accessors" language change instead defined methods on the words name>>, age>>, >>name, >>age. More concise code, and more opportunities for polymorphism (you can write a word which calls name>> and pass it anything with that slot, not just a person) and less stack shuffling, since the new setter effect seems better in practice.

While most new code used new accessors, we still had some older code which used the old accessors. With very little help from me, Doug and Ed updated all of the remaining code in the repository to use new accessors and even removed them from the core.

Removing old accessors reduced the image size from 43Mb to 36Mb on Mac OS X x86, and correspondingly on other platforms. The reduced size of 36Mb seems large, however this is the default development image and it includes a lot of stuff, such as all the documentation, tools, and the UI. Nevertheless, it is still on the large side and I'm sure I will find tricks in the future to reduce its size. It is just that for the most part my focus has been on performance, features and bug fixes. Optimizing Factor for memory-constrained environments is another whole sub-project, something I'll tackle eventually.

In addition to image size, long bootstrap time is another contributing factor (pun intended) the impression some people have of Factor being bloated. Bootstrap time has improved recently too; the new optimizer is a lot faster than the old one, and removing old accessors reduced the number of words that need to be compiled which brought it down another notch. A week ago bootstrap took 8 minutes on my MacBook, and now it is down to around 4 minutes 20 seconds. Another 2x improvement would be nice.

New equality semantics on numbers

A change I've been contemplating for a while is changing the semantics of equality on numbers to only consider two numbers equal if they have the same value and the same type. I finally applied this change today. This means that 3 3.0 = would evaluate to true before, but now it is false. If you want the old semantics, the number= word still exists; 3 3.0 number= evaluates to true.

Performance improvements

The main reason I made the above change is to improve performance. First of all, there is less dispatching when numbers are compared for equality. Second of all, it helps with type inference of complex number arithmetic. This change implies that constructing a complex number with an imaginary part equal to floating point zero outputs a new complex number. Because of this, arithmetic operations on complex number inputs with floating point components always output complex numbers with floating point components, with no possibility of outputting a real number. This eliminates unnecessary dispatch.

Consider this word from the mandelbrot benchmark:

: iter ( c z nb-iter -- x )
    dup 0 <= [ 2nip ] [
        over absq 4.0 >= [ 2nip ] [
            >r sq dupd + r> 1- iter
        ] if
    ] if ; inline recursive

The word was called from one place:

c C{ 0.0 0.0 } 60 iter

Before I changed equality semantics on numbers, there was very little that the compiler could do to optimize this, because all of the arithmetic operations would either output real or complex numbers. But now, since the inputs are known to be complex numbers with floating point components, so are the outputs, so all of the generic arithmetic dispatch can be eliminated. Furthermore, the new escape analysis pass unboxes all intermediate complex numbers. After inlining, dispatch elimination and unboxing, the above word turns into this monstrosity:

( scratchpad ) [ { float float } declare  C{ 0.0 0.0 } nb-iter iter ] optimized.
[
    0.0 0.0 40 \ ( gensym ) [
        dup 0 fixnum<=
        [ ( 1262 1263 1264 1265 10 -- 10 ) ] [
            ( 1264 1265 10 -- 1264 1265 10 1264 1265 )
            >r r>
            >r dup float* r>
            dup float* float+ 4.0 float>=
            [ ( 1262 1263 1266 1267 18 -- 18 ) ] [
                >r
                2dup
                ( 1278 1279 1280 1281 -- 1278 1279 1280 1281 1278 1279 1280 1281 )
                ( 1286 1287 1288 1289 -- 1286 1288 1289 1287 )
                ( 1292 1293 1295 -- 1292 1295 1293 )
                >r
                >r
                >r r>
                r>
                r>
                >r r>
                float*
                >r float* r>
                float-
                ( 1282 1283 1284 1285 769 -- 769 1282 1283 1284 1285 )
                ( 1322 1323 1324 1325 -- 1322 1324 1325 1323 )
                ( 1328 1329 1331 -- 1328 1331 1329 )
                >r
                >r
                >r r>
                r>
                r>
                >r r>
                >r float* swap r>
                float* float+
                ( 1262 1263 1362 1363 -- 1262 1263 1262 1263 1362 1363 )
                ( 1366 1367 1368 1369 -- 1366 1368 1369 1367 )
                ( 1372 1373 1375 -- 1372 1375 1373 )
                >r
                >r
                >r r>
                r>
                r>
                >r r>
                float+
                >r float+ r>
                r>
                1 fixnum-fast ( gensym )
            ] if
        ] if
    ] label
]

Other than the stack shuffles, we see that the only words which remain are primitives such as float+ and float* so in fact, unboxed float arithmetic is taking place here.

The ( ... -- ... ) syntax is used by the prettyprinter for compiler IR to express stack shuffles for which no corresponding word exists. These come up as a result of other optimizations. In fact the overwhelming majority of the stack shuffles above are redundant, and they are eliminated by the code generator pass which runs after the optimizer. Compare with copy propagation and coalescing in compilers for applicative languages; optimizers are written to freely introduce redundant copies of values, with the knowledge that a subsequent pass eliminates them, because it simplifies the implementation.

So indeed, the above output is close to optimal as far as the optimizer is concerned. The code generator could still do register allocation more efficiently and generate better assembly sequences for float arithmetic, but this is a project for later.

The mandelbrot benchmark now runs in about 350 milliseconds, compared to 1.8 seconds before. This is a huge improvement, and it was all thanks to the new optimizer and the minor change to numeric equality semantics. I remember when the mandelbrot benchmark would take 20 seconds to run; it is one of the oldest benchmarks for Factor and eliminating all of the memory allocation and dispatch from the inner loop above has literally taken 4 years of work.

Factor: a practical stack language