Before I began optimizing today, current Factor ran the benchmark in 140 seconds; a slight improvement from all the compiler work between 0.83 and 0.88. Still unacceptable, however. The first thing I did was reimplement how line reading is done. Previously, the
stream-readln word would call stream-read1 to read data a character at a time and assemble lines. While input is buffered, this nevertheless incurred considerable overhead from generic dispatch, bounds checks on the I/O buffer, and other sources. To rectify this, I added a new
stream-read-until generic word, which reads data from a stream until it reaches a character that appears in a given sequence of separators. The Unix native I/O code has an efficient implementation of this word that scans the input buffer for separators directly, without a large amount of indirection and redundant work on every iteration. I rewrote the line reading code to call this word with a separator sequence of "\r\n" (plus some extra logic to handle Windows-style \r\n line endings). This reduced the time to read the file line by line from 90 seconds to 3, and the total benchmark runtime from 140 seconds to 80 seconds.

A nice improvement, but where was the remaining time being spent? It turns out the compiler was not inferring all the type information that it could. Right now, the Factor compiler only infers types inside a word definition, and assumes that non-inlined calls to other words always produce objects of the most general type (with a few exceptions for primitives and certain low-level words of which the optimizer has special knowledge). While I plan on adding global type inference in 0.89, for now marking the
<sbuf> constructor word inline was sufficient to give the compiler enough type information in the inner loop of the benchmark to cut the run time down to 33 seconds.
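To make the buffer-scanning idea behind stream-read-until concrete, here is a rough sketch in Python rather than Factor. The class and method names (BufferedReader, read_until, readln) are illustrative inventions, not Factor's actual implementation; the point is that the loop scans the buffer directly for separators instead of dispatching a generic per-character read.

```python
# Sketch of the technique behind stream-read-until: scan the buffered
# input directly for any separator character, instead of making a
# dynamically-dispatched read1 call per character. All names here are
# hypothetical; this is not Factor's real stream protocol.

class BufferedReader:
    def __init__(self, data):
        self.data = data   # stands in for the underlying buffered input
        self.pos = 0

    def read_until(self, separators):
        """Return (text before separator, separator char or None at EOF)."""
        start = self.pos
        for i in range(start, len(self.data)):
            if self.data[i] in separators:
                self.pos = i + 1
                return self.data[start:i], self.data[i]
        self.pos = len(self.data)
        chunk = self.data[start:]
        return (chunk if chunk else None), None

    def readln(self):
        """Read one line, treating \\n, \\r and \\r\\n as line endings."""
        line, sep = self.read_until("\r\n")
        if sep == "\r":
            # A following \n belongs to the same Windows-style ending.
            if self.pos < len(self.data) and self.data[self.pos] == "\n":
                self.pos += 1
        return line

reader = BufferedReader("first\r\nsecond\nthird")
print(reader.readln())  # first
print(reader.readln())  # second
print(reader.readln())  # third
```

The win comes from the inner loop being a tight scan over a flat buffer; the per-character generic dispatch and bounds checking happen zero times per character instead of once.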
33 seconds is still far too slow for this problem; as I wrote in the original weblog entry where I noticed the lousy I/O performance, Java runs the benchmark in 3 seconds on my machine. However, I was able to cut the runtime by more than a factor of four with relatively little effort (and no changes to the benchmark itself, other than a cleanup which only slightly improved performance). Also, I/O is no longer the major bottleneck: some 27 seconds are now spent simply reversing sequences.
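For context, the benchmark in question is the shootout's reverse-complement task: each DNA sequence is complemented and then reversed, so whole-sequence reversal sits squarely in the inner loop. A minimal Python sketch of the operation (the translation table follows the standard IUPAC nucleotide complements; this is not the Factor benchmark code):

```python
# Reverse-complement, the core operation of the shootout benchmark:
# map each nucleotide code to its complement, then reverse the result.
COMPLEMENT = str.maketrans("ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
                           "TGCAAKYWSRMBDHVNTGCAAKYWSRMBDHVN")

def reverse_complement(seq):
    # translate() does the complement; the [::-1] slice is the reversal
    # that, in Factor's case, accounts for some 27 seconds of runtime.
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGT"))     # ACGT (its own reverse complement)
print(reverse_complement("AAAACCC"))  # GGGTTTT
```

In CPython both translate and the slice are C loops over flat byte-like data, which is exactly the kind of work a naive generic-sequence reverse pays dispatch overhead for on every element.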
If you look at the performance of other languages on this benchmark in the computer language shootout, you'll notice that the languages that do well either delegate all the hard work to C, despite being rather slow themselves (Perl, Python), or do not suffer from the overhead of dynamic type dispatch (bigForth, because Forth is inherently untyped, and the SBCL benchmark implementation, which completely bypasses CL line reading functions and switches off safety checks with declarations).
In Factor's case, not only are I/O and sequence operations written in high-level Factor code, but the language is highly dynamic, and the optimizer is still not particularly sophisticated, despite continuous improvements. I'm confident, however, that within a few months Factor will match bigForth's performance in most cases, and at that point I might consider submitting Factor to the language shootout.
A curious note: there are two bigForth programs listed for the reverse-complement benchmark, and they are almost identical, yet one does extremely well while the other is the slowest of all the language implementations by an order of magnitude. I wonder what's going on there.