Friday, January 19, 2007

Faster performance on the reverse-complement benchmark

A while ago, I did some Factor I/O benchmarking. At the time (around version 0.83), Factor's performance on the benchmark was poor, with a total run time of 170 seconds, 120 of which were spent reading the input file line by line. The input file in question is 25 Mb in size, consisting of 416671 lines of text.

Before I began optimizing today, current Factor ran the benchmark in 140 seconds; a slight improvement from all the compiler work between 0.83 and 0.88. Still unacceptable, however. The first thing I did was reimplement how line reading is done. Previously, the stream-readln word would call stream-read1 to read data a character a time, and form lines. While input is buffered, this nevertheless incurred considerable overhead, because of generic dispatch, bounds checks on the I/O buffer, and other factors. To rectify this, I added a new stream-read-until generic word which reads data from a stream until a character is reached which appears in a given sequence of separators. The Unix native I/O code has an efficient implementation of this word which scans the input buffer for separators, without a large amount of indirection and redundant work on every iteration. I rewrote the line reading code to call this word with a separator sequence of "\r\n" (and some extra logic to handle Windows-style \r\n line endings). This reduced the time to read the file line by line from 90 seconds to 3, and the total benchmark runtime went from 140 seconds to 80 seconds. A nice improvement, but where was the remaining time being spent? Turns out the compiler was not inferring all the type information that it could. Right now, the Factor compiler only infers types inside a word definition, and assumes that non-inlined calls to other words always produce objects of the most general type (with a few exceptions for primitives and certain low-level words of which the optimizer has special knowledge). While I plan on adding global type inference to 0.89, for now marking the <sbuf>: constructor word inline was sufficient to give the compiler enough type information in the inner loop of the benchmark to cut the run time down to 33 seconds.

33 seconds is still far too slow for this problem - as I wrote in the original weblog entry where I noticed the lousy I/O performance, on my machine Java runs the benchmark in 3 seconds. However I was able to increase the runtime by a factor of 3 with relatively little effort (and no changes to the benchmark, other than a cleanup which only slightly improved performance). Also I/O is no longer the major bottleneck. Instead, some 27 seconds are spent simply reversing sequences.

If you look at the performance of other languages on this benchmark in the computer language shootout, you'll notice that the languages that do well either delegate all the hard work to C, despite being rather slow themselves (Perl, Python) or do not suffer from the overhead of dynamic type dispatch (bigForth because Forth is inherently untyped, and the SBCL benchmark implementation which completely bypasses CL line reading functions and switches off safety checks with declarations).

In Factor's case, not only are I/O and sequence operations are written in high-level Factor code, but the language is highly dynamic, and the optimizer is still not particularly sophisticated, despite continuous improvements. I'm confident however that within a few months, Factor will match bigForth's performance in most cases, and at that point I might consider submitting Factor to the language shootout.

Also a curious note is that there are two bigForth programs listed in the reverse complement benchmark; both are almost identical, except one program does extremely well, and another one is the slowest out of all language implementations by an order of magnitude. I wonder what's going on there.

3 comments:

Anonymous said...

The difference in performance could be because the second Forth implementation allocates the entire space for the buffer before anything actually gets run. The other version looks like it doesn't do this.

btw, congrats on fixing up IO performance, seems like that's been bugging you lately. :)

Unknown said...

What's going on in the fast bigForth implementation is a reimplementation of line reading, using buffers. The slow one uses a word called read-line, which, I'm guessing, is responsible for the slowness. This is gonna be tough, getting good performace for factor while maintaining the use of readln.

Slava Pestov said...

Dan, like I said I/O only accounts for about 5 seconds of the total runtime of the benchmark. So readln is not a performance problem at all. The slowdown comes from the actual processing done by the benchmark, and this is something I'm going to address over time as the code generator improves and sequence operations become more efficient.