Saturday, May 29, 2010

Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython

Together with Daniel Ehrenberg and Joe Groff, I'm writing a paper about Factor for DLS2010. We would appreciate feedback about the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and Python. The performance comparison consists of some benchmarks from the The Computer Language Benchmarks Game. I'm posting the results here first, in case there's something really stupid here.

Language implementations

Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on an 2.4 GHz Intel Core 2 Duo.

Benchmark implementations

Factor implementations of the benchmarks can be found in our source repository:

Implementations for the other languages can be found at the language benchmark game CVS repository:

LuaJITSBCLV8CPython
binary-trees binarytrees.lua-2.luabinarytrees.sbcl binarytrees.javascript binarytrees.python3-6.python3
fasta fasta.lua fasta.sbcl fasta.javascript-2.javascript fasta.python3-2.python3
knucleotide knucleotide.lua-2.luaknucleotide.sbcl-3.sbcl knucleotide.javascript-3.javascriptknucleotide.python3-4.python3
nbody nbody.lua-2.lua nbody.sbcl nbody.javascript nbody.python3-4.python3
regex-dna regexdna.sbcl-3.sbcl regexdna.javascript regexdna.python3
reverse-complement revcomp.lua revcomp.sbcl revcomp.javascript-2.javascript revcomp.python3-4.python3
spectral-norm spectralnorm.lua spectralnorm.sbcl-3.sbclspectralnorm.javascript spectralnorm.python3-5.python3

In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:

--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@

(defun main ()
(declare (optimize (speed 3) (safety 0)))
- (with-open-file (in "/dev/stdin" :element-type +ub+)
- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
(let ((i-buf (make-array +buffer-size+ :element-type +ub+))
(o-buf (make-array +buffer-size+ :element-type +ub+))
(chunks nil))
@@ -72,4 +71,4 @@
(setf start 0)
(go read-chunk))))
end-of-input
- (flush-chunks)))))))
+ (flush-chunks))))))

Running the benchmarks

I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:

./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd

For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:

time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000

For SBCL, I did what the shootout does, and compiled each file into a new core:

ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .

cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
(load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF

sbcl --userinit /dev/null --load nbody.sbcl_compile

cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF

time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000

For CPython, I precompiled each script into bytecode first:

python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"

Benchmark results

All running times are wall clock time from the Unix time command. I ran each benchmark 5 times and used the best result.

FactorLuaJITSBCL V8 CPython
fasta 2.597s1.689s2.105s3.948s 35.234s
reverse-complement2.377s1.764s2.955s3.884s 1.669s
nbody 0.393s0.604s0.402s4.569s 37.086s
binary-trees 1.764s6.295s1.349s2.119s 19.886s
spectral-norm 1.377s1.358s2.229s12.227s1m44.675s
regex-dna 0.990sN/A 0.973s0.166s 0.874s
knucleotide 1.820s0.573s0.766s1.876s 1.805s

Benchmark analysis

Some notes on the results:

  • There is no Lua implementation of the regex-dna benchmark.
  • Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.
  • Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
  • The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.
  • The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.
  • Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
  • For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.
  • The benchmarks exercise insufficiently-many language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout mostly consist of scalar floating point code, and text manipulation only.

Conclusions

Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.

16 comments:

dh said...

For DLS2010, some might be interested to see a performance comparison with Ruby and/or Python. Any data on those?

Slava Pestov said...

dh: good point. I will run the benchmarks in Python tomorrow and update this blog post.

Andy Hefner said...

Odd. When I last hacked on the spectral-norm implementation for SBCL, it was tying with C, Fortran, and all the rest of those in performance. That was close to two years ago; perhaps there's been some regression in the compiler.

Anonymous said...

Perhaps include a pypy comparison (but you already have python which I guess is more meaningful)?

Unknown said...

Pretty much meaningless comparison.

Out of curiosity I took a look at spectral-norm written in Factor at noticed [as soon as my eyes adopted to it] that you are using unboxed arrays (I would not be surprised if compiler had a special support for such arrays). The same is probably true for tuples used in binary-trees benchmark. It appears that you are giving hits to the JIT [I interpret ; inline things like hints, maybe I am wrong]

It also seems to me that Factor itself has a simpler and stricter scoping rules if compared to other JavaScript, Python or (probably) even Lua.

Given all this tricks in the sleeve and given semantic differences it would be just surprising to see slower execution.

Unknown said...

Ouch. I just looked at the nbody benchmark sources.

Please correct me if I am wrong: you are declaring a structure there with typed fields while other dynamic languages use dictionaries to store bodies?

Anonymous said...

Maybe you should use proper statistics rather than taking the fastest time possible. Like run it enough times to do a t-test to show the significance of the difference between measurements.

You say you're the world's greatest systems programmer, but all you academics know how to do is prove academic theorems about the module substructure over nilpotent Lie algebras, and even then you can't manage to generalize it past the two-step case. Your method fails to scale, because in the real world, nilpotent Lie algebras can take as many as ten steps to get to zero. Just like your statistics.

Maciej Fijalkowski said...

Hey. Can you make clear in your comparison that you're comparing CPython and not Python? (Python is a language, CPython is an actual VM).

Also, you don't need a debian package of factor to deploy it on computer language shootout.

Isaac Gouy said...

"computer language shootout"

Please use the title shown on the website - The Computer Language Benchmarks Game

Isaac Gouy said...

> For Python, I precompiled each script into bytecode first

Did you check if doing so gave faster or slower timings?

Isaac Gouy said...

> Implementations for the other languages can be found at ...

The benchmarks game website often shows more than one program per language per benchmark - how is anyone to know which programs you timed?

Anonymous said...

multiprocessing in Python? For what? O_o

Isaac Gouy said...
This comment has been removed by the author.
Isaac Gouy said...
This comment has been removed by the author.
Isaac Gouy said...

> can be found at the language benchmark game CVS repository

The column headings seem to have gotten mixed up - they don't match the languages linked to in the cells.

Maybe some redundancy could be removed from the table - the column of benchmark names?


> Benchmark results

The table contents seem completely disorganised.

Easier to read if there's some order imposed:

- pick a language, and sort the benchmark rows on the time taken for that language
- pick the benchmark program that took the longest or shortest time for that language, and sort the language cols on the time taken for that benchmark

Matthew Wilson said...

the links to the javascript program implementations are several links to the several-years-old editions; some are much-improved since then. This makes me wonder whether your test used these older/slower editions.