Saturday, August 29, 2009

Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster

My struct arrays benchmark generated a fair amount of discussion on reddit, with some people disputing the benchmark's validity. Certainly I'm not claiming that Factor is faster than Java HotSpot in general (on most tasks it is slower, sometimes much more so), however I think the benchmark legitimately demonstrates the performance advantage of value semantics over reference semantics.

A few people pointed out that the Java version of the benchmark was spending a lot of time in trigonometric functions, and that Java computes sin and cosine "by hand" rather than using x87 FPU instructions. However, the same is also true of the Factor implementation, its just that none of the Java advocates bothered to check. I don't use x87 instructions either, and call into libc for sin and cos, just like Java. Indeed, Factor's sin and cos are even more heavyweight than Java's, because they also support complex numbers; Factor's compiler converts them to their real-valued equivalents if it can prove the inputs are real.

Despite this, I modified the benchmark to not call trig functions, and instead just initialize each point as (n+1,n+2,(n+3)*(n+3)/2). Factor is still faster by roughly the same margin as before:
Java829ms (best run out of 8)
Factor284ms

A few people also mentioned that by only running the test 8 times I wasn't giving HotSpot enough time to "warm up". However, the benchmark does not make any polymorphic calls in the inner loops, so there's really no "warm up" needed at all; and indeed I got the best time on the third iteration.

I improved Factor's code generation for trigonometric function calls. Trig function calls are treated like instructions by the low-level optimizer now, which means they participate in value numbering and do not split basic blocks. Instead, the register allocator spills all live registers at the call site at the very end of code generation. This strategy does not work in general for all FFI calls, because the case where the FFI calls invokes a Factor callback needs additional runtime support. However, neither sin nor cos do this. After implementing this, I noticed that my struct-arrays benchmark was calling sin() twice on the same input, and value numbering was folding the second call away. Neat optimization to get "for free".

This code generation change improves performance of both the struct-arrays and partial-sums benchmarks:
BenchmarkBeforeAfter
struct-arrays1483ms749ms
partial-sums1233ms938ms

Here is the disassembly for benchmark.struct-arrays with this optimization performed, and this is the patch.

For what its worth, Java HotSpot runs the partial-sums benchmark in 1293ms. Hopefully HotSpot's trig function call performance will receive some attention at some point.

5 comments:

Isaac Gouy said...

> However, the same is also true of the Factor implementation, its just that none of the Java advocates bothered to check.

But perhaps not the same for those implementations in other languages that you solicited ;-)


> A few people also mentioned that by only running the test 8 times I wasn't giving HotSpot enough time to "warm up". However, the benchmark does not make any polymorphic calls in the inner loops...

But it might not steady until Run #12 - spend two whole minutes and run it 100 times!

Run #9
[GC 195805K->195589K(566272K), 0.4080990 secs]
[GC 306629K->306899K(566272K), 0.3254310 secs]
[Full GC 306899K->111173K(574848K), 0.2963310 secs]
0.7071067, 0.7071122, 1.0
Time: 1497
Run #10
[GC 195738K->195589K(574848K), 0.4653980 secs]
[GC 306629K->306908K(574848K), 0.3698520 secs]
0.7071067, 0.7071122, 1.0
Time: 1352
Run #11
[GC 391433K->195877K(574848K), 0.0138990 secs]
[GC 306917K->306996K(574848K), 0.1877960 secs]
0.7071067, 0.7071122, 1.0
Time: 671
Run #12
[GC 391498K->195965K(574848K), 0.0212990 secs]
[GC 307005K->307092K(574848K), 0.1878710 secs]
0.7071067, 0.7071122, 1.0
Time: 675
Run #13

Justin Sher said...

I redid the Scala port without trig:

http://paste.factorcode.org/paste?id=850

Here's the run time for Scala:

java -Xmx512m -server -cp ./target/test-classes/:./target/work/webapp/WEB-INF/lib/scala-library-2.7.3.jar bench.factor.Silly | grep Time

Time: 600
Time: 442
Time: 454
Time: 455
Time: 468
Time: 465
Time: 464
Time: 464
Time: 465

Here's the run times for Java without Trig:

java -Xmx512m -server -cp ./target/test-classes/ bench.factor.SillyJava | grep Time

Time: 1816
Time: 1623
Time: 1334
Time: 1196
Time: 1190
Time: 1299
Time: 1384
Time: 1416

Seems like a big win for Scala IMHO as it's about 2.7x faster than Java. I did write the Scala version in a functional programming style, thus it's not a pure apples to apples comparison.

Twan van Laarhoven said...

> This strategy does not work in general for all FFI calls, because the case where the FFI calls invokes a Factor callback needs additional runtime support.

The Haskell FFI allows for a declaration to indicate that an imported function does not invoke a callback. Perhaps something similar could be used for factor?

Slava Pestov said...

Twan van Laarhoven: yes, that's the plan.

Unknown said...

just a side note, I like this: "Java is not slower because you are faster, we are slower because it's unfair" :)