So whereas the old C structure support didn't have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.
What makes struct classes even more useful in performance-sensitive code is the related struct-arrays vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a Factor sequence. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, ...) to be used on scalar data with a very efficient in-memory representation.
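The struct-arrays idea can be approximated in Java (a hedged sketch with illustrative names, not a real library; Factor generates this kind of code generically via metaprogramming):

```java
// Sketch: a "struct array" of (x, y, z) points backed by one flat float
// array, exposed through sequence-style operations. All names here are
// illustrative, not part of any real library.
public class PointArray {
    private static final int STRIDE = 3;      // floats per point
    private final float[] data;               // one contiguous block

    public PointArray(int length) { data = new float[length * STRIDE]; }

    public int length() { return data.length / STRIDE; }

    public float x(int i) { return data[i * STRIDE]; }
    public float y(int i) { return data[i * STRIDE + 1]; }
    public float z(int i) { return data[i * STRIDE + 2]; }

    public void set(int i, float x, float y, float z) {
        data[i * STRIDE]     = x;
        data[i * STRIDE + 1] = y;
        data[i * STRIDE + 2] = z;
    }

    // A combinator over the packed data: no boxing, no per-element
    // object allocation, just index arithmetic over one array.
    public interface PointOp { void apply(PointArray a, int i); }

    public void each(PointOp op) {
        for (int i = 0; i < length(); i++) op.apply(this, i);
    }
}
```

The point of the Factor version is that none of this boilerplate has to be written by hand: the struct-arrays vocabulary derives it from the struct class definition.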
It is interesting that Factor's core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.
The performance difference between, say, an array of pointers to boxed floats and a raw array of floats cannot be overstated. Further levels of structure make the difference even more dramatic: an array of pointers to objects, where each object has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter case, all data can be stored in one contiguous memory block that appears as a simple byte array to the garbage collector.
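To make the two layouts concrete, here is a small Java sketch (illustrative names; the allocation counts assume n points):

```java
// Layout 1: an array of pointers to objects whose fields point to boxed
// floats. For n points this means 1 + n + 3n heap objects, every one of
// which the garbage collector must trace.
class BoxedPoint {
    final Float x, y, z;   // three boxed floats per point
    BoxedPoint(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

public class Layouts {
    public static void main(String[] args) {
        int n = 4;

        BoxedPoint[] boxed = new BoxedPoint[n];
        for (int i = 0; i < n; i++) boxed[i] = new BoxedPoint(i, i, i);

        // Layout 2: one contiguous, pointer-free block of scalar data --
        // a single allocation the GC can skip over entirely.
        float[] packed = new float[n * 3]; // x0, y0, z0, x1, y1, z1, ...
        for (int i = 0; i < n; i++) {
            packed[i * 3] = i;
            packed[i * 3 + 1] = i;
            packed[i * 3 + 2] = i;
        }

        System.out.println(boxed[2].x + " " + packed[2 * 3]);
    }
}
```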
Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation of being very fast as far as managed, GC'd languages go, suffers here. Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object), but it does not offer arrays of objects where the objects are stored directly in the array. This negatively impacts performance, as you will see.
There is a buzzword that describes this approach of dealing with data: value semantics. If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.
I decided to whip up a simple benchmark. Here is what it does:
- It works on points, which are triplets of single-precision floats, (x,y,z)
- First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
- Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
- Finally, the maximum x, y, and z components are found, for all points, and this is printed out.
Note that in-place operations and re-using temporary objects are allowed.
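The steps above can be sketched in Java roughly as follows (a simplified, unoptimized illustration of the specification, not the author's tuned version):

```java
// Sketch of the benchmark spec: make points, normalize them in place,
// then find the per-component maxima. Points are stored as packed
// (x, y, z) triplets in a flat float array.
public class SillySketch {
    static float[] makePoints(int len) {
        float[] pts = new float[len * 3];
        for (int i = 0; i < len; i++) {
            double s = Math.sin(i);
            pts[i * 3]     = (float) s;
            pts[i * 3 + 1] = (float) (Math.cos(i) * 3);
            pts[i * 3 + 2] = (float) (s * s / 2);
        }
        return pts;
    }

    static void normalize(float[] pts) {
        for (int i = 0; i < pts.length; i += 3) {
            float x = pts[i], y = pts[i + 1], z = pts[i + 2];
            float norm = (float) Math.sqrt(x * x + y * y + z * z);
            pts[i] /= norm; pts[i + 1] /= norm; pts[i + 2] /= norm;
        }
    }

    static float[] maxPoint(float[] pts) {
        float mx = Float.NEGATIVE_INFINITY, my = mx, mz = mx;
        for (int i = 0; i < pts.length; i += 3) {
            mx = Math.max(mx, pts[i]);
            my = Math.max(my, pts[i + 1]);
            mz = Math.max(mz, pts[i + 2]);
        }
        return new float[] { mx, my, mz };
    }

    public static void main(String[] args) {
        float[] pts = makePoints(5000000);
        normalize(pts);
        float[] m = maxPoint(pts);
        System.out.println(m[0] + ", " + m[1] + ", " + m[2]);
    }
}
```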
Here is the code:
The Factor version is shorter, and it has more blank lines.
Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the time word, as follows:
[ "benchmark.struct-arrays" run ] time
Run it a few times so that the data heap can grow to a stable size.
The Java code is self-contained; run it with
java -Xms512m -server silly
The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to 'warm up'.
The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) runs the benchmark in 4.6 seconds, and the Factor version ran in 2.2 seconds using a Factor build from earlier this evening. I made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:
- Added intrinsics for min and max, so that when they are applied to values known to be floating point at compile time, the SSE2 instructions MINSD and MAXSD are used.
- Added some optimizations to unbox intermediate <displaced-alien> values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored 'in the middle' of a byte array.
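In Java terms, a displaced alien is roughly analogous to a small view object pairing a backing buffer with an offset (an illustrative analogy, not Factor's actual implementation):

```java
import java.nio.ByteBuffer;

// Rough analogue of a displaced alien: a base reference plus an offset,
// giving a "pointer into the middle" of a byte buffer. Factor's compiler
// tries to unbox these pairs so they live in registers instead of being
// heap-allocated two-slot objects.
public class StructView {
    private final ByteBuffer backing;  // the buffer holding many structs
    private final int offset;          // where this struct starts

    public StructView(ByteBuffer backing, int offset) {
        this.backing = backing;
        this.offset = offset;
    }

    // Field accessors for a struct of three floats at offsets 0, 4, 8.
    public float x() { return backing.getFloat(offset); }
    public float y() { return backing.getFloat(offset + 4); }
    public float z() { return backing.getFloat(offset + 8); }

    public void setX(float v) { backing.putFloat(offset, v); }
}
```

Each element of a struct array can then be presented as a view at offset i * 12, without copying any data out of the backing store.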
The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. It takes the following optimizations to get good performance out of it:
- All words declared inline are inlined, and all higher-order functions are inlined.
- Literal quotations (written in square brackets) are inlined at their call sites.
- Macros, such as sum-outputs, are expanded.
- Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases, these operations, such as adding floats, map directly to machine instructions
- The incrementing counter for initializing points is converted into a fixed-precision integer because interval and def-use analysis determine it is safe to do so
- Escape analysis determines that various temporary objects can be stack-allocated:
  - closures created by various combinators, such as tri-curry, each, and so on
  - the struct array object created with <struct-array>
  - the struct class instances created inside the many calls to each
  - the struct created at the end to store the maximum value
- Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all
- Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers
- Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating point registers, except for those that live across the sin/cos calls
- While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 instruction
- Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values
- GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made, and no registers need to be saved and restored
- The coalescing pass eliminates unnecessary register to register copy instructions that arise from SSA form, as well as ensuring that in most cases, the result of an arithmetic instruction is the same as the first operand; this helps x86 instruction selection
- The register allocator assigns virtual registers to real CPU registers; in this case there is no spilling at all on x86-64
So what does this say about language expressiveness? Any dynamic language that (a) offers the same set of metaprogramming tools as Factor and (b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. I invite you to implement this benchmark in your favorite language and share your timings.
34 comments:
I think your Java code is not completely correct: you divide by 2, but 2 is an integer. You should use float literals instead: http://www.roseindia.net/java/language/java-literals.shtml
Regards,
Markus
Nope, that will be converted automatically. After all, Java is statically typed, right?
Slava: great job! It's really impressive what you did with factor in this short time (5 years?) and so few people in the core developer team.
Adding value semantics to Java is definitely NOT impossible; in fact, there have been many proposals (and research implementations) for that over the years, the last attempt was John Rose's tuples proposal (http://blogs.sun.com/jrose/entry/tuples_in_the_vm) but this went to the big black hole that killed any large-scale Java language innovations for JDK 7's timeframe.
Not impressed with Factor code being shorter, though. It's 50 lines long; if I reformat the Java code to be just like Factor (no nested class; stuffing short methods in a single line plus the declaration line; inline temp variables used only once; delete extra timing/printing code that doesn't exist in the Factor version) I easily make it 46 lines - Factor loses just due to its USING/IN clauses. Example:
static void benchmark(int len)
{ System.out.println(maxPoints(normalizePoints(makePoints(len)))); }
This is pretty dense and "functional-looking" style for most Java programmers, but it's exactly like your Factor style.
And even if the Java code was bulkier (which is often the case - Java is definitely a verbose syntax), this is meaningless in the dynamic vs. static-typed language debate. You can have a static-typed language that's much more terse / high-level. Even Java's teenager cousin, JavaFX Script, will implement this code with much less typing due to its sequences, generators, type inference, etc. For example, makePoints() in JavaFX Script becomes:
function makePoints(len) { for (i in [0..<len]) init-point(i) }
(Notice that the language can infer both return and parameter types [the latter in the right circumstances], there is nothing "dynamic" in this code.)
I would be interested to see the results on a more modern JVM. Apple's implementation of HotSpot (particularly in 1.5) is quite antiquated. Try either the latest version of Apple's VM or (even better) OpenJDK (can be installed via MacPorts). I have no doubt that Factor will still come out on top, but I think the results will be closer.
Incidentally, my own JVM benchmarking (of other things) indicates that 8 probably isn't enough times through the benchmark to get a decent warm up. Try 15 or 20 and then see what the results say.
Latest factor (git master):
( scratchpad ) [ "benchmark.struct-arrays" run ] 20 ave-time
2013 ms ave run time - 36.98 SD (20 trials)
# java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.5.1) (ArchLinux-1.5.1-1-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
Best time (interestingly #3 of 20, hotspot doesn't need that much time to warm up apparently): 8344 ms
C# has structs with value semantics, and arrays of them are consecutive in memory. They are specifically mentioned to be useful for storing coordinates.
Can someone port the java code to C# and run on the latest CLR? That would be interesting to see.
Did you delete my comment?
You should also run this with double. You use double functions in Java, which results several times in a conversion to float, which is a non-trivial operation.
Markus
using http://paste.factorcode.org/paste?id=839
Run #0
csharp.silly+Point
Time: 1834
Run #1
csharp.silly+Point
Time: 1733
Run #2
csharp.silly+Point
Time: 1651
Run #3
csharp.silly+Point
Time: 1731
Run #4
csharp.silly+Point
Time: 1729
Run #5
csharp.silly+Point
Time: 1728
Run #6
csharp.silly+Point
Time: 1730
Run #7
csharp.silly+Point
Time: 1728
Run #8
csharp.silly+Point
Time: 1736
Run #9
csharp.silly+Point
Time: 1731
Run #10
csharp.silly+Point
Time: 1729
Run #11
csharp.silly+Point
Time: 1727
Run #12
csharp.silly+Point
Time: 1730
Run #13
csharp.silly+Point
Time: 1727
Run #14
csharp.silly+Point
Time: 1727
Run #15
csharp.silly+Point
Time: 1735
Run #16
csharp.silly+Point
Time: 1755
Run #17
csharp.silly+Point
Time: 1736
Run #18
csharp.silly+Point
Time: 1726
Run #19
csharp.silly+Point
Time: 1726
$ mono --version
Mono JIT compiler version 2.4.2.3 (tarball Fri Aug 28 18:52:39 IDT 2009)
Copyright (C) 2002-2008 Novell, Inc and Contributors. www.mono-project.com
TLS: __thread
GC: Included Boehm (with typed GC)
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
on same machine,
$ java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.5.1) (Gentoo)
OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
$ java -server -Xms512m silly
Run #0
0.8944272, 1.0, 0.4472136
Time: 8857
Run #1
0.8944272, 1.0, 0.4472136
Time: 9314
Run #2
0.8944272, 1.0, 0.4472136
Time: 9441
Run #3
0.8944272, 1.0, 0.4472136
Time: 8620
Run #4
0.8944272, 1.0, 0.4472136
Time: 8801
Run #5
0.8944272, 1.0, 0.4472136
Time: 9000
Run #6
0.8944272, 1.0, 0.4472136
Time: 8341
Run #7
0.8944272, 1.0, 0.4472136
Time: 9359
Run #8
0.8944272, 1.0, 0.4472136
Time: 8749
Run #9
0.8944272, 1.0, 0.4472136
Time: 8970
Run #10
0.8944272, 1.0, 0.4472136
Time: 8948
Run #11
0.8944272, 1.0, 0.4472136
Time: 9003
Run #12
0.8944272, 1.0, 0.4472136
Time: 9044
Run #13
0.8944272, 1.0, 0.4472136
Time: 8997
Run #14
0.8944272, 1.0, 0.4472136
Time: 8957
Run #15
0.8944272, 1.0, 0.4472136
Time: 8745
Run #16
0.8944272, 1.0, 0.4472136
Time: 9034
Run #17
0.8944272, 1.0, 0.4472136
Time: 8791
Run #18
0.8944272, 1.0, 0.4472136
Time: 9066
Run #19
0.8944272, 1.0, 0.4472136
Time: 8781
This is a Core 2 Duo E6550 (2.33GHz).
Oops, capital T in ToString(). And "override". Result is unaffected, Time: 1726
FWIW, runtime on the CLR 2.0.50727.3082 (i.e. .NET 3.5), rather than Mono, on my machine (Q6600 @ 3.8GHz) looks roughly like this:
Run #0
csharp.silly+Point
Time: 860
Run #1
csharp.silly+Point
Time: 862
Run #2
csharp.silly+Point
Time: 863
Run #3
csharp.silly+Point
Time: 860
Run #4
csharp.silly+Point
Time: 861
Run #5
csharp.silly+Point
Time: 861
Run #6
csharp.silly+Point
Time: 861
Run #7
csharp.silly+Point
Time: 862
Run #8
csharp.silly+Point
Time: 863
Run #9
csharp.silly+Point
Time: 864
Run #10
csharp.silly+Point
Time: 864
Run #11
csharp.silly+Point
Time: 864
Run #12
csharp.silly+Point
Time: 861
Run #13
csharp.silly+Point
Time: 861
Run #14
csharp.silly+Point
Time: 865
Run #15
csharp.silly+Point
Time: 863
Run #16
csharp.silly+Point
Time: 863
Run #17
csharp.silly+Point
Time: 861
Run #18
csharp.silly+Point
Time: 865
Run #19
csharp.silly+Point
Time: 864
To compare, Java 1.6.0_14 server ran:
Run #0
0.8944272, 1.0, 0.4472136
Time: 8500
Run #1
0.8944272, 1.0, 0.4472136
Time: 8375
Run #2
0.8944272, 1.0, 0.4472136
Time: 8922
Run #3
0.8944272, 1.0, 0.4472136
Time: 8844
Run #4
0.8944272, 1.0, 0.4472136
Time: 8687
Run #5
0.8944272, 1.0, 0.4472136
Time: 8860
Run #6
0.8944272, 1.0, 0.4472136
Time: 8812
Run #7
0.8944272, 1.0, 0.4472136
Time: 8688
Run #8
0.8944272, 1.0, 0.4472136
Time: 8812
It's an impressive result for Factor, but I couldn't countenance programming in an explicitly stack-oriented language for long - I'd be driven up the walls. If I were forced to write in Factor, the first thing I'd do is write a compiler that targeted Factor.
And that's actually where I find this result most interesting - Factor as a target language for compilers.
Pretty impressive results for factor.
I tried it in OCaml in a functional style and I beat java naively.
Times of 6.5s to 8.8s
Unfortunately for ocaml if I ran more than 1 benchmark after another it slowed down due to GC issues. Oh well.
Here's the code
http://paste.factorcode.org/paste?id=840
I've put up a lisp version here....
http://paste.factorcode.org/paste?id=841
SBCL runs it in about 6 seconds which on my machine is about 3 times faster than java but this did require a bit of tuning for speed.
Maybe it's the trig.
The fastest time for your Java program -
Run #6
0.8944272, 1.0, 0.4472136
Time: 8242
but if I take the reduce the angle to be within the range of +45 to -45 degrees code from the old partial-sums program then the fastest time is -
Run #3
0.8944272, 1.0, 0.4472136
Time: 4473
My crack at a C version which runs in about 800ms on my system.
if you check out the output with -verbose:gc, you will see that HALF of the time is being consumed by the GC, freeing memory.
as I've read before, the JVM JIT compiler can treat some objects as structs sometimes, allocating them on the stack instead of the heap. I can bet that this will make the JVM run as fast as Factor, or even faster.
@Osvaldo, how could we change the code so the JIT would do this?
Java 1.6.0_13 on x64 Ubuntu takes 90% of the time just to get the trig values used in Point constructor.
static void benchmark(int len)
{
    for (int n = 0; n < len; n++) {
        float x = (float) Math.sin(n);
        float y = (float) Math.cos(n) * 3;
        double s = Math.sin(n);
        float z = (float) ((s * s) / 2);
    }
}
Run #5
Time: 7402
Here's my Scala version rewritten in a functional style.
http://paste.factorcode.org/paste?id=845
It runs slightly faster than Java:
java -Xmx512m -server -cp ./target/test-classes/:./target/work/webapp/WEB-INF/lib/scala-library-2.7.3.jar bench.factor.Silly
Run #0
Point(0.8944272,1.0,0.4472136)
Time: 7059
Run #1
Point(0.8944272,1.0,0.4472136)
Time: 6968
Run #2
Point(0.8944272,1.0,0.4472136)
Time: 7023
Run #3
Point(0.8944272,1.0,0.4472136)
Time: 7045
Run #4
Point(0.8944272,1.0,0.4472136)
Time: 6997
Run #5
Point(0.8944272,1.0,0.4472136)
Time: 7018
Run #6
Point(0.8944272,1.0,0.4472136)
Time: 7013
Run #7
Point(0.8944272,1.0,0.4472136)
Time: 7013
Run #8
Point(0.8944272,1.0,0.4472136)
Time: 7008
java -Xmx512m -server -cp ./target/test-classes/ bench.factor.SillyJava
Run #0
0.8944272, 1.0, 0.4472136
Time: 8709
Run #1
0.8944272, 1.0, 0.4472136
Time: 7791
Run #2
0.8944272, 1.0, 0.4472136
Time: 7999
Run #3
0.8944272, 1.0, 0.4472136
Time: 7786
Run #4
0.8944272, 1.0, 0.4472136
Time: 7871
Run #5
0.8944272, 1.0, 0.4472136
Time: 7526
Run #6
0.8944272, 1.0, 0.4472136
Time: 8088
Run #7
0.8944272, 1.0, 0.4472136
Time: 7212
91% of the time of your Java example is spent in one function: the makePoints() function, or more specifically, the Point constructor. Nearly all of that time is spent in calling sin and cos. Why didn't you bother to check to see if Java's compiler wasn't doing a good job optimizing that? A quick hand optimization:
final static double HALFPI = Math.PI / 2;
Point(int n)
{
    double s = Math.sin(n);
    this.x = (float) s;
    /* Math.cos(n) == Math.sin(HALFPI - n) */
    this.y = (float) (Math.sin(HALFPI - n) * 3);
    this.z = (float) ((s * s) / 2);
}
... and BOOM, the Java example runs at three times its previous speed. All I did was reduce the sin and cos calls to 1.
This is pretty trivial stuff, geez.
if a language implementation needs weird, stupid idioms instead of nice code that goes with the grain of the language to be fast, there is something seriously wrong
anonymous(abez): an idiomatic OCaml version with records will be over 2X faster, since floats are unboxed, see
http://paste.factorcode.org/paste?id=846
Here, this functional OCaml version runs in 3.1 seconds, compared to 8.3s with Java 1.6.0_04-b12 (which should be better than the 1.5 JVM used by Slava).
And here's the straightforward OCaml version of abez' original one with mutation, over 4X faster than the Java one on my box (2.0s with OCaml 3.11.1 vs. 8.3s with Java 1.6.0_04-b12):
http://paste.factorcode.org/paste?id=847
Sun is aware that trig functions are slow because they don't use the CPU's Floating Point Instructions. They chose not to use them because of cross platform issues and the Intel instructions return wildly inaccurate results in some corner cases:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011
Some people work around this by using native math implementations
http://forums.sun.com/thread.jspa?threadID=5296773
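For illustration, the workaround usually takes the shape of a lookup table that trades accuracy for speed (a hedged sketch; MathX is a third-party library whose actual API is not shown here):

```java
// Sketch of the table-lookup approach to fast, low-accuracy sine.
// Illustrative only; real libraries interpolate between entries and
// handle range reduction more carefully.
public class FastTrig {
    private static final int SIZE = 4096;
    private static final float[] TABLE = new float[SIZE];
    private static final double STEP = (2 * Math.PI) / SIZE;

    static {
        // Precompute one full period of sine at table resolution.
        for (int i = 0; i < SIZE; i++) TABLE[i] = (float) Math.sin(i * STEP);
    }

    // Nearest-entry lookup; absolute error is bounded by roughly STEP / 2.
    public static float sin(double a) {
        int idx = (int) Math.round(a / STEP) % SIZE;
        if (idx < 0) idx += SIZE;
        return TABLE[idx];
    }

    public static float cos(double a) { return sin(a + Math.PI / 2); }
}
```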
The obvious Haskell version of this benchmark will be even shorter and probably as fast as the Factor-version but with the difference that it will consume about 1 MB memory. It Lazy is. Me like.
Justin Sheer is correct.
Which is why the Java and Factor version give different outputs. I've converted the java version to use doubles for comparison purposes.
Java version:
0.8944271909997111, 1.0, 0.44721359549984097
Factor version:
0.8944272994995117, 1.0, 0.4472135007381439
They differ in the X value (7th decimal) and Z value (8th decimal).
Speed is good, but I'd rather have my programs be correct than fast.
I think you should check this out: some dude went and reproduced your benchmark in Clojure: http://ideolalia.com/performance-in-clojure-and-factor
The result is an order of magnitude better in every way
Anonymous: converting it to use doubles changes the algorithm. The Factor and Java versions of the benchmark use doubles as intermediate values but store the values into the structure as floats. Factor, however, prints the single floats as doubles, which is why the results print differently; they do give the same results as single floats.
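This printing difference is easy to reproduce in Java (a small demonstration; the exact extra digits depend on the value):

```java
// A float widened to double prints with more digits: the double's string
// form shows the shortest decimal that uniquely identifies the double,
// and the double's neighbors are far closer together than the float's,
// so more digits are needed.
public class FloatPrinting {
    public static void main(String[] args) {
        float f = 0.4472136f;
        System.out.println(f);           // the float's short decimal form
        System.out.println((double) f);  // same value, more digits
    }
}
```

The widened value is bit-for-bit the same number; only the decimal rendering differs.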
Slava:
Using the original, unmodified, single-precision floating point Java code I get this result:
0.8944272, 1.0, 0.4472136
Factor gives this result:
0.8944272994995117, 1.0, 0.4472135007381439
If rounded correctly, this equals
0.8944273, 1.0, 0.4472135
Now the discrepancy is in the 7th decimal in both the X and Z values.
Either there's a bug in the trig implementation or the rounding code in one of the languages. In light of Justin Sheer's comment, it seems unlikely that it is in Java.
Anonymous: I get these results from the Factor benchmark on my computer:
0.8944271802902222, 1.0, 0.4472135901451111
I wonder why they're different on yours. What OS/CPU are you running?
Slava:
Ubuntu 9.04, 64-bit.
Intel Q6700.
I'm using Sun's JVM, version 6 update 16.
As mentioned, the performance difference is entirely down to java's strict (and slow) trig performance.
However, there are many fast (if less accurate) alternatives to java.lang.Math.
In my test, performance went from an avg of ~11500 ms to ~1650 ms using MathX instead of the standard Math.
It seems this test is mostly benchmarking trig performance.
Some further optimization improved the performance from 1650 to 1234.
So a 20 minute optimization session improved the performance from 11500 to 1234 on my machine.
That is more than 9.3 times as fast!
Interestingly, this seems to perform better on client than server VM (where performance topped at about 1500ms on my machine).