Friday, August 28, 2009

Performance comparison between Factor and Java on a contrived benchmark

Recently, Joe Groff has been working on struct classes, with the aim of completely replacing Factor's existing support for C structure types. The interesting thing about Joe's struct classes is that unlike the clunky old FFI struct support which operated on byte arrays, struct classes really are Factor classes; instances look and feel like Factor objects, there's literal syntax and prettyprinter support, and structure fields are accessed using accessor words that look exactly like Factor tuple slot accessors.

So whereas the old C structure support didn't have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.

What makes struct classes even more useful in performance-sensitive code is the related struct-arrays vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a Factor sequence. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, ...) to be used on scalar data with a very efficient in-memory representation.
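
By way of analogy only (this is hypothetical Java, not Factor), a struct array over contiguous memory can be pictured as one flat byte buffer plus accessors that address fields at computed offsets:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical analogy for a struct array of (x, y, z) float triplets:
// all data lives in one contiguous buffer, and "struct access" is just
// offset arithmetic into it.
public class StructArrayAnalogy {
    static final int STRUCT_SIZE = 12; // one point: three 4-byte floats

    final ByteBuffer data; // single contiguous backing store for every point

    StructArrayAnalogy(int length) {
        data = ByteBuffer.allocate(length * STRUCT_SIZE).order(ByteOrder.nativeOrder());
    }

    // Field accessors for the i-th point, addressing into the flat buffer.
    float getX(int i) { return data.getFloat(i * STRUCT_SIZE); }
    void  setX(int i, float v) { data.putFloat(i * STRUCT_SIZE, v); }
    float getY(int i) { return data.getFloat(i * STRUCT_SIZE + 4); }
    void  setY(int i, float v) { data.putFloat(i * STRUCT_SIZE + 4, v); }
    float getZ(int i) { return data.getFloat(i * STRUCT_SIZE + 8); }
    void  setZ(int i, float v) { data.putFloat(i * STRUCT_SIZE + 8, v); }
}
```

In Factor, the struct-arrays vocabulary generates this kind of offset-computing accessor for you and wraps the whole thing in the standard sequence protocol, which is what lets map and friends run over packed scalar data.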

It is interesting that Factor's core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.

The performance difference between, say, an array of pointers to boxed floats and a raw array of floats cannot be overstated. Further levels of structure make the difference even more dramatic: an array of pointers to objects, where each one has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter case, all the data can be stored in one contiguous memory block that appears as a simple byte array to the garbage collector.

Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation of being very fast as far as managed, GC'd languages go, suffers here. Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object), but it does not offer arrays of objects where the objects are stored directly in the array. This negatively impacts performance, as you will see.
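
To make the limitation concrete, here is a small hypothetical Java illustration: an array of Point objects is an array of references into the heap, while the flat float[] equivalent, which Java makes you build by hand, keeps all the scalar data in one block:

```java
// Hypothetical illustration of the two representations discussed above.
public class Representations {
    static class Point { float x, y, z; }

    // Array of references: n + 1 heap objects, data reached via pointers.
    static Point[] boxedPoints(int n) {
        Point[] pts = new Point[n];
        for (int i = 0; i < n; i++) {
            pts[i] = new Point();       // each element is its own allocation
            pts[i].x = i;
        }
        return pts;
    }

    // Flat layout: 1 heap object, all scalar data contiguous in memory.
    // Java forces you to do this flattening by hand; there is no built-in
    // array-of-structs, which is the point being made here.
    static float[] flatPoints(int n) {
        float[] pts = new float[3 * n]; // x,y,z of point i at 3*i .. 3*i+2
        for (int i = 0; i < n; i++) pts[3 * i] = i;
        return pts;
    }
}
```

The hand-flattened version gives up the Point abstraction entirely; Factor's struct classes are interesting precisely because they keep the object-like interface while using the flat layout underneath.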

There is a buzzword that describes this approach of dealing with data: value semantics. If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.

I decided to whip up a simple benchmark. Here is what it does:
  • It works on points, which are triplets of single-precision floats, (x,y,z)
  • First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
  • Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
  • Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

Note that in-place operations and re-using temporary objects are allowed.
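
The steps above can be sketched in plain Java roughly as follows (an illustrative sketch only, not the original listing that was measured):

```java
// Illustrative sketch of the benchmark steps in plain Java
// (hypothetical code, not the measured original).
public class SillySketch {
    static class Point {
        float x, y, z;

        // The i-th point is (sin(i), cos(i)*3, sin(i)*sin(i)/2).
        Point(int i) {
            double s = Math.sin(i);
            x = (float) s;
            y = (float) (Math.cos(i) * 3);
            z = (float) (s * s / 2);
        }

        // Divide each component by sqrt(x*x + y*y + z*z), in place.
        void normalize() {
            float len = (float) Math.sqrt(x * x + y * y + z * z);
            x /= len; y /= len; z /= len;
        }
    }

    static Point[] makePoints(int len) {
        Point[] pts = new Point[len];
        for (int i = 0; i < len; i++) pts[i] = new Point(i);
        return pts;
    }

    // Find the maximum x, y and z components over all points.
    static Point maxPoints(Point[] pts) {
        Point max = new Point(0);
        max.x = max.y = max.z = Float.NEGATIVE_INFINITY;
        for (Point p : pts) {
            max.x = Math.max(max.x, p.x);
            max.y = Math.max(max.y, p.y);
            max.z = Math.max(max.z, p.z);
        }
        return max;
    }

    public static void main(String[] args) {
        Point[] pts = makePoints(5000000); // the benchmark uses 5,000,000 points
        for (Point p : pts) p.normalize();
        Point m = maxPoints(pts);
        System.out.println(m.x + ", " + m.y + ", " + m.z);
    }
}
```

Note how every element of the Point[] here is a separately allocated heap object, which is exactly the representation cost discussed above.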

Here is the code:

The Factor version is shorter, and it has more blank lines.

Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the time word, as follows:
[ "benchmark.struct-arrays" run ] time

Run it a few times so that the data heap can grow to a stable size.

The Java code is self-contained; run it with
java -Xms512m -server silly

The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to 'warm up'.

The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) runs the benchmark in 4.6 seconds, while the Factor version runs in 2.2 seconds using a Factor build from earlier this evening. I made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:
  • Added intrinsics for min and max so that when they are applied to values known to be floating point at compile-time, the SSE2 instructions MINSD and MAXSD are used.
  • Added some optimizations to unbox intermediate <displaced-alien> values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored 'in the middle' of a byte array.
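
In hypothetical Java terms, a displaced alien is roughly a pair of (base block, byte offset) that field accessors address through:

```java
// Hypothetical sketch of a "displaced alien": a base memory block plus a
// byte offset, through which a struct "in the middle" of an array is reached.
public class DisplacedAlien {
    final byte[] base; // the underlying byte array
    final int offset;  // byte offset of the struct within it

    DisplacedAlien(byte[] base, int offset) {
        this.base = base;
        this.offset = offset;
    }

    // Field accessors address relative to the displacement.
    byte readByteField(int fieldOffset) {
        return base[offset + fieldOffset];
    }

    void writeByteField(int fieldOffset, byte value) {
        base[offset + fieldOffset] = value;
    }
}
```

The unboxing optimization described above amounts to keeping the base and offset in registers in hot loops, instead of heap-allocating this two-slot object for every element visited.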

The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. It takes the following optimizations to get good performance out of it:
  • All words declared inline are inlined, all higher-order functions are inlined.
  • Literal quotations (written in square brackets) are inlined at their call sites.
  • Macros, such as sum-outputs, are expanded.
  • Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases, these operations, such as adding floats, map directly to machine instructions.
  • The incrementing counter for initializing points is converted into a fixed-precision integer, because interval and def-use analysis determine it is safe to do so.
  • Escape analysis determines that various temporary objects can be stack-allocated:
    • closures created by various combinators, such as tri-curry, each, and so on
    • the struct array object created with <struct-array>
    • the struct class instances created inside the many calls to each
    • the struct created at the end to store the maximum value
  • Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all
  • Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers
  • Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating point registers, except for those that live across the sin/cos calls
  • While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 instruction
  • Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values
  • GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made, and no registers need to be saved and restored
  • The coalescing pass eliminates unnecessary register to register copy instructions that arise from SSA form, as well as ensuring that in most cases, the result of an arithmetic instruction is the same as the first operand; this helps x86 instruction selection
  • The register allocator assigns virtual registers to real CPU registers; in this case there is no spilling at all on x86-64

So what does this say about language expressiveness? Any dynamic language that a) offers the same set of metaprogramming tools as Factor and b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. I invite you to implement this benchmark in your favorite language and share your timings.

34 comments:

  1. I think your Java code is not completely correct. you divide by 2. But 2 is an Integer. You should use float literals instead: http://www.roseindia.net/java/language/java-literals.shtml
    Regards,
    Markus

    ReplyDelete
  2. Anonymous 8:13 AM

    Nope, that will be converted automatically. After all, Java is statically typed, right?

    Slava: great job! It's really impressive what you've done with Factor in this short time (5 years?) and with so few people on the core developer team.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Adding value semantics to Java is definitely NOT impossible; in fact, there have been many proposals (and research implementations) for that over the years. The most recent attempt was John Rose's tuples proposal (http://blogs.sun.com/jrose/entry/tuples_in_the_vm), but this went into the big black hole that killed any large-scale Java language innovation in JDK 7's timeframe.

    Not impressed with Factor code being shorter, though. It's 50 lines long; if I reformat the Java code to be just like Factor (no nested class; stuffing short methods into a single line plus the declaration line; inlining temp variables used only once; deleting extra timing/printing code that doesn't exist in the Factor version) I easily make it 46 lines - Factor loses just due to its USING/IN clauses. Example:

    static void benchmark(int len)
    { System.out.println(maxPoints(normalizePoints(makePoints(len)))); }

    This is pretty dense and "functional-looking" style for most Java programmers, but it's exactly like your Factor style.

    And even if the Java code were bulkier (which is often the case - Java definitely has a verbose syntax), this is meaningless in the dynamic vs. statically-typed language debate. You can have a statically-typed language that's much more terse / high-level. Even Java's teenage cousin, JavaFX Script, will implement this code with much less typing due to its sequences, generators, type inference, etc. For example, makePoints() in JavaFX Script becomes:

    function makePoints(len) { for (i in [0..<len]) init-point(i) }

    (Notice that the language can infer both return and parameter types [the latter in the right circumstances], there is nothing "dynamic" in this code.)

    ReplyDelete
  5. I would be interested to see the results on a more modern JVM. Apple's implementation of HotSpot (particularly in 1.5) is quite antiquated. Try either the latest version of Apple's VM or (even better) OpenJDK (can be installed via MacPorts). I have no doubt that Factor will still come out on top, but I think the results will be closer.

    Incidentally, my own JVM benchmarking (of other things) indicates that 8 probably isn't enough times through the benchmark to get a decent warm up. Try 15 or 20 and then see what the results say.

    ReplyDelete
  6. Anonymous 10:46 AM

    Latest factor (git master):
    ( scratchpad ) [ "benchmark.struct-arrays" run ] 20 ave-time
    2013 ms ave run time - 36.98 SD (20 trials)

    # java -version
    java version "1.6.0_0"
    OpenJDK Runtime Environment (IcedTea6 1.5.1) (ArchLinux-1.5.1-1-x86_64)
    OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)

    Best time (interestingly #3 of 20, hotspot doesn't need that much time to warm up apparently): 8344 ms

    ReplyDelete
  7. C# has structs with value semantics, and arrays of them are consecutive in memory. They are specifically mentioned to be useful for storing coordinates.

    Can someone port the java code to C# and run on the latest CLR? That would be interesting to see.

    ReplyDelete
  8. Did you delete my comment?
    You should also run this with double. You use double functions in Java, which results in conversions to float being done several times; this is a non-trivial operation.
    Markus

    ReplyDelete
  9. using http://paste.factorcode.org/paste?id=839

    Run #0
    csharp.silly+Point
    Time: 1834
    Run #1
    csharp.silly+Point
    Time: 1733
    Run #2
    csharp.silly+Point
    Time: 1651
    Run #3
    csharp.silly+Point
    Time: 1731
    Run #4
    csharp.silly+Point
    Time: 1729
    Run #5
    csharp.silly+Point
    Time: 1728
    Run #6
    csharp.silly+Point
    Time: 1730
    Run #7
    csharp.silly+Point
    Time: 1728
    Run #8
    csharp.silly+Point
    Time: 1736
    Run #9
    csharp.silly+Point
    Time: 1731
    Run #10
    csharp.silly+Point
    Time: 1729
    Run #11
    csharp.silly+Point
    Time: 1727
    Run #12
    csharp.silly+Point
    Time: 1730
    Run #13
    csharp.silly+Point
    Time: 1727
    Run #14
    csharp.silly+Point
    Time: 1727
    Run #15
    csharp.silly+Point
    Time: 1735
    Run #16
    csharp.silly+Point
    Time: 1755
    Run #17
    csharp.silly+Point
    Time: 1736
    Run #18
    csharp.silly+Point
    Time: 1726
    Run #19
    csharp.silly+Point
    Time: 1726

    $ mono --version
    Mono JIT compiler version 2.4.2.3 (tarball Fri Aug 28 18:52:39 IDT 2009)
    Copyright (C) 2002-2008 Novell, Inc and Contributors. www.mono-project.com
    TLS: __thread
    GC: Included Boehm (with typed GC)
    SIGSEGV: altstack
    Notifications: epoll
    Architecture: amd64
    Disabled: none

    on same machine,

    $ java -version
    java version "1.6.0_0"
    OpenJDK Runtime Environment (IcedTea6 1.5.1) (Gentoo)
    OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)

    $ java -server -Xms512m silly
    Run #0
    0.8944272, 1.0, 0.4472136
    Time: 8857
    Run #1
    0.8944272, 1.0, 0.4472136
    Time: 9314
    Run #2
    0.8944272, 1.0, 0.4472136
    Time: 9441
    Run #3
    0.8944272, 1.0, 0.4472136
    Time: 8620
    Run #4
    0.8944272, 1.0, 0.4472136
    Time: 8801
    Run #5
    0.8944272, 1.0, 0.4472136
    Time: 9000
    Run #6
    0.8944272, 1.0, 0.4472136
    Time: 8341
    Run #7
    0.8944272, 1.0, 0.4472136
    Time: 9359
    Run #8
    0.8944272, 1.0, 0.4472136
    Time: 8749
    Run #9
    0.8944272, 1.0, 0.4472136
    Time: 8970
    Run #10
    0.8944272, 1.0, 0.4472136
    Time: 8948
    Run #11
    0.8944272, 1.0, 0.4472136
    Time: 9003
    Run #12
    0.8944272, 1.0, 0.4472136
    Time: 9044
    Run #13
    0.8944272, 1.0, 0.4472136
    Time: 8997
    Run #14
    0.8944272, 1.0, 0.4472136
    Time: 8957
    Run #15
    0.8944272, 1.0, 0.4472136
    Time: 8745
    Run #16
    0.8944272, 1.0, 0.4472136
    Time: 9034
    Run #17
    0.8944272, 1.0, 0.4472136
    Time: 8791
    Run #18
    0.8944272, 1.0, 0.4472136
    Time: 9066
    Run #19
    0.8944272, 1.0, 0.4472136
    Time: 8781

    This is a Core 2 Duo E6550 (2.33GHz).

    ReplyDelete
  10. Oops, capital T in ToString(). And "override". Result is unaffected, Time: 1726

    ReplyDelete
  11. FWIW, runtime on the CLR 2.0.50727.3082 (i.e. .NET 3.5), rather than Mono, on my machine (Q6600 @ 3.8GHz) looks roughly like this:

    Run #0
    csharp.silly+Point
    Time: 860
    Run #1
    csharp.silly+Point
    Time: 862
    Run #2
    csharp.silly+Point
    Time: 863
    Run #3
    csharp.silly+Point
    Time: 860
    Run #4
    csharp.silly+Point
    Time: 861
    Run #5
    csharp.silly+Point
    Time: 861
    Run #6
    csharp.silly+Point
    Time: 861
    Run #7
    csharp.silly+Point
    Time: 862
    Run #8
    csharp.silly+Point
    Time: 863
    Run #9
    csharp.silly+Point
    Time: 864
    Run #10
    csharp.silly+Point
    Time: 864
    Run #11
    csharp.silly+Point
    Time: 864
    Run #12
    csharp.silly+Point
    Time: 861
    Run #13
    csharp.silly+Point
    Time: 861
    Run #14
    csharp.silly+Point
    Time: 865
    Run #15
    csharp.silly+Point
    Time: 863
    Run #16
    csharp.silly+Point
    Time: 863
    Run #17
    csharp.silly+Point
    Time: 861
    Run #18
    csharp.silly+Point
    Time: 865
    Run #19
    csharp.silly+Point
    Time: 864

    To compare, Java 1.6.0_14 server ran:

    Run #0
    0.8944272, 1.0, 0.4472136
    Time: 8500
    Run #1
    0.8944272, 1.0, 0.4472136
    Time: 8375
    Run #2
    0.8944272, 1.0, 0.4472136
    Time: 8922
    Run #3
    0.8944272, 1.0, 0.4472136
    Time: 8844
    Run #4
    0.8944272, 1.0, 0.4472136
    Time: 8687
    Run #5
    0.8944272, 1.0, 0.4472136
    Time: 8860
    Run #6
    0.8944272, 1.0, 0.4472136
    Time: 8812
    Run #7
    0.8944272, 1.0, 0.4472136
    Time: 8688
    Run #8
    0.8944272, 1.0, 0.4472136
    Time: 8812

    ReplyDelete
  12. It's an impressive result for Factor, but I couldn't countenance programming in an explicitly stack-oriented language for long - I'd be driven up the walls. If I were forced to write in Factor, the first thing I'd do is write a compiler that targeted Factor.

    And that's actually where I find this result most interesting - Factor as a target language for compilers.

    ReplyDelete
  13. Anonymous 7:14 PM

    Pretty impressive results for factor.

    I tried it in OCaml in a functional style and I beat java naively.

    Times of 6.5s to 8.8s

    Unfortunately for ocaml if I ran more than 1 benchmark after another it slowed down due to GC issues. Oh well.

    Here's the code
    http://paste.factorcode.org/paste?id=840

    ReplyDelete
  14. I've put up a lisp version here....

    http://paste.factorcode.org/paste?id=841

    SBCL runs it in about 6 seconds which on my machine is about 3 times faster than java but this did require a bit of tuning for speed.

    ReplyDelete
  15. Maybe it's the trig.

    The fastest time for your Java program -

    Run #6
    0.8944272, 1.0, 0.4472136
    Time: 8242

    but if I take the angle-reduction code (reducing the angle to within the range of +45 to -45 degrees) from the old partial-sums program, then the fastest time is -

    Run #3
    0.8944272, 1.0, 0.4472136
    Time: 4473

    ReplyDelete
  16. My crack at a C version which runs in about 800ms on my system.

    ReplyDelete
  17. If you check out the output with -verbose:gc, you will see that HALF of the time is being consumed by the GC, freeing memory.

    As I've read before, the JVM JIT compiler can sometimes treat some objects as structs, allocating them on the stack instead of the heap. I can bet that this would make the JVM run as fast as Factor, or even faster.

    @Osvaldo, how could we change the code so the JIT would do this?

    ReplyDelete
  18. Java 1.6.0_13 on x64 Ubuntu takes 90% of the time just to get the trig values used in the Point constructor.

    static void benchmark(int len)
    {
        for (int n = 0; n < len; n++) {
            float x = (float) Math.sin(n);
            float y = (float) Math.cos(n) * 3;
            double s = Math.sin(n);
            float z = (float) ((s * s) / 2);
        }
    }

    Run #5
    Time: 7402

    ReplyDelete
  19. Here's my Scala version rewritten in a functional style.

    http://paste.factorcode.org/paste?id=845

    It runs slightly faster than Java:

    java -Xmx512m -server -cp ./target/test-classes/:./target/work/webapp/WEB-INF/lib/scala-library-2.7.3.jar bench.factor.Silly
    Run #0
    Point(0.8944272,1.0,0.4472136)
    Time: 7059
    Run #1
    Point(0.8944272,1.0,0.4472136)
    Time: 6968
    Run #2
    Point(0.8944272,1.0,0.4472136)
    Time: 7023
    Run #3
    Point(0.8944272,1.0,0.4472136)
    Time: 7045
    Run #4
    Point(0.8944272,1.0,0.4472136)
    Time: 6997
    Run #5
    Point(0.8944272,1.0,0.4472136)
    Time: 7018
    Run #6
    Point(0.8944272,1.0,0.4472136)
    Time: 7013
    Run #7
    Point(0.8944272,1.0,0.4472136)
    Time: 7013
    Run #8
    Point(0.8944272,1.0,0.4472136)
    Time: 7008


    java -Xmx512m -server -cp ./target/test-classes/ bench.factor.SillyJava
    Run #0
    0.8944272, 1.0, 0.4472136
    Time: 8709
    Run #1
    0.8944272, 1.0, 0.4472136
    Time: 7791
    Run #2
    0.8944272, 1.0, 0.4472136
    Time: 7999
    Run #3
    0.8944272, 1.0, 0.4472136
    Time: 7786
    Run #4
    0.8944272, 1.0, 0.4472136
    Time: 7871
    Run #5
    0.8944272, 1.0, 0.4472136
    Time: 7526
    Run #6
    0.8944272, 1.0, 0.4472136
    Time: 8088
    Run #7
    0.8944272, 1.0, 0.4472136
    Time: 7212

    ReplyDelete
  20. Anonymous 4:36 AM

    91% of the time of your Java example is spent in one function: the makePoints() function, or more specifically, the Point constructor. Nearly all of that time is spent in calling sin and cos. Why didn't you bother to check whether Java's compiler was doing a good job optimizing that? A quick hand optimization:

    final static double HALFPI = Math.PI / 2;

    Point(int n)
    {
        double s = Math.sin(n);
        this.x = (float) s;
        /* Math.cos(n) = Math.sin(Pi/2 - n) */
        this.y = (float) ((HALFPI - n) * 3);
        this.z = (float) ((s * s) / 2);
    }

    ... and BOOM, the Java example runs at three times its previous speed. All I did was reduce the sin and cos calls to 1.

    This is pretty trivial stuff, geez.

    ReplyDelete
  21. If a language implementation needs weird, stupid idioms to be fast, instead of nice code that goes with the grain of the language, there is something seriously wrong.

    ReplyDelete
  22. anonymous(abez): an idiomatic OCaml version with records will be over 2X faster, since floats are unboxed, see

    http://paste.factorcode.org/paste?id=846

    Here, this functional OCaml version runs in 3.1 seconds, compared to 8.3s with Java 1.6.0_04-b12 (which should be better than the 1.5 JVM used by Slava).

    ReplyDelete
  23. And here's the straightforward OCaml version of abez' original one with mutation, over 4X faster than the Java one on my box (2.0s with OCaml 3.11.1 vs. 8.3s with Java 1.6.0_04-b12):

    http://paste.factorcode.org/paste?id=847

    ReplyDelete
  24. This comment has been removed by the author.

    ReplyDelete
  25. Sun is aware that trig functions are slow because they don't use the CPU's floating-point instructions. They chose not to use them because of cross-platform issues, and because the Intel instructions return wildly inaccurate results in some corner cases:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

    Some people work around this by using native math implementations

    http://forums.sun.com/thread.jspa?threadID=5296773

    ReplyDelete
  26. LazyDude 1:40 PM

    The obvious Haskell version of this benchmark will be even shorter and probably as fast as the Factor-version but with the difference that it will consume about 1 MB memory. It Lazy is. Me like.

    ReplyDelete
  27. Anonymous 2:01 AM

    Justin Sheer is correct, which is why the Java and Factor versions give different outputs. I've converted the Java version to use doubles for comparison purposes.

    Java version:
    0.8944271909997111, 1.0, 0.44721359549984097

    Factor version:
    0.8944272994995117, 1.0, 0.4472135007381439

    They differ in the X value (7th decimal) and Z value (8th decimal).

    Speed is good, but I'd rather have my programs be correct than fast.

    ReplyDelete
  28. Anonymous 4:30 AM

    I think you should check this out, some dude went and reproduced your benchmark in Clojure: http://ideolalia.com/performance-in-clojure-and-factor

    The result is an order of magnitude better in every way

    ReplyDelete
  29. Anonymous: converting it to use doubles changes the algorithm. The Factor and Java versions of the benchmark use doubles as intermediate values but store the values into the structure as floats. Factor prints the single floats as doubles, which is why the results print differently; however, they do give the same results as single floats.

    ReplyDelete
  30. Anonymous 7:47 AM

    Slava:
    Using the original, unmodified, single-precision floating point Java code I get this result:
    0.8944272, 1.0, 0.4472136

    Factor gives this result:
    0.8944272994995117, 1.0, 0.4472135007381439

    If rounded correctly, this equals
    0.8944273, 1.0, 0.4472135

    Now the discrepancy is in the 7th decimal in both the X and Z values.

    Either there's a bug in the trig implementation or the rounding code in one of the languages. In light of Justin Sheer's comment, it seems unlikely that it is in Java.

    ReplyDelete
  31. Anonymous: I get these results from the Factor benchmark on my computer:

    0.8944271802902222, 1.0, 0.4472135901451111

    I wonder why they're different on yours. What OS/CPU are you running?

    ReplyDelete
  32. Anonymous 1:06 AM

    Slava:
    Ubuntu 9.04, 64-bit.
    Intel Q6700.

    I'm using Sun's JVM, version 6 update 16.

    ReplyDelete
  33. Anonymous 4:38 AM

    As mentioned, the performance difference is entirely down to Java's strict (and slow) trig performance.

    However, there are many fast (if less accurate) alternatives to java.lang.Math.

    In my test, performance went from an avg of ~11500 ms to ~1650 ms using MathX instead of the standard Math.

    It seems this test is mostly benchmarking trig performance.

    ReplyDelete
  34. Anonymous 5:01 AM

    Some further optimization improved the performance from 1650 to 1234.
    So a 20 minute optimization session improved the performance from 11500 to 1234 on my machine.
    That is more than 9.3 times as fast!

    Interestingly, this seems to perform better on the client VM than on the server VM (where performance topped out at about 1500 ms on my machine).

    ReplyDelete