Saturday, April 19, 2008

Performance improvements

Over the last three days I spent some time improving Factor's compiler. I made the following improvements:
  • Partial dispatch is performed on integer arithmetic operations. Previously, Factor's compiler would convert generic arithmetic to machine arithmetic when it knew the exact types involved; so if you had two floats on the stack, + would become float+, and if you had two fixnums, it would become fixnum+ or fixnum+fast, depending on whether interval inference determined that the overflow check could be omitted. While this worked well in many cases, there were a lot of instances where the compiler could only infer that you were dealing with integers, but not fixnums in particular, either because interval inference was not smart enough, or because the values really could be out of bounds for a fixnum. Now, if the compiler knows that both inputs are integers, it compiles a call to a special word which still performs a dispatch, but with only two possibilities. This is a win, because a conditional is faster than the jump table used by the generic +. It is an even bigger win if one of the inputs is known to be a fixnum, because then the two jump table dispatches are replaced by a single conditional.
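    As a rough illustration, such a two-way dispatch word might look like the following hypothetical sketch (not the compiler's actual output; fixnum+ and bignum+ are in math.private):
    ! Dispatch on two values known to be integers, but not
    ! necessarily fixnums: one conditional instead of a jump table.
    : integer+ ( x y -- z )
        2dup [ fixnum? ] both? [
            fixnum+                     ! overflow-checked fixnum add
        ] [
            [ >bignum ] bi@ bignum+     ! general bignum add
        ] if ;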
  • Improved overflow check elimination. First, there is an enabling optimization. Suppose we have a positive integer on the stack. Then, the following two are equivalent:
    4096 mod
    4095 bitand

    The optimizer now recognizes the case where mod is applied with a power of two to a positive integer, and converts it to a bitwise and. For the rem word an even more general optimization is possible; we only have to assume the first input is an integer and the second is a power of two.
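
    For example, a quick listener check (with an arbitrary negative input) shows why rem admits the more general rule while mod does not:
    -5 4096 mod .     ! -5
    -5 4096 rem .     ! 4091
    -5 4095 bitand .  ! 4091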

    While on a modern CPU this is not a big win in itself for mod (it is for rem, which is more expensive), it does enable the following optimization. Note that the following two expressions are equivalent:
    4095 bitand
    16777215 bitand 4095 bitand

    That is, if you mask off the low n bits, then the low m bits, where m<n, you get the same result as just masking off the low m bits. This might seem trivial and useless at first, but then you realize that a truncating conversion of a positive bignum to a fixnum is simply masking off bits! So the following two are equivalent:
    4095 bitand
    >fixnum 4095 bitand

    But of course, both inputs to the latter bitand are now fixnums, so it can be converted to:
    4095 bitand
    >fixnum 4095 fixnum-bitand
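
    As a quick sanity check of the masking identity in the listener (the input value is arbitrary):
    123456789 4095 bitand .                   ! 3349
    123456789 16777215 bitand 4095 bitand .   ! 3349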

    It gets even better. Consider the following piece of code from project-euler.150:
    : next ( t -- new-t s )
    615949 * 797807 + 20 2^ rem dup 19 2^ - ;

    Constant folding gives us:
    : next ( t -- new-t s )
    615949 * 797807 + 1048576 rem dup 524288 - ;

    The above optimization, together with overflow check removal on the -, gives us:
    : next ( t -- new-t s )
    615949 * 797807 + >fixnum 1048575 fixnum-bitand dup 524288 fixnum-fast ;

    There is an existing optimization that takes advantage of these identities:
    * >fixnum == [ >fixnum ] bi@ fixnum*fast
    + >fixnum == [ >fixnum ] bi@ fixnum+fast

    By applying the above, the compiler converts this code to:
    : next ( t -- new-t s )
    >fixnum 615949 fixnum*fast 797807 fixnum+fast 1048575 fixnum-bitand dup 524288 fixnum-fast ;

    All generic arithmetic and overflow checks have been removed, because of a single rem in the middle of the calculation!

    In project-euler.150, this word was actually used inside a loop, where it was iteratively applied to a starting generator value. With the new modular arithmetic optimization, Factor's existing interval inference code managed to infer that the word's results always lie in the interval (-2^20,2^20): the masked value new-t is in [0,2^20), and s = new-t - 2^19 lies in [-2^19,2^19). Since the initial value is 0, even the call to >fixnum was optimized away.
  • I improved Factor's type inference algorithm to better deal with local recursive code blocks. While type inference worked okay with loops, it didn't fare so well with binary recursive functions, because it wasn't able to obtain any information about the return values of the recursive calls. Everybody's favorite binary recursive benchmark, fibonacci, was one example of this situation. Solving this required a bit of thought.
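
    For reference, here is the shape of the code in question (a sketch along the lines of the shootout's fib, not the exact definition from benchmark.recursive):
    ! Binary recursion: previously the two recursive calls to fib
    ! contributed no output type information, so + stayed generic.
    : fib ( n -- m )
        dup 2 < [ drop 1 ] [ [ 1 - fib ] [ 2 - fib ] bi + ] if ;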

    Previously, type inference for recursive functions was done in a pretty ad-hoc manner. Factor's type inference is only used for optimization, so it is okay if it is conservative, and it worked pretty well. However, today I decided to add better support for recursion, and since I also found a bug where it would infer invalid types, I decided to redo it a little.

    Type inference of recursive functions is now an iterative process: you begin by assuming the recursive calls take the top type as inputs and the bottom type as outputs, infer the type of the body under these assumptions, apply the resulting types to the recursive calls, then infer again, and so on, until a fixed point is reached. For the fib sketch above, the first pass infers fixnum from the base case alone; the second pass, knowing the recursive calls can return fixnums, widens the output of + to integer, since the sum can overflow; the third pass infers integer again, and we have our fixed point. There are smarter ways of doing this with other type systems, but in Factor the type functions of some words are pretty complex. For example, the output type of + depends not only on the types of its inputs, but also on the interval ranges they lie in, so I'm not sure if there's any other solution.

    With this out of the way, there is one major remaining limitation in Factor's type inference: it only works within words. It does not attempt to infer types across word boundaries. In practice this means many optimizations only kick in if a lot of words are inlined, which is not practical in some cases due to code size explosion. My next project in this area is to cache type signatures for words and use this information when inferring the types of callers.
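
    To illustrate the limitation with a hypothetical example, given
    : small ( x -- y ) 127 bitand ;
    : use-small ( x -- y ) small 1 + ;
    the compiler currently learns nothing about the output of small when compiling use-small unless small is inlined, so the + there compiles as generic arithmetic.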
  • I improved the performance of inline object allocation. In the VM, the structure holding the allocation pointer is now stored directly in a global variable, instead of a pointer to it being stored in the global variable. This means one less indirection for inline allocators. Another improvement is that the heap exhaustion check is done in the allocator itself, so a call into the VM is avoided if the heap is not full. This saves a subroutine call in the common case, along with some saving and restoring of registers.

These changes have led to some performance improvements on the two benchmarks I was working with. My computer here is a MacBook Pro with a 2.4 GHz Intel Core 2 Duo.

The project-euler.150 benchmark saw its runtime decrease from 87 seconds to 22 seconds.

The benchmark.recursive benchmark, which is a Factor implementation of the recursive benchmark from the Computer Language Shootout, saw its runtime decrease from 27 seconds to 9 seconds.

For comparison, SBCL runs the recursive benchmark in 3.5 seconds, and Java runs it in 1.6 seconds.

Using the java -Xint result of 22 seconds, I guesstimated from the results for all languages on the shootout that Factor is at around the performance of the Erlang HiPE JIT, and slightly faster than the Python Psyco JIT and GForth. By anybody's standard, this is anywhere between "not very fast" and "bloody slow", but it's slowly improving.

Now I'll end with a rant about the language shootout.

The only so-called "dynamic" language implementations which come close to the performance of Java and C on this benchmark are SBCL, Ikarus Scheme, and Chicken Scheme. However, all these benchmarks are actually static programs in disguise, peppered with type declarations and even unsafe low-level features. The Ikarus and Chicken versions even implement the same functions twice: once for integer inputs, using integer arithmetic primitives, and again for float inputs, using float arithmetic primitives. Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest.

The Haskell benchmarks suffer from this also. Many benchmarks seem to use malloc and pointer arithmetic, and exclamation points (strictness annotations) abound. The n-body benchmark contains this gem:
planets :: Ptr Double
planets = unsafePerformIO $ mallocBytes (7 * nbodies * 8) -- sizeOf double = 8

I would expect better from the Haskell people than hardcoding the size of a double to do pointer arithmetic on manually-managed memory!

Right now I have no idea how idiomatic Haskell performs in practice; from looking at these benchmarks, it is clear to me that if one writes C in Haskell, pretty decent performance is possible, but that's not saying much. You can write C in any language.

The runtime of the Factor recursive benchmark improves a further two-fold if I manually replace generic arithmetic with unsafe machine arithmetic, but I'm not interested in writing such code. I don't want to expose low-level primitives to the user, document their use as necessary for performance-critical code, and declare that my job is done.

18 comments:

Allen Short said...

Well, in a certain sense you can't write C in any language. The shootout ends up being more about the strength of a language's runtime and what low-level operations it exposes, to be sure. But that's valuable information at some level. It would certainly be nice to see a version that only included idiomatic examples... but I suspect that would be hard to do in a fair manner.

Anonymous said...

Of course idiomatic examples would skew the results, because idiomatic programming is rarely "optimized" programming.

It would then be fair to say that all the code examples were coded fairly idiomatically ... not fairly for optimization. You would expect disparities in the results, perhaps even massive ones.

Anonymous said...

I'd love to see a benchmark of point-free versus the Ruby way!

gwenhwyfaer said...

nitpick: is "4096 bitand" a typo for "4095 bitand"? (feel free to delete this comment)

Ken Causey said...

On the one hand I agree with you. Being curious about some language in the past I have gone by the Shootout site to look at some 'real world' source code. In pretty much every case I come away disappointed or downright disgusted.

On the other hand when using any language there are times when code optimization is important (far less often than many think of course, and one should always look for algorithmic optimizations first) and the Shootout in its current form does, in theory, give you some idea of the lower bounds of a language if you find yourself in a must-optimize position.

Slava Pestov said...

gwenhwyfaer: thanks, fixed.

Anonymous said...

SBCL, being an implementation of Common Lisp, includes support for static typing specifically for situations where optimization is important. Yes, most code in most Common Lisp programs isn't written that way, because the optimization it provides isn't worthwhile in those cases, but the ability to use static typing is still very much a part of the language. It exists to be used.

Why should any optimization feature in any language ever be ignored when writing a program with the goal of making it as optimized as possible? It strikes me as even more dishonest to try to abide by some inherently inexact "write optimized programs without optimizing them" rule than to show what the languages can truly do when needed.

Unknown said...

Slava, on the one hand, like you, I'd like to see how each language does on these benchmarks when the programs are written idiomatically.

But, on the other hand, one of the holy grails of programming language design/implementation would be the ability to have a single language that could be used in a high-level style for most parts of a program and a low-level style for the performance-critical portions. In this sense, it is interesting to see how each language can be pushed.

Isaac Gouy said...

Ken Causey said I have gone by the Shootout site to look at some 'real world' source code. In pretty much every case I come away disappointed or downright disgusted.

Whatever "real world" source code means

- What made you think you would find that on the benchmarks game website?

- What makes you think you didn't find that on the benchmarks game website?

Doesn't code get optimized in 'the real world'?

Isaac Gouy said...

Slava Pestov wrote Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest.

You are able to make complaints because you are shown the source code.

You may have preferred that the programs were written differently but you have no reason to accuse anyone of dishonesty.

Unknown said...

Nitpick:
16777216 bitand 4095 bitand

First should be 16777215, no?

Slava Pestov said...

Thanks simen, I've fixed it now. I'm glad some people actually read the article. :-)

Anonymous said...

You can write C in any language.

Hmm... And what about writing C in Factor? Can we really increase the performance of similar benchmarks with ugly, unsafe, hand-coded optimisations? If not, this looks like a missing feature ;-)

Slava Pestov said...

Maxim: yes, you can. Look at the words in math.private (fixnum+, float*, etc) and sequences.private (nth-unsafe, set-nth-unsafe).

Anonymous said...

yes, you can

Slave, it's interesting. Is there any document for the feature?

Anonymous said...

Oh, I'm sorry very much. i found that i mistyped Slava's name. Again i'm sorry.

Anonymous said...

Let me get this straight: if the Scheme standard specifies the primitives + (generic), fx+ (limited to fixnum inputs and result), and fl+ (limited to floating point inputs and result), and somebody comes and uses fx+ or fl+, and it happens that doing so improves performance, where exactly is the dishonesty in that? The Ikarus benchmarks as far as I can tell use no unsafe primitives, and no Ikarus-specific code: they're plain Scheme programs that would run in any R6RS implementation. The fact that Ikarus/Chicken/SBCL/whathaveyou are fast implementations of their respective languages is no reason for you to accuse anybody of dishonesty.

Anonymous said...

Here's a good discussion of getting idiomatic Haskell to be as fast as C.

See also this raytracer.