- Partial dispatch is performed on integer arithmetic operations. Previously, Factor's compiler would convert generic arithmetic to machine arithmetic when it knew the exact types involved; so if you had two floats on the stack,
float+, and if you had two fixnums it would become
fixnum+fast, depending on whether interval inference determined if the overflow check was needed or not. While this worked well in many cases there were a lot of instances where the compiler could only infer you were dealing with integers, but not fixnums in particular; either because interval inference was not smart enough, or because the values really could be out of bounds for a fixnum. Now, if it knows that both inputs are integers, it compiles a call to a special word which still performs a dispatch, but with only two possibilities. This is a win, because a conditional is faster than a jump table as used in the generic
+. It is an even bigger win if one of the inputs is known to be a fixnum, because then the two jump table dispatches are replaced by a single conditional.
- Improved overflow check elimination. First, there is an enabling optimization. Suppose we have a positive integer on the stack. Then, the following two are equivalent:
The optimizer now recognizes the case where mod is applied with a power of two to a positive integer, and converts it to a bitwise and. For the
remword an even more general optimization is possible; we only have to assume the first input is an integer and the second is a power of two.
While on a modern CPU, this is not a big in itself for
mod(it is for
rem, which is more expensive), it does enable the following optimization. Note that the following two expressions are equivalent:
16777215 bitand 4095 bitand
That is, if you mask off the first
nbits, then the first
m<n, you get the same result as masking off the first
mbits. This might seem trivial and useless and first, but then you realize that a truncating conversion of a positive bignum to a fixnum is simply masking off bits! So the following two are equivalent:
>fixnum 4095 bitand
But of course, both inputs to the latter
bitandare fixnums now, and can be converted to:
>fixnum 4095 fixnum-bitand
It gets even better. Consider the following piece of code from
: next ( t -- new-t s )
615949 * 797807 + 20 2^ rem dup 19 2^ - ;
Constant folding gives us:
: next ( t -- new-t s )
615949 * 797807 + 1048576 rem dup 524288 - ;
The above optimization, together with an overflow removal on the
-, gives us:
: next ( t -- new-t s )
15949 * 797807 + >fixnum 1048575 fixnum-bitand dup 524288 fixnum-fast ;
There is an existing optimization takes advantage of these identities:
* >fixnum == [ >fixnum ] bi@ fixnum*fast
+ >fixnum == [ >fixnum ] bi@ fixnum+fast
By applying the above, the compiler converts this code to:
: next ( t -- new-t s )
>fixnum 15949 fixnum*fast 797807 fixnum+fast 1048575 fixnum-bitand dup 524288 fixnum-fast ;
All generic arithmetic and overflow checks have been removed, because of a single
remin the middle of the calculation!
project-euler.150, this word was actually used inside a loop, where it was iteratively applied to a starting generator value. With the new modular arithmetic optimization, Factor's existing interval inference code managed to infer the result of the word is always in the interval
(-2^20,2^20), and since the initial value is
0, even the call to
>fixnumwas optimized away.
- I improved Factor's type inference algorithm to better deal with local recursive code blocks. While type inference worked okay with loops, it didn't fare so well with binary recursive functions because it wasn't able to obtain any information about return values of the recursive calls. Everybody's favorite binary recursive benchmark, fibonacci, was one example of this situation. Solving this required a bit of thought.
Previously, type inference for recursive functions was done in a pretty ad-hoc manner. Factor's type inference is only used for optimization so it is okay if it is conservative. It worked pretty well, however today I decided to add better support for recursion, and I also found a bug where it would infer invalid types, so I decided to redo it a little.
Type inference of recursive functions is now an iterative process, where you begin by assuming the recursive calls take the top type as inputs and the bottom type as outputs, then you infer the type of the body under these assumptions, then apply the resulting types to the recursive calls, then infer again, and so on, until a fixed point is reached. There are smarter ways of doing this with other type systems however in Factor, the type functions of some words are pretty complex. For example, the output type of
+depends on not only the types of its inputs, but the interval ranges they lie in, so I'm not sure if there's any other solution.
With this out of the way, there is one major remaining limitation in Factor's type inference, and that is it only works within words. It does not attempt to infer types across word boundaries. In practice this means many optimizations only kick in if a lot of words are inlined, which is not practical in some cases due to code size explosion. My next project in this area is to cache type signatures for words and use this information when inferring types of callers.
- I improved performance of inline object allocation. In the VM, the structure holding the allocation pointer is now stored directly in a global variable, instead of a pointer to it being stored in the global variable. This is one less indirection for inline allocators. Another improvement is that the heap exhaustion check is done in the allocator itself, so a call into the VM is avoided if the heap is not full. This saves a subroutine call in the common case of course, but also some saving and restoring of registers.
These changes have let to some performance improvements on the two benchmarks I was working with. My computer here is a MacBook Pro with a 2.4 GHz Intel Core Duo 2.
project-euler.150benchmark saw its runtime decrease from 87 seconds to 22 seconds.
benchmark.recursivebenchmark, which is a Factor implementation of the recursive benchmark from the Computer Language Shootout, saw its runtime decrease from 27 seconds to 9 seconds.
For comparison, SBCL runs the recursive benchmark in 3.5 seconds, and Java runs it in 1.6 seconds.
java -Xintresult of 22 seconds, I guesstimated from the results for all languages on the shootout that Factor at around the performance of the Erlang HIPE JIT, and slightly faster than the Python Psyco JIT and GForth. By anybody's standard, this is anywhere between "not very fast" and "bloody slow", but its slowly improving.
Now I'll end with a rant about the language shootout.
The only so-called "dynamic" language implementations which come close to the performance of Java and C on this benchmark are SBCL, Ikarus Scheme, and Chicken Scheme. However, all these benchmarks are actually static programs in disguise, peppered with type declarations and even unsafe low-level features. The Ikarus and Chicken versions even implement the same functions twice, once for integer inputs and using integer arithmetic primitives, and another time for float inputs and using float arithmetic primitives. Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest.
The Haskell benchmarks suffer from this also. Many benchmarks seem to use malloc and pointer arithmetic, and exclamation points (strictness annotations) abound. The n-body benchmark contains this gem:
planets :: Ptr Double
planets = unsafePerformIO $ mallocBytes (7 * nbodies * 8) -- sizeOf double = 8
I would expect better from the Haskell people than hardcoding the size of a double to do pointer arithmetic on manually-managed memory!
Right now I have no idea how idiomatic Haskell performs in practice; from looking at these benchmarks, it is clear to me that if one writes C in Haskell, pretty decent performance is possible, but that's not saying much. You can write C in any language.
The runtime of the Factor recursive benchmark further improves by two-fold if I manually replace generic arithmetic with unsafe machine arithmetic, however I'm not interested in writing such code. I don't want to expose low-level primitives to the user, document their use as necessary for performance critical code, and declare that my job is done.