Monday, August 10, 2009

Global float unboxing and some other optimizations

In compiler implementation, "global" optimizations are those that apply to an entire procedure at a time, in contrast to "local" optimizations, which operate on individual basic blocks only. After some work, float unboxing is the latest optimization which is now done globally in Factor.

Factor's float unboxing optimization used to be part of local value numbering. Originally I planned on implementing global value numbering and trying to do float unboxing there, but since then I discovered a better way. I also improved the SSA coalescing pass I talked about last time, and implemented some optimizations which help compile the code in the math.vectors vocabulary more efficiently.

Global float unboxing

Instead of modeling float unboxing as an algebraic simplification problem (where unbox(box(x)) can be rewritten as x) I now model it as a representation problem, ie, the optimizer makes a global decision, for every virtual register, whether to store it as a tagged pointer, or an unboxed float.

When the low-level IR is built, no boxing or unboxing operations are inserted, and the compiler assumes all floating point values will end up unboxed. The new compiler.cfg.representations pass is responsible for inserting just enough boxing and unboxing operations to make the resulting code valid. This pass consists of several step.

Representations and register classes

A representation is like a low-level data type associated to a virtual register. At this stage, possible repesentations are tagged pointers, single-precision floats, and double-precision floats. Note that fixnums (which are stored directly inside a pointer, rather than being heap allocated) are considered as tagged pointers, and that single-precision floats are only used by the FFI. In the future, I will also have a representation type for untagged integers (so that Factor code can perform arithmetic on 32 and 64-bit numbers without boxing them) and various SSE vector types.

A register class is a set of machine registers that instructions can operate on. Every representation has a register class, and values in this representation must be stored in registers of that class. Multiple representations can share a register class; for example, single and double precision floats all go into floating point registers. The register allocator deals with register classes.

When the low-level IR is built, virtual registers do not have representations assigned to them. Instead, this mapping is established later, by the compiler.cfg.representations pass.

First step: compute possible representations for each virtual register

The first step to deciding what representation to use for a virtual register is to take stock of the possibilities. The compiler.cfg.representations.preferred vocabulary defines words which output, for each instruction, the preferred representation of the input values, and the preferred representation of the output value, if any.

For example, the ##slot instruction, which accesses an object's slot, prefers that its inputs and outputs be tagged pointers. On the other hand, ##add-float prefers that its inputs and outputs are unboxed floats (for otherwise, boxing is required), and ##integer>float takes an integer but outputs an unboxed float.

This information is collected into a single mapping, from virtual registers to sequences of representations.

Second step: computing costs for different representations

The next step is to compute the cost of keeping each virtual register in every possible representation for that register. The compiler uses a simple cost heuristic. The cost of keeping a register in a given representation R is obtained by iterating over all instructions that use that register, but expect to receive the value in a different representation from R. For each such instruction, its loop nesting depth is added up. Finally, if the representation R differs from the output representation of the instruction that computed the register, the loop nesting is added to the tally. This gives an approximation of the overhead of boxing and unboxing the value.

To calculate the loop nesting depth of an instruction, I use the standard natural loop detection algorithm. This is implemented in compiler.cfg.loop-detection. The details can be found in any moderately advanced compiler textbook; for each basic block, it identifies a set of loops this basic block belongs to (this set may be empty). A loop consists of a header, a set of more or more end blocks, and a set of blocks making up this loop. Multiple loops can share a header, but each loop can only have a single header because all control flow graphs that the low-level optimizer constructs are reducible (at least I think so).

Third step: representation selection

Once costs have been computed, it is easy to assign to each virtual register the representation that has the least cost. There is one subtlety here though; if an instruction outputs a tagged pointer, no other representation may be used for that virtual register. For example, in this code, it would not be sound to pick an unboxed float representation for the input parameter, since in one branch, it is used as a fixnum, and trying to perform a float unboxing on a fixnum will just crash:
: silly-word ( x -- y )
{ [ dup fixnum? ] [ 1 + ] }
{ [ dup float? ] [ 3 * ] }
} cond ;

The problem here is that in low-level IR, the ##peek instruction which loads the value from the data stack outputs a tagged value, and a priori the compiler cannot assume its going to be a boxed float that can be safely unboxed -- only the output of a float-producing instruction can be unboxed. A more complicated unboxing algorithm based on control-dependence was suggested to me by Daniel Ehrenberg. However, because the low-level IR is single-assignment, the potential overhead from the current unboxing scheme is not so great, and so I'm not worried about it right now.

Fourth step: inserting boxing and unboxing instructions

Once the compiler has determined the representation to use for each virtual register, it must now iterate over all instructions again, this time, inserting conversions whenever a virtual register is used with a representation other than the one that was chosen for it. This pass introduces new temporaries, and renames virtual registers used and defined in instructions, in order to preserve single assignment. Other than that, it is relatively straightforward.

Live-range coalescing revisited

Last time I mentioned that I had implemented a rather complex algorithm for converting out of SSA form and doing copy coalescing at the same time. After having some problems getting the algorithm working in certain corner cases, I decided to scrap it and implement something simpler instead. My new SSA coalescing algorithm uses the classical interference graph approach, except it doesn't actually have to build the interference graph, since interference testing on SSA form can be done in O(1) time.

The register allocator used to do its own coalescing, to eliminate copies inserted by the two-operand conversion pass. The two-operand pass converts
x = y + z

x = y; x = x + z

(and similarly for all other arithmetic operations). This makes arithmetic operations fit the x86 instruction encoding scheme, where the destination register is also the first register. Implementing copy coalescing twice, in two different ways, seemed inelegant to me. A big breakthrough was when I was able to get my SSA coalescing algorithm working on a relaxed SSA form, where values can be reassigned more than once, but only within the same basic block where they are defined. This means that the two-operand conversion pass can run before SSA coalescing, and the register allocator does not need to perform coalescing at all.

The new coalescing algorithm was not noticably slower than the old one (it still benefits from not having to construct an interference graph), the code is much simpler, and the register allocator's copy coalescing code is gone. The compiler feels cleaner as a result.

Improved tuple unboxing

I implemented escape analysis for tuple unboxing last year and recently came up with a way to improve it. Whereas previously, only freshly-allocated tuples could be unboxed, I realized that if an input parameter to a word is known to be an instance of an immutable tuple class, and the input parameter does not escape (ie, not passed to a subroutine call, or returned) then it can be unboxed, too. Whereas unboxing a freshly-allocated tuple involves deleting the allocation call, and replacing slot access with references to the inputs to the allocation call, unboxing an input parameter is a bit trickier. Actual code has to be inserted at the beginning of the word, which unpacks the tuple. This optimization is only a win if the same slot is accessed more than once within a word; in particular, its a win if the slot access occurs inside a loop. This situation came up with specialized arrays; a specialized array is a tuple with two slots, a byte array, and a length. To access the nth element of a specialized array involves an indirection, because the byte array has to be retrieved. By unboxing the specialized array before a loop, a slot access is eliminated from the loop. This gives the high-level optimizer a form of loop-invariant code motion (real loop-invariant code motion might appear in the low-level optimizer at some point, too).

Vector operations on specialized arrays

I used to use hints to make the math.vectors vocabulary faster. While vector words could be used on any sequence type, hints would automatically compile a specialized version of each vector word for the double-array type. This gave a nice speedup on benchmarks which performed vector operations on arrays of double-precision floats, but it didn't help at all for vector operations on other specialized arrays, and there was still at least one runtime dispatch per vector operation, to determine if the generic or hinted version should be used.

While hints have their place, I don't use them for vector words anymore. Instead, I cooked up some special compiler extensions for them (easier than it sounds; the compiler is pretty extensible, at least the pass that I needed to extend for this). Now, every specialized array type compiles a specialized set of vector words, and if the input types of a vector word are known at compile time, the right specialized version is used instead. Of course there's no actual code duplication here, the code is generated at runtime.

Benchmark results

The above improvements, as well as the changes I outlined in my previous post, give a nice speedup since the last time I did some benchmarks.
BenchmarkMay 31stJuly 17Aug 10
benchmark.backtrack 1.767561 1.330641 1.111011
benchmark.base64 1.997951 1.738677 1.536447
benchmark.beust1 2.765257 2.461088 2.360084
benchmark.beust2 3.584958 1.694427 1.691499
benchmark.binary-search 1.55002 1.574595 1.585475
benchmark.binary-trees 1.845798 1.733355 1.771849
benchmark.bootstrap1 10.860492 11.447687 9.962971
benchmark.dawes 0.229999 0.161726 0.104693
benchmark.dispatch1 2.015653 2.119268 2.252401
benchmark.dispatch2 1.817941 1.216618 1.629185
benchmark.dispatch3 2.568987 1.899128 2.692071
benchmark.dispatch4 2.319587 2.032182 2.073563
benchmark.dispatch5 2.346744 1.614045 1.369721
benchmark.empty-loop-0 0.146716 0.12589 0.12608
benchmark.empty-loop-1 0.430314 0.342426 0.351241
benchmark.empty-loop-2 0.429012 0.342097 0.350182
benchmark.euler150 16.901451 15.288867 13.046828
benchmark.euler186 8.8054349999999997.920478 7.997483
benchmark.fannkuch 3.202698 2.964037 2.859117
benchmark.fasta 5.52608 4.934112 5.135706
benchmark.gc0 2.15066 1.993158 2.03638
benchmark.gc1 4.984841 4.961272 5.075487
benchmark.gc2 3.327706 3.265462 3.376618
benchmark.iteration 3.736756 3.30438 3.22603
benchmark.javascript 9.79904 9.164517 9.000165000000001
benchmark.knucleotide 0.282296 0.251879 0.231547
benchmark.mandel 0.125304 0.123945 0.123321
benchmark.md5 0.946516 0.85062 0.84516
benchmark.nbody 3.982774 3.349595 2.204085
benchmark.nested-empty-loop-10.116351 0.135936 0.053609
benchmark.nested-empty-loop-20.692668 0.438932 0.390032
benchmark.nsieve 0.714772 0.698262 0.694443
benchmark.nsieve-bits 1.451828 0.907247 0.727394
benchmark.nsieve-bytes 0.312481 0.300053 0.225218
benchmark.partial-sums 1.205072 1.221245 1.152942
benchmark.random 2.574773 2.706893 2.403702
benchmark.raytracer 3.481714 2.914116 2.428384
benchmark.recursive 5.964279 3.215277 3.178855
benchmark.regex-dna 0.132406 0.093095 0.08619400000000001
benchmark.reverse-complement 3.811822 3.257535 3.003156
benchmark.ring 1.756481 1.79823 1.696271
benchmark.sha1 2.267648 1.473887 1.43003
benchmark.sockets 8.7942800000000018.783398 8.643011
benchmark.sort 0.421628 0.363383 0.367602
benchmark.spectral-norm 3.830249 4.036353 3.277558
benchmark.stack 2.086594 1.014408 0.91959
benchmark.sum-file 0.528061 0.422194 0.383183
benchmark.tuple-arrays 0.127335 0.103421 0.08908199999999999
benchmark.typecheck1 0.876559 0.6723440000000001 0.744205
benchmark.typecheck2 0.878561 0.671624 0.770056
benchmark.typecheck3 0.86596 0.670099 0.699071
benchmark.ui-panes 0.426701 0.372301 0.369152
benchmark.xml 2.351934 2.187999 1.630145


spaans in spanje said...

That's great analysis about unboxing..You described better about optimization techniques.

George said...

What is the speed comparison to SBCL.

Sprachaufenthalt England said...

Ya George..

Great comparison with good explanation.

Jeremy Jones said...

Three and 4 posts i have went through on this blog but only here i want to comment coz of the material on the post. Great skills of comparison..

IELTS ireland