Friday, July 17, 2009

Improved value numbering, branch splitting, and faster overflow checks

Hot on the heels of the improved register allocator, I've added some new optimizations to the low-level optimizer and made existing optimizations stronger.

Improved value numbering

Local value numbering now detects more congruences among values computed by low-level IR instructions. All of the changes were in either the rewrite step or the simplify step.
  • Some additional arithmetic identities. For instance, the result of adding 0 to, or subtracting 0 from, a value is congruent to the value itself. The high-level optimizer implements the same optimization on Factor's generic arithmetic words, but sometimes low-level optimizations introduce additional redundancies.
  • A ##add instruction where one of the two operands is the result of a ##load-immediate is now converted into a ##add-imm, saving a register. A similar identity holds for other commutative operations, such as ##mul, ##and, ##or, ##xor, and ##compare with a condition code of cc=.
  • Constant folding. Arithmetic instructions where both operands are constants are replaced by a ##load-immediate containing the result. A special case is a subtraction where both operands are congruent; this is replaced by a load of 0. Again, the high-level optimizer performs constant folding also, but low-level optimizations sometimes introduce redundancies or expose opportunities for optimization which were not apparent in high-level IR.
  • Reassociation. If an ##add-imm instruction's first input is the result of another ##add-imm, then a new ##add-imm is made which computes the same result with a single addition. Algebraically, this corresponds to the identity (x + a) + b = x + (a + b), where a and b are known at compile-time. Similar transformations are made for other operations, such as ##mul, ##and, ##or and ##xor, as well as ##sub, via the analogous identity (x - a) - b = x - (a + b).
  • Constant branch folding. If both inputs to a ##compare or ##compare-branch are constant, or if both are congruent to the same value, then the result of the comparison can be computed at compile-time, and one of the two successor blocks can be deleted.
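The rewrite rules above can be sketched in a few lines. The following is a toy model in Python, not Factor's actual implementation: expressions are nested tuples, constants are plain integers, and the operator names are illustrative stand-ins for the ##add/##sub/##mul instructions.

```python
# Toy sketch (Python, not Factor) of the simplify step of local value
# numbering. An expression is either an int (a known constant), a string
# (an opaque virtual register), or an (op, x, y) tuple.

def simplify(op, x, y):
    const_x, const_y = isinstance(x, int), isinstance(y, int)

    # Constant folding: both operands known at compile time.
    if const_x and const_y:
        return {"+": x + y, "-": x - y, "*": x * y}[op]

    # Arithmetic identities: adding or subtracting 0 is a no-op.
    if op in ("+", "-") and y == 0:
        return x
    if op == "+" and x == 0:
        return y

    # x - x is congruent to 0 even when x is unknown.
    if op == "-" and x == y:
        return 0

    # Reassociation: (x + a) + b => x + (a + b) when a and b are constants.
    if op == "+" and const_y and isinstance(x, tuple) and x[0] == "+" \
            and isinstance(x[2], int):
        return ("+", x[1], x[2] + y)

    return (op, x, y)
```

For example, `simplify("+", ("+", "v1", 2), 3)` reassociates to `("+", "v1", 5)`, and `simplify("-", "v1", "v1")` folds to the constant 0 even though v1's value is unknown.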

While the high-level optimizer performs constant folding and unreachable code elimination as part of the SCCP pass, low-level optimizations which are done before value numbering can expose optimization opportunities which were invisible in high-level IR. For example, consider the following code:
[ { vector } declare [ length ] [ length ] bi < ]
Using the optimized. word, we can see what the high-level optimizer transforms this into:
[ dup >R 3 slot R> 3 slot fixnum< ]

The high-level optimizer only attempts to reason about immutable slots, but a vector's length is mutable, since elements may be added to a vector, and so the high-level optimizer cannot detect that the two inputs to fixnum< are equal. After conversion into low-level IR, the code becomes the following sequence of SSA instructions:
_label 0 f 
##prologue f
_label 1 f
##peek V int-regs 1 D 0 f
##copy V int-regs 4 V int-regs 1 f
##slot-imm V int-regs 5 V int-regs 4 3 7 f
##copy V int-regs 8 V int-regs 1 f
##slot-imm V int-regs 9 V int-regs 8 3 7 f
##copy V int-regs 10 V int-regs 5 f
##copy V int-regs 11 V int-regs 9 f
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f

Now, the low-level alias analysis optimization is able to detect that the second ##slot-imm is redundant, and it transforms the code into the following:
_label 0 f 
##prologue f
_label 1 f
##peek V int-regs 1 D 0 f
##copy V int-regs 4 V int-regs 1 f
##slot-imm V int-regs 5 V int-regs 4 3 7 f
##copy V int-regs 9 V int-regs 5 f
##copy V int-regs 10 V int-regs 5 f
##copy V int-regs 11 V int-regs 9 f
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f

Now, notice that both inputs to the ##compare instruction (vreg #10 and vreg #11) are copies of the same value, vreg #5. Value numbering detects this congruence since copy propagation is just a special case of value numbering; it then simplifies the comparison down to a constant load of f, since for any integer x, we have x < x => false. After dead code elimination, the result is the following:
_label 0 f 
##prologue f
_label 1 f
##load-immediate V int-regs 12 5 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f
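
The way copy propagation falls out of value numbering can be sketched like this. This is an illustrative Python model, not the compiler's code: a ##copy simply gives its destination the same value number as its source, so both inputs to the ##compare end up congruent and the comparison folds to a constant.

```python
# Sketch (Python, illustrative only): copy propagation as a special case of
# value numbering. Instructions are tuples; vregs are strings.

def value_number(instructions):
    vn = {}              # vreg -> value number
    next_vn = [0]

    def number(vreg):
        if vreg not in vn:
            vn[vreg] = next_vn[0]
            next_vn[0] += 1
        return vn[vreg]

    out = []
    for insn in instructions:
        if insn[0] == "copy":
            _, dst, src = insn
            vn[dst] = number(src)          # dst is congruent to src
        elif insn[0] == "compare<":
            _, dst, a, b = insn
            if number(a) == number(b):     # x < x is always false
                out.append(("load-immediate", dst, False))
                continue
            number(dst)
            out.append(insn)
        else:
            out.append(insn)
    return out
```

Running this on the copy chain from the listing above (v9, v10 and v11 all copies of v5) reduces the comparison to a constant load, mirroring what value numbering plus dead code elimination produce.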

While no programmer would explicitly write such redundant code, it comes up after inlining. For example, the push word, which adds an element at the end of a sequence, is implemented as follows:
: push ( elt seq -- ) [ length ] [ set-nth ] bi ;

HINTS: push { vector } { sbuf } ;

The hints tell the compiler to compile specialized versions of this word for vectors and string buffers. While the generic version works on any sequence, a hinted version is used when the input types match, and performance can improve because generic words (in this case, length and set-nth) are inlined. After a round of inlining, a comparison arises between the insertion index and the vector's length; since the insertion index is the length in this case, value numbering can now optimize away a conditional which could not be optimized out before. Of course, I could have just written a specialized version of push for vectors (or even built it into the VM), but I'd rather implement general optimizations and write my code in a high-level style.

I'd like to thank Doug Coleman for implementing some of the above improvements.

Branch splitting

Roughly speaking, branch splitting is the following transform:
[ A [ B ] [ C ] if D ] => [ A [ B D ] [ C D ] if ]

It is only run if D is a small amount of code. Branch splitting runs before value numbering, so value numbering's constant branch folding can help simplify the resulting control flow. Branch splitting is implemented in the compiler.cfg.branch-splitting vocabulary.
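The CFG-level transform can be sketched as follows. This is a hedged Python model, not the actual compiler.cfg.branch-splitting code: a small block D with several predecessors is copied into each predecessor, so that each branch can later be simplified independently by value numbering. The size threshold is an assumption for illustration.

```python
# Toy branch splitting on a CFG. blocks: name -> list of instructions;
# preds: name -> list of predecessor block names.

MAX_SPLIT_SIZE = 4  # illustrative threshold: only split small blocks

def split_branches(blocks, preds):
    for name, body in list(blocks.items()):
        if 1 < len(preds[name]) and len(body) <= MAX_SPLIT_SIZE:
            for p in preds[name]:
                blocks[p] = blocks[p] + body   # append a copy of D to each branch
            del blocks[name]                   # the shared block disappears
    return blocks
```

After splitting, each predecessor ends with its own copy of D's code, which is exactly the shape that lets constant branch folding delete one arm entirely.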

For example, suppose we have the following code:
[ [ t ] [ f ] if [ 1 ] [ 2 ] if ]

The high-level optimizer is unable to simplify this further, since it does not work on a control flow graph. The low-level optimizer builds a control flow graph with the following shape:

The basic block in the middle is a candidate for splitting since it has two predecessors, and it is short. After splitting, the result looks like so:

Now, at this point, the only thing that this has achieved is eliminating one unconditional jump at the expense of code size. However, after stack analysis and value numbering run, some redundant work is eliminated:

One example of real code that benefits from branch splitting includes code that calls the = word followed by a conditional. The = word is inlined, and it has a conditional inside of it; some branches of the conditional push constants t and f, and others call non-inline words. The result is that the branches which push a constant boolean can directly jump to the destination block without the overhead of storing and testing a boolean value.

Once again thanks to Doug Coleman for collaborating with me on the implementation of branch splitting.

Block joining

Block joining is a simple optimization which merges basic blocks connected by a single control flow edge. Various low-level optimizations leave the control flow graph with a large number of empty or very small blocks with no conditional control flow between them. Block joining improves the effectiveness of local optimizations by creating larger basic blocks.
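The merge condition can be sketched as follows, again as an illustrative Python model rather than the compiler.cfg.block-joining code: whenever block A's only successor is B and B's only predecessor is A, the edge carries no control flow decision and the two blocks can be fused.

```python
# Toy block joining. blocks: name -> instructions; succs/preds: name -> names.
# Assumes a well-formed CFG with no self-loops.

def join_blocks(blocks, succs, preds):
    changed = True
    while changed:
        changed = False
        for a in list(blocks):
            s = succs.get(a, [])
            if len(s) == 1 and s[0] != a and len(preds[s[0]]) == 1:
                b = s[0]
                blocks[a] += blocks.pop(b)       # append B's code onto A
                succs[a] = succs.pop(b, [])      # A inherits B's successors
                for c in succs[a]:
                    preds[c] = [a if p == b else p for p in preds[c]]
                changed = True
                break
    return blocks
```

Iterating to a fixed point collapses whole chains of straight-line blocks into one, which is what removes the empty blocks left behind by branch splitting.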

In the above example for branch splitting, block joining will eliminate the four empty basic blocks that remain after optimization.

Block joining is implemented in the compiler.cfg.block-joining vocabulary.

Faster overflowing integer arithmetic intrinsics

In the lucky cases where the compiler can eliminate overflow checks, the various math operations become single instructions in the low-level IR which eventually map directly to machine instructions. In the general case, however, an overflow check has to be generated. In the event of overflow, a bignum is allocated, which may in turn involve running the garbage collector, so it is quite an expensive operation compared to a single machine arithmetic operation.

Previously, the overflowing fixnum operations were represented in low-level IR as single, indivisible units, and just like subroutine calls and returns, they were "sync points" which meant that all values in registers had to be saved to the data and retain stacks before, and reloaded after. This is because of how these operations were implemented; they would perform the arithmetic, do an overflow check, and in the event of overflow, they would invoke a subroutine which handled the overflow. Keeping registers live across this operation was not sound in the event of overflow, since the subroutine call would clobber them.

No longer. Now, an overflowing fixnum operation breaks down into several nodes in the control flow graph, and the arithmetic part, the overflow check and the bignum allocation are represented separately. In particular, the code to save registers to the stack and reload them after is only generated in the overflow case, so in the event of no overflow, which happens much more frequently, execution can "fall through" and continue using the same registers as before.

The code that generates low-level instructions and control flow graph nodes for overflowing fixnum operations is defined in the compiler.cfg.intrinsics.fixnum vocabulary.

Faster integer shifts

This last optimization is a very minor one, but it made a difference on benchmarks. Previously shifts with a constant shift count and no overflow check would compile as a machine instruction, and all other forms of shifts would invoke subroutine calls. Now, machine instructions are generated in the case where the shift count is unknown, but the result is still known to be small enough not to require an overflow check.
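The decision can be sketched as a simple bit-width test. This is a hedged illustration with assumed names, not the compiler's actual interval analysis: if the operand's known bit width plus the maximum possible shift count still fits in a fixnum, no overflow check is needed and a plain machine shift suffices.

```python
# Sketch: when can a variable-count shift skip the overflow check?
# FIXNUM_BITS = 61 is an assumed fixnum payload width.

FIXNUM_BITS = 61

def shift_needs_overflow_check(value_bits, max_shift):
    # the shifted result needs at most value_bits + max_shift bits
    return value_bits + max_shift >= FIXNUM_BITS
```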

Benchmark results

Here are the results of running Factor's benchmark suite on a build from 31st May 2009, before I started working on the current set of low-level optimizer improvements (which includes the register allocator I blogged about previously), and from today.
benchmark.backtrack 1.767561 1.330641
benchmark.base64 1.997951 1.738677
benchmark.beust1 2.765257 2.461088
benchmark.beust2 3.584958 1.694427
benchmark.binary-search 1.55002 1.574595
benchmark.binary-trees 1.845798 1.733355
benchmark.bootstrap1 10.860492 11.447687
benchmark.dawes 0.229999 0.161726
benchmark.dispatch1 2.015653 2.119268
benchmark.dispatch2 1.817941 1.216618
benchmark.dispatch3 2.568987 1.899128
benchmark.dispatch4 2.319587 2.032182
benchmark.dispatch5 2.346744 1.614045
benchmark.empty-loop-0 0.146716 0.12589
benchmark.empty-loop-1 0.430314 0.342426
benchmark.empty-loop-2 0.429012 0.342097
benchmark.euler150 16.901451 15.288867
benchmark.euler186 8.805435 7.920478
benchmark.fannkuch 3.202698 2.964037
benchmark.fasta 5.52608 4.934112
benchmark.gc0 2.15066 1.993158
benchmark.gc1 4.984841 4.961272
benchmark.gc2 3.327706 3.265462
benchmark.iteration 3.736756 3.30438
benchmark.javascript 9.79904 9.164517
benchmark.knucleotide 0.282296 0.251879
benchmark.mandel 0.125304 0.123945
benchmark.md5 0.946516 0.85062
benchmark.nbody 3.982774 3.349595
benchmark.nested-empty-loop-1 0.116351 0.135936
benchmark.nested-empty-loop-2 0.692668 0.438932
benchmark.nsieve 0.714772 0.698262
benchmark.nsieve-bits 1.451828 0.907247
benchmark.nsieve-bytes 0.312481 0.300053
benchmark.partial-sums 1.205072 1.221245
benchmark.pidigits 16.088346 16.159176
benchmark.random 2.574773 2.706893
benchmark.raytracer 3.481714 2.914116
benchmark.recursive 5.964279 3.215277
benchmark.regex-dna 0.132406 0.093095
benchmark.reverse-complement 3.811822 3.257535
benchmark.ring 1.756481 1.79823
benchmark.sha1 2.267648 1.473887
benchmark.sockets 8.794280 8.783398
benchmark.sort 0.421628 0.363383
benchmark.spectral-norm 3.830249 4.036353
benchmark.stack 2.086594 1.014408
benchmark.sum-file 0.528061 0.422194
benchmark.tuple-arrays 0.127335 0.103421
benchmark.typecheck1 0.876559 0.672344
benchmark.typecheck2 0.878561 0.671624
benchmark.typecheck3 0.86596 0.670099
benchmark.ui-panes 0.426701 0.372301
benchmark.xml 2.351934 2.187999

There are a couple of regressions I need to look into but for the most part the results look pretty good. I also expect further gains to come with additional optimizations that I plan on implementing.

Next steps

I'm going to keep working on the code generator for a little while longer. First of all, compile time has increased, so that needs to improve. Next, I'm going to implement better global optimization. At this point, values are stored in registers between basic blocks, unlike with the old code generator. However, loop variables are still stored on the stack between iterations because the analysis does not handle back edges yet. Fixing this, and making float unboxing work globally, is my next step. After that, I plan on adding support for unboxed word-size integers (32 or 64 bits, depending on the platform) and some intrinsics for SSE2 vector operations on Intel CPUs. Next, the register allocator needs improved coalescing logic, and it also needs to support fixed register assignments, as found in some x86 instructions which take implicit operands. Finally, I have a few optimizations I want to add to the high-level optimizer.
