Factor: a practical stack language: July 2009

Thursday, July 30, 2009

Dataflow analysis, computing dominance, and converting to and from SSA form

As I mentioned in my previous post, I'm working on having the compiler backend take advantage of our fancy new register allocator by making better use of registers than we do now.

As with the previous post, there are lots of links to Factor code here. Also, since much of what I implemented comes from academic papers, I've linked to the relevant literature also.

Over the course of the last year, Factor went from a rudimentary scheme where some intermediate values are placed in registers, to local register allocation for all temporaries in a basic block, to the current system where registers can remain live between basic blocks and values are only stored on the data stack when absolutely necessary; subroutine calls, returns, and points in the procedure from where the value will not be used again.

Before I dive in to the technical details, here is a taste of what the new code generator can do. The following Factor word,

: counted-loop-test ( -- ) 1000000000 [ ] times ;

Compiles to the following x86-64 machine code:

000000010c73b2a0: 48bd0050d6dc01000000  mov rbp, 0x1dcd65000
000000010c73b2aa: 4831c0                xor rax, rax
000000010c73b2ad: 4983c610              add r14, 0x10
000000010c73b2b1: e904000000            jmp 0x10c73b2ba
000000010c73b2b6: 4883c008              add rax, 0x8
000000010c73b2ba: 4839e8                cmp rax, rbp
000000010c73b2bd: 0f8cf3ffffff          jl dword 0x10c73b2b6
000000010c73b2c3: 4983ee10              sub r14, 0x10
000000010c73b2c7: c3                    ret

General overview

The general structure of the low-level optimizer still remains, along with the optimizations it performs -- alias analysis, value numbering, dead code elimination, and so on. However, what has changed in a major way is how the low-level IR is constructed, and how it is gets converted out of SSA. There are also two new abstractions used by the compiler; a dataflow analysis framework, and a meta-programming utility for renaming words.

In the low-level IR, "peek" and "replace" instructions are used to read and write stack locations, making stack traffic explicit. Peek instructions output a virtual register, and replace instructions take a register as input. All other instructions operate on virtual registers, not stack locations.

When building the control flow graph, the compiler.cfg.builder vocabulary used to insert "peek" and "replace" instructions at every use of a stack location, and subsequent redundancy elimination passes would get rid of the ones that are deemed redundant -- where either the relevant stack location was in a register already (peek after peek or peek after replace), or because it would be overwritten before being read again (replace after replace).

Now, the CFG builder doesn't insert peeks and replaces at all, and simply associates with each basic block a set of stack locations that it reads, and a set of stack locations that it writes. For each stack location, there is a globally unique virtual register which is used for it; instructions which need a stack location simply refer to that fixed virtual register (or assign to it). The last step of the CFG builder runs a dataflow analysis and actually inserts peek and replace instructions on the right edges in the CFG, mostly to ensure the invariant that values are saved to the stack between subroutine calls, that all values that are needed from the stack get loaded at some point, and that everything it saved to the stack eventually before the procedure returns. The inserted peeks and replaces reference the stack location's global fixed virtual register.

However, one thing in the above construction is that the output of the CFG builder is no longer in SSA form, like it used to be. However, the problem of converting applicative languages into SSA form is well-known, and so now I have an explicit SSA construction pass which runs after the CFG builder, before any other optimizations which operate on values. After optimizations, SSA form is eliminated by converting phi instructions into copies, at which point the result is passed on to the machine register allocator.

Dataflow analysis

The wikipedia page on dataflow analysis gives a good overview of the topic. Simple dataflow analyses with a single direction of flow all look quite similar, and there are many ways to abstract out the duplication. Since my code to convert stack-based code into low-level IR requires four dataflow analysis passes, and register allocation and SSA construction perform liveness analysis, I needed to eliminate the 5x code duplication that would result from naive implementation.

I went with an object-oriented approach, where a set of generic words are defined, together with a top-level driver word which takes an instance supporting these generic words. The generic words compute local sets, implement the join operation, and determine the direction of flow. The code is defined in the compiler.cfg.dataflow-analysis vocabulary. In addition to the generic words, I also define a couple of parsing words which create a new instance of the dataflow analysis class, together with some supporting boilerplate for looking up basic block's in and out sets after analysis has run. The actual content of the sets propagated depends on the analysis itself; liveness deals with virtual registers, and stack analysis deals with stack locations.

Because of this framework, adding new analyses is very easy. Liveness analysis, implemented in the compiler.cfg.liveness vocabulary, is only a few lines of code.

The dataflow analysis framework does not take SSA phi instructions into account, because stack analysis operates on stack locations (which are not single assignment) and liveness analysis is performed in SSA construction, when no phi instructions have been added yet, as well as register allocation, after SSA destruction.

Renaming values

Multiple compiler passes -- SSA construction, copy propagation, SSA destruction, and register allocation -- need to traverse over instructions and rename values. The first three replace uses of virtual registers with virtual registers, and the latter replaces virtual registers with physical registers prior to machine code generation. To avoid duplicating code, I wrote a utility which takes a hashtable of value mappings, and renames all values used in an instruction. There was a generic word with a method for each instruction. However this approach was insufficient for SSA construction, where the renaming set cannot be precomputed and instead changes at every instruction. So a better abstraction would be one that takes a quotation to apply to each value. However, dynamically calling a quotation for every slot of every instruction would be expensive, and inlining the entire renaming combinator at every use would be impractical. Instead, I used Factor's "functors" feature to write a parsing word which generates renaming logic customized to a particular use. This is defined in the compiler.cfg.renaming.functor vocabulary, and one usage among many is in compiler.cfg.renaming, which implements the old behavior whereby a single hashtable of renamings was input. Here is how this utility is used in SSA construction for instance:

RENAMING: ssa-rename [ gen-name ] [ top-name ] [ ]

This defines words ssa-rename-defs, ssa-rename-uses and ssa-rename-temps (the latter being a no-op). The first two apply gen-name and top-name to each value to compute the new value.

The register allocator does something similar in compiler.cfg.linear-scan.assignment:

RENAMING: assign [ vreg>reg ] [ vreg>reg ] [ vreg>reg ]

Here, words named assign-defs, assign-uses and assign-temps are defined, and they all perform the same operation on each instructions' defined values, used values and temporary values, respectively, by looking up the physical register assigned to each virtual register. This and other optimizations to linear scan significantly improved compile time, to the point where it has almost gone down to what it was before I started adding these fancy new optimizations.

Computing the dominator tree

The dominator tree is a useful concept for optimizations which need to take global control flow into account. At this point, I'm using them for SSA construction and destruction, but there are many other optimizations which depend on dominance information. The classical approach for computing dominance uses iterative dataflow analysis, but a better approach is given in the paper A Simple, Fast Dominance Algorithm by Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. I implemented the dominator tree computation described by the paper in the compiler.cfg.dominance vocabulary. The dominator tree is a tree with basic blocks as nodes; the entry block of the CFG as its root.

There are three operations that need to be performed, and this influences the tree representation:

Getting a node's parent in the tree, often referred to as the immediate dominator (sometimes, a convention is used where the immediate dominator of the root node is the root node itself).
Getting a node's children.
Testing if one node is an ancestor of another node (in this case we say that the first node dominates the second).

The algorithm in the paper computes a mapping from basic blocks to basic blocks, which gives the immediate dominator of every basic block. This lets us solve #1 in constant time. Since many algorithms need to look up children, I invert the immediate dominator mapping and turn it into a multimap (which in Factor, we just represent as an assoc from keys to sequences of values; the push-at utility word is helpful for constructing these). This gives us #2. For #3, the obvious approach is to walk up the tree from the second node, checking if the node at each step is the first node. However, this takes time proportional to the height of the tree, and the worst case is straight-line code with little control flow, where the dominator tree becomes a linked list. Another approach is to compute a hashed set of dominators for each node, which gives us constant time dominance checks, but the space usage is quadratic in the worst case so that's not ideal either.

A much better trick can be found in the paper on the SSA destruction algorithm which I will talk about below. I believe this technique goes back to the 70's and it is widely-known, but I did not learn about it until now. First, you perform a depth-first traversal of the dominator tree, incrementing a counter at each step. The first time you visit a node (on the way down), you assign the current counter value to the node's preorder value. The second time you visit a node (on the way up), you assign the current counter value to the node's maxpreorder value. What this does is number the nodes in preorder, and the maxpreorder of each node is the maximum of the preorder numbers of its children. Once these numbers have been precomputed, dominance checking can be done in constant time using the identity:

A dominates B iff preorder(A) >= preorder(B) & preorder(A) <= maxpreorder(B)

Of course, this works for any tree, and if you plan on doing many repeated child-of tests, it is worth precomputing the pre/maxpre numbers for each node in this manner. This addresses #3 above.

Here is a control flow graph, with the numbers denoting reverse post order on basic blocks:

and here is the corresponding dominator tree:

These diagrams were generated using the Graphviz tool together with the compiler.cfg.graphviz vocabulary).

The paper also gives an efficient algorithm for computing dominance frontiers, but I do not need those, for reasons given in the next section.

SSA construction

The classical approach for constructing SSA form yields what is known as minimal SSA involves three steps:

Computing dominance frontiers for each basic block in the control flow graph
For every value, take the set of basic blocks which have a definition of the value, compute the iterated dominance frontier of this set, and insert a phi instruction (with dummy inputs) in each member of the set
Walk the dominator tree, renaming definitions and usages in order to enforce single static assignment, updating phi instruction inputs along the way.

This approach has two three problems:

Computing iterated dominance frontiers is expensive, and this is done for every value defined in more than one block
Too many phi instructions are inserted, and most end up being eliminated as dead code later
The renaming algorithm, as originally specified, requires too much work to be done on the walk back up the dominator tree, with each block's instructions being traversed both on the way down and on the way up

Nevertheless, the algorithm is simple and easy to understand. It is explained in the original paper on SSA form, Efficiently computing static single assignment form and the control dependence graph; straightforward pseudocode can be found in these lecture notes.

The so-called "pruned SSA form" addresses the issue of too many phi instructions being inserted. It is a minor modification of minimal SSA construction. Prior to inserting phi instructions, liveness information is computed for each block. Then, a phi instruction is only inserted for a value if the value is live into the block. Computing liveness is somewhat expensive, and the so-called "semi-pruned SSA form" uses a simple heuristic to approximate liveness; phi nodes are only inserted for values which are used in blocks other than those they are defined in.

An algorithm for computing iterated dominance frontiers which does not require dominance frontiers to be computed first was described in the paper A Practical and Fast Iterative Algorithm for Phi-Function Computation Using DJ Graphs by
Dibyendu Das and U. Ramakrishna.

Finally, the paper introducing semi-pruned SSA form, titled Practical Improvements to the Construction and Destruction of Static Single Assignment Form, proposes a slightly more efficient renaming algorithm.

So the SSA construction algorithm I implemented in the compiler.cfg.ssa.construction vocabulary is a combination of these three papers. First, I compute merge sets using the DJ-Graph algorithm, then, I use liveness information for placing phi instructions, and finally, I use the improved renaming algorithm.

SSA destruction

To get out of SSA form and back to imperative code which can be executed by a machine, phi instructions must be eliminated. The approach originally described by Cytron et al gives incorrect results in many cases; the correct algorithm is well-known now but it introduces many additional copy instructions. The classical approach for eliminating copies ("copy coalescing") is to do it as part of register allocation; if two values are related by a copy but do not otherwise interfere, they can be assigned to the same physical register, and the copy can be deleted from the program. This works for a graph-coloring approach, but with linear scan, you're limited in how much global knowledge you can have while performing register allocation, and accurate interference information is difficult to obtain.

Factor's register allocation performs some basic coalescing, mostly to eliminate copies arising from conversion to two-operand form on x86. However, phi nodes introduce copies with more complex interferences and my approach didn't work there, so even though stack analysis eliminated many memory to register and register to memory moves, the generated code had a large number of register to register moves, which is a bottleneck for instruction decoding, not to mention wastes valuable cache space.

Instead of attempting to coalesce copies arising from phi instructions in the register allocator, a more modern approach is to do this as part of SSA destruction -- instead of converting phi instructions to copies, the goal is to avoid inserting as many copies as possible in the first place.

A nice algorithm for SSA destruction with coalescing is detailed in the paper Fast copy coalescing and live-range identification, by Zoran Budimlic et al. The algorithm is very clever -- the two key results are a constant-time interference test between SSA values using dominance and live range information, and a linear-time interference test in a set of variables using "dominance forests", which are easy to construct and enable you to rule out most pairs of values for interference tests altogether.

This algorithm is implemented in LLVM (lib/CodeGen/StrongPHIElimination.cpp) and I essentially ported the C++ code to Factor. You can find the code in the compiler.cfg.ssa.destruction vocabulary.

I tweaked the algorithm a little -- instead of inserting sequential copies, I insert parallel copies, as detailed in the paper Revisiting Out-of-SSA Translation for Correctness, Code Quality and Efficiency by Benoit Boissinot et al. The parallel copy algorithm is implemented in compiler.cfg.parallel-copy. I used it not only in SSA destruction, but also to clean up some hackish approximations of the same problem in global stack analysis and the linear scan register allocator's resolve pass. I didn't implement the rest of the paper, because I found it hard to follow; it claims to provide a better algorithm than Budimlic et al, but the latter is good enough for now, and having a working implementation in the form of the LLVM pass was very valuable in implementing it in Factor.

This pass was very effective in eliminating copy instructions; it generates 75% of copies than the old phi elimination pass, which simply used the naive algorithm.

Loop head rotation

The last optimization I added eliminates an unconditional branch in some loops. Consider the following CFG:

A valid way to linearize the basic blocks is in reverse post order, that is 0, 1, 2, 3, 4, 5. However, with this linearization, block 3 has an unconditional branch back to block 2, and block 2 has a conditional which either falls through to 3 or jumps to 4. So on every loop iteration, two jumps are executed (the conditional jump at 2 and the unconditional jump at 3). If, instead, the CFG was linearized as 0, 1, 3, 2, 4, 5, then while 1 would have an unconditional jump to 2, 2 would have a conditional jump back to 3 and 3 would fall through to 2. So upon entry to the loop, an extra unconditional jump (from 1 to 3) executes, but on each iteration, there is just the single conditional backward jump at 2. This improves performance slightly and is easy to implement; the code is in the compiler.cfg.linearization.order vocabulary, and I borrowed the algorithm from SBCL's src/compiler/control.lisp.

Conclusion

There are some performance regressions I need to work out, because global stack analysis introduces too many partial redundancies for some types of code, and inline GC checks are currently disabled because of an unrelated issue I need to fix. It will take a few more days of tweaking to sort things out, and then I will post some benchmarks. Early results are already very promising on benchmarks with loops; not just the trivial counted loop example above.

Next steps are global float unboxing, unboxed 32-bit and 64-bit integer arithmetic, and SSE intrinsics. To help with the latter, Joe Groff was kind enough to add support for all SSE1/2/3/4 instructions to Factor's x86 assembler. Finally, thanks to Cameron Zwarich for pointing me at some of the papers I linked to above.

Friday, July 17, 2009

Improved value numbering, branch splitting, and faster overflow checks

Hot on the heels on the improved register allocator, I've added some new optimizations to the low-level optimizer and made existing optimizations stronger.

Improved value numbering

Local value numbering now detects more congruences along values computed by low level IR instructions. All the changes were either in the rewrite or simplify steps.

Some additional arithmetic identities. For instance, the result of adding or subtracting 0 to a value is congruent to the value itself. The high-level optimizer implements the same optimization on Factor's generic arithmetic words, but sometimes low-level optimizations introduce additional redundancies.
A ##add instruction where one of the two operands is the result of a ##load-immediate is now converted into a ##add-imm, saving a register. A similar identity holds for other commutative operations, such as ##mul, ##and, ##or, ##xor, and ##compare with a condition code of cc=.
Constant folding. Arithmetic instructions where both operands are constants are replaced by a ##load-immediate containing the result. A special case is a subtraction where both operands are congruent; this is replaced by a load of 0. Again, the high-level optimizer performs constant folding also, but low-level optimizations sometimes introduce redundancies or expose opportunities for optimization which were not apparent in high-level IR.
Reassociation. If an ##add-imm instruction's first input is the result of another ##add-imm, then a new ##add-imm is made which computes the same result with a single addition. Algebraically, this corresponds to the identity (x + a) + b = x + (a + b), where a and b are known at compile-time. Similar transformations are made for other associative operations, such as ##sub, ##mul, ##and, ##or and ##xor.
Constant branch folding. If both inputs to a ##compare or ##compare-branch are constant, or if both are congruent to the same value, then the result of the comparison can be computed at compile-time, and one of the two successor blocks can be deleted.

While the high-level optimizer performs constant folding and unreachable code elimination as part of the SCCP pass, low-level optimizations which are done before value numbering can expose optimization opportunities which were invisible in high-level IR. For example, consider the following code:
[ { vector } declare [ length ] [ length ] bi < ]
Using the optimized. word, we can see what the high-level optimizer transforms this into:

[ dup >R 3 slot R> 3 slot fixnum< ]

The high-level optimizer only attempts to reason about immutable slots, but a vector's length is mutable since a vector may have elements added to it, and so the high-level optimizer cannot detect that the two inputs to fixnum< are equal. A conversion into low-level IR, the code becomes the following sequence of SSA instructions:

_label 0 f 
##prologue f 
_label 1 f 
##peek V int-regs 1 D 0 f 
##copy V int-regs 4 V int-regs 1 f 
##slot-imm V int-regs 5 V int-regs 4 3 7 f 
##copy V int-regs 8 V int-regs 1 f 
##slot-imm V int-regs 9 V int-regs 8 3 7 f 
##copy V int-regs 10 V int-regs 5 f 
##copy V int-regs 11 V int-regs 9 f 
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f 
##replace V int-regs 12 D 0 f 
##epilogue f 
##return f 
_spill-counts f f

Now, the low-level alias analysis optimization is able to detect that the second ##slot-imm is redundant, and it transforms the code into the following:

_label 0 f 
##prologue f 
_label 1 f 
##peek V int-regs 1 D 0 f 
##copy V int-regs 4 V int-regs 1 f 
##slot-imm V int-regs 5 V int-regs 4 3 7 f 
##copy V int-regs 9 V int-regs 5 f 
##copy V int-regs 10 V int-regs 5 f 
##copy V int-regs 11 V int-regs 9 f 
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f 
##replace V int-regs 12 D 0 f 
##epilogue f 
##return f 
_spill-counts f f

Now, notice that both inputs to the ##compare instruction (vreg #10 and vreg #11) are copies of the same value, vreg #5. Value numbering detects this congruence since copy propagation is just a special case of value numbering; it then simplifies the comparison down to a constant load of f, since for any integer x, we have x < x => false. After dead code elimination, the result is the following:

_label 0 f 
##prologue f 
_label 1 f 
##load-immediate V int-regs 12 5 f 
##replace V int-regs 12 D 0 f 
##epilogue f 
##return f 
_spill-counts f f

While no programmer would explicitly write such redundant code, it comes up after inlining. For example, the push word, which adds an element at the end of a sequence, is implemented as follows:

: push ( elt seq -- ) [ length ] [ set-nth ] bi ;

HINTS: push { vector } { sbuf } ;

The hints tell the compiler to compile specialized versions of this word for vectors and string buffers. While the generic versions work on any sequence, the hinted versions are used when the input types match, and the result can be better performance if generic words are inlined (in this case, length and set-nth). Now, after a bunch of code is inlined, a piece of code comes up which compares the insertion index against the vector's length; however, the insertion index is the length in this case, and so value numbering is now able to optimize out a conditional which could not be optimized out before. Of course I could have just written a specialized version of push for vectors (or even built it into the VM) but I'd rather implement general optimizations and write my code in a high-level style.

I'd like to thank Doug Coleman for implementing some of the above improvements.

Branch splitting

Roughly speaking, branch splitting is the following transform:

[ A [ B ] [ C ] if D ] => [ A [ B D ] [ C D ] if ]

It is only run if D is a small amount of code. Branch splitting runs before value numbering, so value numbering's constant branch folding can help simplify the resulting control flow. Branch splitting is implemented in the compiler.cfg.branch-splitting vocabulary.

For example, suppose we have the following code:

[ [ t ] [ f ] if [ 1 ] [ 2 ] if ]

The high-level optimizer is unable to simplify this further, since it does not work on a control flow graph. The low level optimizer builds a control flow graph with the following shape:

The basic block in the middle is a candidate for splitting since it has two predecessors, and it is short. After splitting, the result looks like so:

Now, at this point, the only thing that this has achieved is eliminating one unconditional jump at the expense of code size. However, after stack analysis and value numbering run, some redundant work is eliminated:

One example of real code that benefits from branch splitting includes code that calls the = word followed by a conditional. The = word is inlined, and it has a conditional inside of it; some branches of the conditional push constants t and f, and others call non-inline words. The result is that the branches which push a constant boolean can directly jump to the destination block without the overhead of storing and testing a boolean value.

Once again thanks to Doug Coleman for collaborating with me on the implementation of branch splitting.

Block joining

Block joining is a simple optimization which merges basic blocks connected by a single control flow edge. Various low-level optimizations leave the control flow graph with a large number of empty or very small blocks with no conditional control flow between them. Block joining helps improve effectiveness of local optimizations by creating larger basic blocks.

In the above example for branch splitting, block joining will eliminate the four empty basic blocks that remain after optimization.

Block joining is implemented in the compiler.cfg.block-joining vocabulary.

Faster overflowing integer arithmetic intrinsics

In the lucky cases where the compiler can eliminate overflow checks, the various math operations become single instructions in the low-level IR which eventually map directly to machine instructions. In the general case, however, an overflow check has to be generated. In the event of overflow, a bignum is allocated, which may in turn involve running the garbage collector, so it is quite an expensive operation compared to a single machine arithmetic operation.

Previously, the overflowing fixnum operations were represented in low-level IR as single, indivisible units, and just like subroutine calls and returns, they were "sync points" which meant that all values in registers had to be saved to the data and retain stacks before, and reloaded after. This is because of how these operations were implemented; they would perform the arithmetic, do an overflow check, and in the event of overflow, they would invoke a subroutine which handled the overflow. Keeping registers live across this operation was not sound in the event of overflow, since the subroutine call would clobber them.

No longer. Now, an overflowing fixnum operation breaks down into several nodes in the control flow graph, and the arithmetic part, the overflow check and the bignum allocation are represented separately. In particular, the code to save registers to the stack and reload them after is only generated in the overflow case, so in the event of no overflow, which happens much more frequently, execution can "fall through" and continue using the same registers as before.

The code that generates low-level instructions and control flow graph nodes for overflowing fixnum operations is defined in the compiler.cfg.intrinsics.fixnum vocabulary.

Faster integer shifts

This last optimization is a very minor one, but it made a difference on benchmarks. Previously shifts with a constant shift count and no overflow check would compile as a machine instruction, and all other forms of shifts would invoke subroutine calls. Now, machine instructions are generated in the case where the shift count is unknown, but the result is still known to be small enough not to require an overflow check.

Benchmark results

Here are the results of running Factor's benchmark suite on a build from 31st May 2009, before I started working on the current set of low-level optimizer improvements (which includes the register allocator I blogged about previously), and from today.

Benchmark	Before	After
benchmark.backtrack	1.767561	1.330641
benchmark.base64	1.997951	1.738677
benchmark.beust1	2.765257	2.461088
benchmark.beust2	3.584958	1.694427
benchmark.binary-search	1.55002	1.574595
benchmark.binary-trees	1.845798	1.733355
benchmark.bootstrap1	10.860492	11.447687
benchmark.dawes	0.229999	0.161726
benchmark.dispatch1	2.015653	2.119268
benchmark.dispatch2	1.817941	1.216618
benchmark.dispatch3	2.568987	1.899128
benchmark.dispatch4	2.319587	2.032182
benchmark.dispatch5	2.346744	1.614045
benchmark.empty-loop-0	0.146716	0.12589
benchmark.empty-loop-1	0.430314	0.342426
benchmark.empty-loop-2	0.429012	0.342097
benchmark.euler150	16.901451	15.288867
benchmark.euler186	8.805434999999999	7.920478
benchmark.fannkuch	3.202698	2.964037
benchmark.fasta	5.52608	4.934112
benchmark.gc0	2.15066	1.993158
benchmark.gc1	4.984841	4.961272
benchmark.gc2	3.327706	3.265462
benchmark.iteration	3.736756	3.30438
benchmark.javascript	9.79904	9.164517
benchmark.knucleotide	0.282296	0.251879
benchmark.mandel	0.125304	0.123945
benchmark.md5	0.946516	0.85062
benchmark.nbody	3.982774	3.349595
benchmark.nested-empty-loop-1	0.116351	0.135936
benchmark.nested-empty-loop-2	0.692668	0.438932
benchmark.nsieve	0.714772	0.698262
benchmark.nsieve-bits	1.451828	0.907247
benchmark.nsieve-bytes	0.312481	0.300053
benchmark.partial-sums	1.205072	1.221245
benchmark.pidigits	16.088346	16.159176
benchmark.random	2.574773	2.706893
benchmark.raytracer	3.481714	2.914116
benchmark.recursive	5.964279	3.215277
benchmark.regex-dna	0.132406	0.093095
benchmark.reverse-complement	3.811822	3.257535
benchmark.ring	1.756481	1.79823
benchmark.sha1	2.267648	1.473887
benchmark.sockets	8.794280000000001	8.783398
benchmark.sort	0.421628	0.363383
benchmark.spectral-norm	3.830249	4.036353
benchmark.stack	2.086594	1.014408
benchmark.sum-file	0.528061	0.422194
benchmark.tuple-arrays	0.127335	0.103421
benchmark.typecheck1	0.876559	0.6723440000000001
benchmark.typecheck2	0.878561	0.671624
benchmark.typecheck3	0.86596	0.670099
benchmark.ui-panes	0.426701	0.372301
benchmark.xml	2.351934	2.187999

There are a couple of regressions I need to look into but for the most part the results look pretty good. I also expect further gains to come with additional optimizations that I plan on implementing.

Next steps

I'm going to keep working on the code generator for another little while. First of all, compile time has increased so that needs to improve. Next, I'm going to implement better global optimization. At this point, values are stored in register between basic blocks, unlike with the old code generator. However, loop variables are still stored on the stack between iterations because the analysis does not handle back edges yet. Fixing this, and making float unboxing work globally, is my next step. After that, I plan on adding support for unboxed word-size integers (32 or 64-bits, depending on platform) and some intrinsics for SSE2 vector operations on Intel CPUs. Next, the register allocator needs improved coalescing logic, and it also needs to support fixed register assignments as found on some x86 instructions which take implicit operands. Finally, I have a few optimizations I want to add to the high level optimizer.

Thursday, July 09, 2009

Improvements to Factor's register allocator

For the last couple of months I've been working on some improvements to Factor's low-level optimizer and code generator. The major new change is that the control flow graph IR is now allowed to define registers that are used in more than one basic block. Previously, all values would be loaded from the data stack at the start of a basic block and stored to the stack at the end. Register allocation in this case is called local register allocation. Now the only time when values have to be saved to the stack is for subroutine calls. The part of the compiler most affected by this is of course the register allocator, because now it needs to do global register allocation.

I will describe the new optimizations in a future blog post and talk about global register allocation here. To help people who are learning Factor or are simply curious about what real Factor code looks like, I've inserted lots of links from this blog post to our online code repository browser.

Instead of rewriting the register allocator from scratch, I took a more incremental approach, gradually refactoring it to operate on the entire control flow graph instead of a basic block at a time. Generalizing the linear scan algorithm to do this is relatively straightforward, but there are some subtleties. Once again, I used Christian Wimmer's masters thesis as a guide, however this time around I implemented almost the entire algorithm described there, instead of the simplification to the local case.

First, recall some terminology: virtual registers are an abstraction of an ideal CPUs register file: there can be arbitrarily many of them, and each one is only ever assigned to once. The register allocator's job is to rewrite the program to use physical registers instead, possibly mapping a number of virtual registers to the same physical register, and inserting spills and reloads if there are insufficient physical registers to store all temporary values used by the program.

Linearization

A local register allocator operates on a basic block, which is a linear list of instructions. For the global case, linear scan still requires a linear list of instructions as input, so the first step is to traverse the CFG in reverse post order and number the instructions.

Building live intervals

Once a linear ordering among all instructions has been stablished, the second step in linear scan register allocation is building live intervals. In the local case, a virtual register is defined in the same basic block as all of its usages. Furthermore, if we assume single assignment, then there will be only a single definition, and it will precede all uses. So the set of instructions where the register is live is a single interval, which begins at the definition and ends at the last use.

In the global case, the situation is more complicated. First of all, because the linearization chosen is essentially arbitrary, and does not reflect the actual control flow in reality, instructions are not necessarily executed in order, so if a register is defined at position A and used at position B, it does not mean that the register is live at every position between A and B. Indeed, in the global case, the live "interval" of a register may consist of a number of disjoint ranges of instructions.

As an example, consider the following C program:

1: int x = ...;
2: if(...) {
3:     ... a bunch of code that does not use x
4: } else {
5:     int y = x + 1;
6:     ... code that doesn't use x
7: }
8: ... more code that doesn't use x

The value x is live at position 1, and also at position 5. The last usage of x is at line 5, so it is not live at any position after 5, but also, it is not live at position 3 either, since if the true branch is taken, then x is never used at all. We say that x has a lifetime hole at position 3.

Another case where lifetime holes come up is virtual registers with multiple definitions. The very last step in the low level optimizer, right before register allocation, is phi node elimination. Essentially, phi node elimination turns this,

if(informal) {
    x1 = "hi";
} else {
    x2 = "hello";
}
x = phi(x1,x2);

into the following:

if(informal) {
    x1 = "hi";
    x = x1;
} else {
    x2 = "hello";
    x = x2;
}

Now, the value x is not live at the line x2 = "hello";, because it hasn't been defined yet, and the previous definition is not available either. So the lifetime interval for x has a hole at this location.

To represent such complex live intervals, the live-interval data type now contains a list of live ranges. Since basic blocks cannot contain control flow or multiple definitions, there can be at most one live range per basic block per live interval. (The only time multiple definitions come up is as a result of Phi node elimination, and since the input is in SSA form this cannot introduce multiple assignments in the same block).

Information from liveness analysis is used to construct live ranges. A basic block's live range contains the first definition of the register as well as the last use. If the register is in the "live in" set of the basic block, then the range begins at the block's first instruction, and if the register is in the "live out" set then the range ends at the block's last instruction. Note that even if a basic block has no usages of a register, it may appear in both the live in and live out sets; for example, the register might be defined in a predecessor block, and used in a successor block.

Allocating physical registers to live intervals

Once live intervals have been built, the next step is to assign physical registers to them. During this process, live intervals may be split, with register to memory, memory to register and register to register moves inserted in between; a single virtual register may be in different physical registers and memory locations during its lifetime.

In the local case, the allocation algorithm maintains two pieces of state; the "unhandled set", which is a heap of intervals sorted by start position, and an "active set", which is a list of intervals currently assigned to physical registers. The main algorithm removes the smallest element from the unhandled set, and then removes anything from the active set which ends before the new interval begins. The next step depends on whether or not any physical registers are free. If all physical registers are taken up in the active set, then a decision is made to spill either the new interval, or a member of the active set. Spilling will free up at least one physical register. At this point, a physical register is now free and can be assigned to the new interval, which is then added to the active set. Spilling may split intervals into smaller pieces, and add new elements to the unhandled set, but since the new split intervals are strictly smaller than the original interval, the process eventually terminates. Once the unhandled set is empty, allocation is complete.

The global case is more complicated, primarily due to the presence of lifetime holes in live intervals. A third piece of state is maintained by the allocation pass. This new set, the "inactive set", contains live intervals which have not ended yet, but are currently in a lifetime hole. To illustrate, suppose we have two physical registers, R1 and R2, and three virtual registers, V1, V2, and V3, with the following live intervals:


    [Program start ....... Program end]
V1: <======>                <=========>
V2:         <==============>
V3:              <================>
                 ^
                 |
           Current position

Immediately before V3's live interval begins, the active set includes V2, and the inactive set contains V1, since V1 is not finished yet, but the current position is in a lifetime hole. Suppose that V1 was assigned to physical register R1, and V2 was assigned to physical register R2. In the local case, since the active set only includes one element, V1, we could conclude that R2 was free, and that V3 could be assigned to R2. However, this is invalid, since at a later point, V1 becomes active again, and V1 uses R2. Two overlapping live intervals cannot use the same physical register. So global linear scan also has to examine the inactive set in order to see if the new live interval can fit in an inactive interval's lifetime hole. In this case, it cannot, so we have to split V3, and insert a copy instruction in the middle:

      [Program start ....... Program end]
V1:   <======>                <=========>
V2:           <==============>
V3_1:              <=========>
V3_2:                         <====>
                   ^
                   |
             Current position

With V3 split into V3_1 and V3_2, a valid register assignment is possible, with V3_1 stored in R1 and V3_2 stored in R2.

In this case, a register assignment without any spilling is possible. However, sometimes, virtual registers have to be stored to memory. This is done by the spilling code which uses a similar algorithm to the local case. The main complication is that in order to free up a physical register for the entire lifetime of the new interval, more than one interval may need to be split; zero or one active intervals, and zero or more inactive intervals.

Assigning physical registers to instructions

After live intervals are split, and physical registers are assigned to live intervals, the assignment pass iterates over all instructions, storing a mapping from virtual registers to physical registers in each instruction. Recall that this cannot be a global mapping, since a virtual register may reside in several different physical registers at different locations in the program.

The algorithm here is essentially unchanged from the local case. Two additions are that at the start and end of each basic block, the assignment pass records which registers are live. This information is used by GC checks. GC checks are performed on basic block boundaries of blocks which allocate memory. Since a register may be defined in one basic block, and used in another, with a GC check in between, GC checks need to store registers which contain pointers to a special location in the call stack and pass a pointer to this location to the GC. Some language implementations use a conservative GC to get around having to do this, but I think recording accurate pointer root information is a superior approach. This liveness information is also used by the resolve pass, coming up next.

The resolve pass

Since linear scan does not take control flow into account, interval splitting will give incorrect results if implemented naively. Consider the following example program:

x = ...;

if(...) {
    ...
    y = x + 1;
    ... some code with high register pressure here
} else {
    z = x + 2;
}

If the live interval of x was spilled immediately after the line y = x + 1; then the register allocator would rewrite it as follows -- it would be spilled after the last usage before the spill point, and reloaded before the first usage after the spill point:

x = ...;

if(...) {
    ...
    y = x + 1;
    spill(x);
    ... some code with high register pressure here
} else {
    reload(x);
    z = x + 2;
}

In this case however, the spill and reload locations are in different control flow paths. However, linear scan does not take control flow into account during the main algorithm. Instead, a subsequent resolve pass. The resolve pass looks at every edge in the control flow graph, and compares the physical register and spill slot assignments of all virtual registers which are both live across the edge. If they differ, moves, spills and reloads are inserted on the edge. Conceptually, this pass is simple, but it feels hackish -- it seems that a more elegant register allocator would incorporate this into the previous stages of allocation. However, all formulations of linear scan I've seen so far work this way. The only tricky part about the resolve pass is that the moves have to be performed "in parallel"; for example, suppose we have two virtual registers V1 and V2, and two physical registers R1 and R2. Block A and block B are joined by an edge. At the end of block A, V1 is in R1, and V2 is in R2. At the start of block B, V1 is in R2 and V2 is in R1. Doing the moves in any order would give the wrong result; for correct behavior, the cycle has to be broken with a third temporary location. The logic for doing this in the general case is a bit complex; Doug Coleman helped me implement the mapping code.

General observations

While implementing the register allocator improvements, I relied heavily on unit testing as well as an informal "design by contract"; basically, making extensive use of runtime assertions in the code itself. The compiler.cfg.linear-scan vocabulary weighs in at 3828 lines of code; 2484 lines of this are unit tests. This is certainly a much higher test:code ratio than most other Factor code I've written, and some of the tests are not very useful or redundant. However, they represent a log of features and bugs I've implemented, and at the very least, minimize regressions.

When testing the register allocator, I first got it working on x86-64. Here, there are enough registers that spilling almost never occurs, so I could focus on getting the primary logic correct. Then, I hacked the x86-64 backend to only use 4 physical registers, and debugged spilling.

The whole project took a bit longer than I had hoped, but I managed to start implementing a number of additional optimizations in the meantime. When they are more complete I will blog about them.

Factor: a practical stack language