Thursday, July 30, 2009

Dataflow analysis, computing dominance, and converting to and from SSA form

As I mentioned in my previous post, I'm working on having the compiler backend take advantage of our fancy new register allocator by making better use of registers than we do now.

As with the previous post, there are lots of links to Factor code here. Since much of what I implemented comes from academic papers, I've also linked to the relevant literature.

Over the course of the last year, Factor went from a rudimentary scheme where some intermediate values are placed in registers, to local register allocation for all temporaries in a basic block, to the current system where registers can remain live between basic blocks and values are only stored on the data stack when absolutely necessary: at subroutine calls, at returns, and at points in the procedure after which the value will not be used again.

Before I dive into the technical details, here is a taste of what the new code generator can do. The following Factor word,
: counted-loop-test ( -- ) 1000000000 [ ] times ;

compiles to the following x86-64 machine code:
000000010c73b2a0: 48bd0050d6dc01000000  mov rbp, 0x1dcd65000
000000010c73b2aa: 4831c0                xor rax, rax
000000010c73b2ad: 4983c610              add r14, 0x10
000000010c73b2b1: e904000000            jmp 0x10c73b2ba
000000010c73b2b6: 4883c008              add rax, 0x8
000000010c73b2ba: 4839e8                cmp rax, rbp
000000010c73b2bd: 0f8cf3ffffff          jl dword 0x10c73b2b6
000000010c73b2c3: 4983ee10              sub r14, 0x10
000000010c73b2c7: c3                    ret

General overview

The general structure of the low-level optimizer remains, along with the optimizations it performs -- alias analysis, value numbering, dead code elimination, and so on. However, what has changed in a major way is how the low-level IR is constructed, and how it gets converted out of SSA. There are also two new abstractions used by the compiler: a dataflow analysis framework, and a meta-programming utility for renaming values.

In the low-level IR, "peek" and "replace" instructions are used to read and write stack locations, making stack traffic explicit. Peek instructions output a virtual register, and replace instructions take a register as input. All other instructions operate on virtual registers, not stack locations.

When building the control flow graph, the compiler.cfg.builder vocabulary used to insert "peek" and "replace" instructions at every use of a stack location, and subsequent redundancy elimination passes would get rid of the redundant ones -- either because the relevant stack location was in a register already (peek after peek or peek after replace), or because it would be overwritten before being read again (replace after replace).

Now, the CFG builder doesn't insert peeks and replaces at all, and simply associates with each basic block a set of stack locations that it reads, and a set of stack locations that it writes. For each stack location, there is a globally unique virtual register which is used for it; instructions which need a stack location simply refer to that fixed virtual register (or assign to it). The last step of the CFG builder runs a dataflow analysis and actually inserts peek and replace instructions on the right edges in the CFG, mostly to ensure the invariants that values are saved to the stack between subroutine calls, that all values that are needed from the stack get loaded at some point, and that everything is eventually saved to the stack before the procedure returns. The inserted peeks and replaces reference the stack location's global fixed virtual register.

One consequence of the above construction is that the output of the CFG builder is no longer in SSA form, like it used to be. However, the problem of converting applicative languages into SSA form is well-known, and so now I have an explicit SSA construction pass which runs after the CFG builder, before any other optimizations which operate on values. After optimizations, SSA form is eliminated by converting phi instructions into copies, at which point the result is passed on to the machine register allocator.

Dataflow analysis

The Wikipedia page on dataflow analysis gives a good overview of the topic. Simple dataflow analyses with a single direction of flow all look quite similar, and there are many ways to abstract out the duplication. Since my code to convert stack-based code into low-level IR requires four dataflow analysis passes, and register allocation and SSA construction perform liveness analysis, I needed to eliminate the 5x code duplication that would result from a naive implementation.

I went with an object-oriented approach, where a set of generic words are defined, together with a top-level driver word which takes an instance supporting these generic words. The generic words compute local sets, implement the join operation, and determine the direction of flow. The code is defined in the compiler.cfg.dataflow-analysis vocabulary. In addition to the generic words, I also define a couple of parsing words which create a new instance of the dataflow analysis class, together with some supporting boilerplate for looking up a basic block's in and out sets after analysis has run. The actual content of the sets propagated depends on the analysis itself; liveness deals with virtual registers, and stack analysis deals with stack locations.

Because of this framework, adding new analyses is very easy. Liveness analysis, implemented in the compiler.cfg.liveness vocabulary, is only a few lines of code.
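To make the shape of such a framework concrete, here is a minimal sketch in Python rather than Factor. It is not the code in compiler.cfg.dataflow-analysis; the names Liveness, run_dataflow, join and transfer are made up for illustration, and the per-block use and def sets are assumed to be precomputed. An analysis object supplies the direction, the join and the transfer function, and a generic worklist driver does the rest.

```python
from collections import deque


class Liveness:
    """Backward analysis: which virtual registers are live at block boundaries."""
    forward = False

    def __init__(self, uses, defs):
        self.uses = uses  # block -> registers read before any local write
        self.defs = defs  # block -> registers written in the block

    def join(self, sets):
        # A register is live out of a block if it is live into any successor.
        result = set()
        for s in sets:
            result |= s
        return result

    def transfer(self, block, live_out):
        return self.uses[block] | (live_out - self.defs[block])


def run_dataflow(analysis, blocks, successors):
    """Generic worklist driver (hypothetical); returns (in_sets, out_sets)."""
    predecessors = {b: [] for b in blocks}
    for b in blocks:
        for s in successors[b]:
            predecessors[s].append(b)
    # "sources" feed the join; "sinks" must be revisited when a result changes.
    sources = predecessors if analysis.forward else successors
    sinks = successors if analysis.forward else predecessors
    joined = {b: set() for b in blocks}
    transferred = {b: set() for b in blocks}
    work = deque(blocks)
    while work:
        b = work.popleft()
        joined[b] = analysis.join(transferred[s] for s in sources[b])
        new = analysis.transfer(b, joined[b])
        if new != transferred[b]:
            transferred[b] = new
            work.extend(sinks[b])
    if analysis.forward:
        return joined, transferred
    return transferred, joined


# A counted loop: 0 = setup, 1 = loop test, 2 = loop body, 3 = exit.
blocks = [0, 1, 2, 3]
successors = {0: [1], 1: [2, 3], 2: [1], 3: []}
uses = {0: set(), 1: {"i", "n"}, 2: {"i"}, 3: set()}
defs = {0: {"i", "n"}, 1: set(), 2: {"i"}, 3: set()}
live_in, live_out = run_dataflow(Liveness(uses, defs), blocks, successors)
```

In this sketch a forward analysis would only need to set forward = True and supply its own join and transfer; the driver is unchanged.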

The dataflow analysis framework does not take SSA phi instructions into account, because stack analysis operates on stack locations (which are not single assignment) and liveness analysis is performed in SSA construction, when no phi instructions have been added yet, as well as register allocation, after SSA destruction.

Renaming values

Multiple compiler passes -- SSA construction, copy propagation, SSA destruction, and register allocation -- need to traverse instructions and rename values. The first three replace uses of virtual registers with other virtual registers, and the last replaces virtual registers with physical registers prior to machine code generation. To avoid duplicating code, I wrote a utility which takes a hashtable of value mappings and renames all values used in an instruction. It was implemented as a generic word with a method for each instruction. However, this approach was insufficient for SSA construction, where the renaming set cannot be precomputed and instead changes at every instruction. A better abstraction is one that takes a quotation to apply to each value. However, dynamically calling a quotation for every slot of every instruction would be expensive, and inlining the entire renaming combinator at every use would be impractical. Instead, I used Factor's "functors" feature to write a parsing word which generates renaming logic customized to a particular use. This is defined in the compiler.cfg.renaming.functor vocabulary, and one usage among many is in compiler.cfg.renaming, which implements the old behavior whereby a single hashtable of renamings was input. Here is how this utility is used in SSA construction, for instance:
RENAMING: ssa-rename [ gen-name ] [ top-name ] [ ]

This defines words ssa-rename-defs, ssa-rename-uses and ssa-rename-temps (the last being a no-op). The first two apply gen-name and top-name to each value to compute the new value.

The register allocator does something similar in compiler.cfg.linear-scan.assignment:
RENAMING: assign [ vreg>reg ] [ vreg>reg ] [ vreg>reg ]

Here, words named assign-defs, assign-uses and assign-temps are defined, and they all perform the same operation on each instruction's defined values, used values and temporary values, respectively, by looking up the physical register assigned to each virtual register. This and other optimizations to linear scan significantly improved compile time, to the point where it has almost gone down to what it was before I started adding these fancy new optimizations.
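For readers who do not know Factor, the interface that RENAMING: provides can be approximated in Python as a factory taking three functions and returning three renamers. This is only an analogy: Insn, make_renaming and the simplified gen_name and top_name below are made up for illustration, and Python closures still call their arguments dynamically, so the sketch shows the interface and not the specialization that functors buy.

```python
from dataclasses import dataclass, field


@dataclass
class Insn:
    op: str
    defs: list = field(default_factory=list)
    uses: list = field(default_factory=list)
    temps: list = field(default_factory=list)


def make_renaming(rename_def, rename_use, rename_temp):
    """Hypothetical analogue of RENAMING: builds three per-instruction renamers."""
    def rename_defs(insn):
        insn.defs = [rename_def(v) for v in insn.defs]

    def rename_uses(insn):
        insn.uses = [rename_use(v) for v in insn.uses]

    def rename_temps(insn):
        insn.temps = [rename_temp(v) for v in insn.temps]

    return rename_defs, rename_uses, rename_temps


# Register assignment: one fixed table serves all three roles.
assignment = {"v1": "rax", "v2": "rbp", "v3": "rcx"}
lookup = assignment.__getitem__
assign_defs, assign_uses, assign_temps = make_renaming(lookup, lookup, lookup)

insn = Insn("add", defs=["v3"], uses=["v1", "v2"])
assign_uses(insn)
assign_defs(insn)
assign_temps(insn)

# SSA-style renaming: the mapping changes as each definition is seen.
# A plain counter stands in for the real name stacks (straight-line code only).
versions = {"x": 0}


def gen_name(v):
    versions[v] += 1
    return f"{v}.{versions[v]}"


def top_name(v):
    return f"{v}.{versions[v]}"


ssa_defs, ssa_uses, _ = make_renaming(gen_name, top_name, lambda v: v)
code = [Insn("inc", defs=["x"], uses=["x"]), Insn("inc", defs=["x"], uses=["x"])]
for i in code:
    ssa_uses(i)  # uses must see the name that was current before this definition
    ssa_defs(i)
```

The second half shows why a precomputed hashtable is not enough for SSA construction: the name returned for a use depends on which definitions have already been processed.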

Computing the dominator tree

The dominator tree is a useful concept for optimizations which need to take global control flow into account. At this point, I'm using it for SSA construction and destruction, but there are many other optimizations which depend on dominance information. The classical approach for computing dominance uses iterative dataflow analysis, but a better approach is given in the paper A Simple, Fast Dominance Algorithm by Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. I implemented the dominator tree computation described by the paper in the compiler.cfg.dominance vocabulary. The dominator tree is a tree with basic blocks as nodes, with the entry block of the CFG as its root.

There are three operations that need to be performed, and this influences the tree representation:
  1. Getting a node's parent in the tree, often referred to as the immediate dominator (sometimes, a convention is used where the immediate dominator of the root node is the root node itself).
  2. Getting a node's children.
  3. Testing if one node is an ancestor of another node (in this case we say that the first node dominates the second).

The algorithm in the paper computes a mapping from basic blocks to basic blocks, which gives the immediate dominator of every basic block. This lets us solve #1 in constant time. Since many algorithms need to look up children, I invert the immediate dominator mapping and turn it into a multimap (which in Factor, we just represent as an assoc from keys to sequences of values; the push-at utility word is helpful for constructing these). This gives us #2. For #3, the obvious approach is to walk up the tree from the second node, checking if the node at each step is the first node. However, this takes time proportional to the height of the tree, and the worst case is straight-line code with little control flow, where the dominator tree becomes a linked list. Another approach is to compute a hashed set of dominators for each node, which gives us constant time dominance checks, but the space usage is quadratic in the worst case so that's not ideal either.
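As a rough illustration of operations #1 and #2, here is a Python sketch of the immediate dominator computation in the style of Cooper, Harvey and Kennedy, followed by the inversion into a children multimap. It is written from my reading of the paper and is not a port of compiler.cfg.dominance; the function names and the example CFG are made up.

```python
def reverse_postorder(entry, successors):
    seen, order = set(), []

    def visit(block):
        seen.add(block)
        for s in successors[block]:
            if s not in seen:
                visit(s)
        order.append(block)

    visit(entry)
    order.reverse()
    return order


def immediate_dominators(entry, successors):
    rpo = reverse_postorder(entry, successors)
    index = {b: i for i, b in enumerate(rpo)}
    preds = {b: [] for b in rpo}
    for b in rpo:
        for s in successors[b]:
            preds[s].append(b)
    idom = {entry: entry}  # convention: the root is its own immediate dominator

    def intersect(a, b):
        # Walk both nodes up the partial tree until they meet.
        while a != b:
            while index[a] > index[b]:
                a = idom[a]
            while index[b] > index[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for b in rpo[1:]:
            # Only predecessors that already have an idom can take part.
            processed = [p for p in preds[b] if p in idom]
            new = processed[0]
            for p in processed[1:]:
                new = intersect(p, new)
            if idom.get(b) != new:
                idom[b] = new
                changed = True
    return idom


def dominator_children(idom):
    """Invert the idom mapping into a parent -> children multimap."""
    children = {}
    for block, parent in idom.items():
        if block != parent:
            children.setdefault(parent, []).append(block)
    return children


# A diamond (1 -> 2, 3 -> 4) inside a loop (4 -> 1), then an exit block 5.
successors = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [1, 5], 5: []}
idom = immediate_dominators(0, successors)
children = dominator_children(idom)
```

Here setdefault plays the role that push-at plays in Factor when building the multimap.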

A much better trick can be found in the paper on the SSA destruction algorithm which I will talk about below. I believe this technique goes back to the 70's and it is widely known, but I did not learn about it until now. First, you perform a depth-first traversal of the dominator tree, incrementing a counter at each step. The first time you visit a node (on the way down), you assign the current counter value to the node's preorder value. The second time you visit a node (on the way up), you assign the current counter value to the node's maxpreorder value. What this does is number the nodes in preorder, and the maxpreorder of each node is the maximum of the preorder numbers of its descendants. Once these numbers have been precomputed, dominance checking can be done in constant time using the identity:
A dominates B iff preorder(B) >= preorder(A) & preorder(B) <= maxpreorder(A)

Of course, this works for any tree, and if you plan on doing many repeated descendant-of tests, it is worth precomputing the pre/maxpre numbers for each node in this manner. This addresses #3 above.
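The numbering trick is short enough to sketch in full. The Python below is an illustration with made-up names (number_tree, dominates) and a small hand-written tree. It increments the counter only on the way down, so maxpreorder is the largest preorder number in the node's subtree, and a node is considered to dominate itself.

```python
def number_tree(root, children):
    """Assign preorder and maxpreorder numbers with one depth-first walk."""
    preorder, maxpreorder = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        preorder[node] = counter  # first visit, on the way down
        counter += 1
        for child in children.get(node, []):
            visit(child)
        # Second visit, on the way up: the last number handed out so far
        # is the largest preorder number anywhere in this subtree.
        maxpreorder[node] = counter - 1

    visit(root)
    return preorder, maxpreorder


def dominates(a, b, preorder, maxpreorder):
    """Constant-time ancestor test: b's number lies in a's subtree range."""
    return preorder[a] <= preorder[b] <= maxpreorder[a]


# Dominator tree: 0 -> 1, 1 -> {2, 3, 4}, 4 -> 5.
children = {0: [1], 1: [2, 3, 4], 4: [5]}
pre, maxpre = number_tree(0, children)
```

With these numbers, dominates(1, 5, pre, maxpre) and dominates(4, 5, pre, maxpre) hold, while siblings such as 2 and 3 do not dominate each other.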

Here is a control flow graph, with the numbers denoting reverse post order on basic blocks:

and here is the corresponding dominator tree:

(These diagrams were generated using the Graphviz tool together with the compiler.cfg.graphviz vocabulary).

The paper also gives an efficient algorithm for computing dominance frontiers, but I do not need those, for reasons given in the next section.

SSA construction

The classical approach for constructing SSA form, which yields what is known as minimal SSA, involves three steps:
  1. Computing dominance frontiers for each basic block in the control flow graph
  2. For every value, take the set of basic blocks which have a definition of the value, compute the iterated dominance frontier of this set, and insert a phi instruction (with dummy inputs) in each member of the set
  3. Walk the dominator tree, renaming definitions and usages in order to enforce single static assignment, updating phi instruction inputs along the way.

This approach has three problems:
  1. Computing iterated dominance frontiers is expensive, and this is done for every value defined in more than one block
  2. Too many phi instructions are inserted, and most end up being eliminated as dead code later
  3. The renaming algorithm, as originally specified, requires too much work to be done on the walk back up the dominator tree, with each block's instructions being traversed both on the way down and on the way up
Nevertheless, the algorithm is simple and easy to understand. It is explained in the original paper on SSA form, Efficiently computing static single assignment form and the control dependence graph; straightforward pseudocode can be found in these lecture notes.

The so-called "pruned SSA form" addresses the issue of too many phi instructions being inserted. It is a minor modification of minimal SSA construction. Prior to inserting phi instructions, liveness information is computed for each block. Then, a phi instruction is only inserted for a value if the value is live into the block. Computing liveness is somewhat expensive, and the so-called "semi-pruned SSA form" uses a simple heuristic to approximate liveness: phi nodes are only inserted for values which are used in blocks other than those they are defined in.
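The semi-pruned heuristic can be sketched in a few lines of Python. This follows my understanding of the usual formulation, where a name counts as non-local if some block reads it before writing it; the function name, the (defs, uses) instruction encoding and the example blocks are made up for illustration and are not what Factor does, since Factor uses real liveness information.

```python
def non_local_names(blocks):
    """Names read in some block before that block defines them.

    blocks maps a block label to a list of (defs, uses) pairs, one per
    instruction, in program order (an assumed encoding).
    """
    non_locals = set()
    for insns in blocks.values():
        killed = set()  # names already defined earlier in this block
        for defs, uses in insns:
            non_locals |= {u for u in uses if u not in killed}
            killed |= set(defs)
    return non_locals


blocks = {
    "entry": [(["i"], []), (["n"], [])],
    "head": [(["t"], ["i", "n"]), ([], ["t"])],
    "body": [(["i"], ["i"])],
}
phi_candidates = non_local_names(blocks)
```

Here t is defined and consumed inside a single block, so it never becomes a phi candidate, while i and n cross block boundaries and do.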

An algorithm for computing iterated dominance frontiers which does not require dominance frontiers to be computed first was described in the paper A Practical and Fast Iterative Algorithm for Phi-Function Computation Using DJ Graphs by Dibyendu Das and U. Ramakrishna.

Finally, the paper introducing semi-pruned SSA form, titled Practical Improvements to the Construction and Destruction of Static Single Assignment Form, proposes a slightly more efficient renaming algorithm.

So the SSA construction algorithm I implemented in the vocabulary is a combination of these three papers. First, I compute merge sets using the DJ-Graph algorithm; then I use liveness information to place phi instructions; finally, I use the improved renaming algorithm.

SSA destruction

To get out of SSA form and back to imperative code which can be executed by a machine, phi instructions must be eliminated. The approach originally described by Cytron et al gives incorrect results in many cases; the correct algorithm is well-known now but it introduces many additional copy instructions. The classical approach for eliminating copies ("copy coalescing") is to do it as part of register allocation; if two values are related by a copy but do not otherwise interfere, they can be assigned to the same physical register, and the copy can be deleted from the program. This works for a graph-coloring approach, but with linear scan, you're limited in how much global knowledge you can have while performing register allocation, and accurate interference information is difficult to obtain.

Factor's register allocator performs some basic coalescing, mostly to eliminate copies arising from conversion to two-operand form on x86. However, phi nodes introduce copies with more complex interferences and my approach didn't work there, so even though stack analysis eliminated many memory-to-register and register-to-memory moves, the generated code had a large number of register-to-register moves, which are a bottleneck for instruction decoding, not to mention a waste of valuable cache space.

Instead of attempting to coalesce copies arising from phi instructions in the register allocator, a more modern approach is to do this as part of SSA destruction -- instead of converting phi instructions to copies, the goal is to avoid inserting as many copies as possible in the first place.

A nice algorithm for SSA destruction with coalescing is detailed in the paper Fast copy coalescing and live-range identification, by Zoran Budimlic et al. The algorithm is very clever -- the two key results are a constant-time interference test between SSA values using dominance and live range information, and a linear-time interference test in a set of variables using "dominance forests", which are easy to construct and enable you to rule out most pairs of values for interference tests altogether.

This algorithm is implemented in LLVM (lib/CodeGen/StrongPHIElimination.cpp) and I essentially ported the C++ code to Factor. You can find the code in the compiler.cfg.ssa.destruction vocabulary.

I tweaked the algorithm a little -- instead of inserting sequential copies, I insert parallel copies, as detailed in the paper Revisiting Out-of-SSA Translation for Correctness, Code Quality and Efficiency by Benoit Boissinot et al. The parallel copy algorithm is implemented in compiler.cfg.parallel-copy. I used it not only in SSA destruction, but also to clean up some hackish approximations of the same problem in global stack analysis and the linear scan register allocator's resolve pass. I didn't implement the rest of the paper, because I found it hard to follow; it claims to provide a better algorithm than Budimlic et al, but the latter is good enough for now, and having a working implementation in the form of the LLVM pass was very valuable in implementing it in Factor.
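To show what a parallel copy has to deal with, here is a small Python sketch that turns a set of simultaneous copies into a safe sequence. It is deliberately simple and quadratic; it is not the algorithm from the Boissinot et al. paper nor the code in compiler.cfg.parallel-copy, and the function name and the "tmp" scratch location are assumptions made for the example.

```python
def sequentialize(parallel, temp="tmp"):
    """Order the copies in `parallel` (dst -> src) so that sequential
    execution has the same effect as performing them all at once."""
    pending = {d: s for d, s in parallel.items() if d != s}
    out = []
    while pending:
        # A destination no pending copy still reads can be overwritten safely.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Only cycles remain. Save one destination's old value in the
            # temporary and make its readers read the temporary instead.
            d = next(iter(pending))
            out.append((temp, d))
            pending = {k: (temp if s == d else s) for k, s in pending.items()}
    return out


# a and b swap, and c receives the old value of a.
copies = sequentialize({"a": "b", "b": "a", "c": "a"})

# Run the sequence to check it against the parallel semantics.
regs = {"a": 1, "b": 2, "c": 0}
for dst, src in copies:
    regs[dst] = regs[src]
```

A naive left-to-right emission of the same three copies would overwrite a before b reads it; breaking the cycle through a temporary is what keeps the swap correct.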

This pass was very effective in eliminating copy instructions; it generates 75% fewer copies than the old phi elimination pass, which simply used the naive algorithm.

Loop head rotation

The last optimization I added eliminates an unconditional branch in some loops. Consider the following CFG:

A valid way to linearize the basic blocks is in reverse post order, that is 0, 1, 2, 3, 4, 5. However, with this linearization, block 3 has an unconditional branch back to block 2, and block 2 has a conditional which either falls through to 3 or jumps to 4. So on every loop iteration, two jumps are executed (the conditional jump at 2 and the unconditional jump at 3). If, instead, the CFG were linearized as 0, 1, 3, 2, 4, 5, then while 1 would have an unconditional jump to 2, 2 would have a conditional jump back to 3 and 3 would fall through to 2. So upon entry to the loop, an extra unconditional jump (from 1 to 2) executes, but on each iteration, there is just the single conditional backward jump at 2. This improves performance slightly and is easy to implement; the code is in the compiler.cfg.linearization.order vocabulary, and I borrowed the algorithm from SBCL's src/compiler/control.lisp.
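The saving can be checked with a little arithmetic. The Python sketch below counts branch instructions executed along one run of the loop for each linearization. The edge list is reconstructed from the description above (the edge from 4 to 5 is an assumption), the function name is made up, and the cost model is deliberately crude: a two-way block always executes its conditional branch, and a one-way block executes a jump only when its successor is not the next block in the layout.

```python
def branches_executed(layout, successors, path):
    """Count branch instructions executed along `path` for a block layout.

    Assumes every two-way block has one successor placed right after it,
    which holds for both layouts compared here.
    """
    next_block = dict(zip(layout, layout[1:]))
    count = 0
    for here, there in zip(path, path[1:]):
        if len(successors[here]) == 2:
            count += 1  # the conditional branch executes, taken or not
        elif next_block.get(here) != there:
            count += 1  # unconditional jump: target is not the fall-through
    return count


successors = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5], 5: []}
iterations = 1000
path = [0, 1, 2] + [3, 2] * iterations + [4, 5]

plain = branches_executed([0, 1, 2, 3, 4, 5], successors, path)
rotated = branches_executed([0, 1, 3, 2, 4, 5], successors, path)
```

Under this model the reverse post order layout executes 2n + 1 branches for n iterations and the rotated layout n + 2, which is the same shape as the machine code at the top of this post: one jmp into the loop test, then a single backward jl per iteration.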


There are some performance regressions I need to work out, because global stack analysis introduces too many partial redundancies for some types of code, and inline GC checks are currently disabled because of an unrelated issue I need to fix. It will take a few more days of tweaking to sort things out, and then I will post some benchmarks. Early results are already very promising on benchmarks with loops, not just the trivial counted loop example above.

Next steps are global float unboxing, unboxed 32-bit and 64-bit integer arithmetic, and SSE intrinsics. To help with the latter, Joe Groff was kind enough to add support for all SSE1/2/3/4 instructions to Factor's x86 assembler. Finally, thanks to Cameron Zwarich for pointing me at some of the papers I linked to above.
