Thursday, July 09, 2009

Improvements to Factor's register allocator

For the last couple of months I've been working on some improvements to Factor's low-level optimizer and code generator. The major new change is that the control flow graph IR is now allowed to define registers that are used in more than one basic block. Previously, all values would be loaded from the data stack at the start of a basic block and stored to the stack at the end. Register allocation in this case is called local register allocation. Now the only time when values have to be saved to the stack is for subroutine calls. The part of the compiler most affected by this is of course the register allocator, because now it needs to do global register allocation.

I will describe the new optimizations in a future blog post and talk about global register allocation here. To help people who are learning Factor or are simply curious about what real Factor code looks like, I've inserted lots of links from this blog post to our online code repository browser.

Instead of rewriting the register allocator from scratch, I took a more incremental approach, gradually refactoring it to operate on the entire control flow graph instead of a basic block at a time. Generalizing the linear scan algorithm to do this is relatively straightforward, but there are some subtleties. Once again, I used Christian Wimmer's masters thesis as a guide, however this time around I implemented almost the entire algorithm described there, instead of the simplification to the local case.

First, recall some terminology: virtual registers are an abstraction of an ideal CPUs register file: there can be arbitrarily many of them, and each one is only ever assigned to once. The register allocator's job is to rewrite the program to use physical registers instead, possibly mapping a number of virtual registers to the same physical register, and inserting spills and reloads if there are insufficient physical registers to store all temporary values used by the program.

Linearization


A local register allocator operates on a basic block, which is a linear list of instructions. For the global case, linear scan still requires a linear list of instructions as input, so the first step is to traverse the CFG in reverse post order and number the instructions.

Building live intervals


Once a linear ordering among all instructions has been stablished, the second step in linear scan register allocation is building live intervals. In the local case, a virtual register is defined in the same basic block as all of its usages. Furthermore, if we assume single assignment, then there will be only a single definition, and it will precede all uses. So the set of instructions where the register is live is a single interval, which begins at the definition and ends at the last use.

In the global case, the situation is more complicated. First of all, because the linearization chosen is essentially arbitrary, and does not reflect the actual control flow in reality, instructions are not necessarily executed in order, so if a register is defined at position A and used at position B, it does not mean that the register is live at every position between A and B. Indeed, in the global case, the live "interval" of a register may consist of a number of disjoint ranges of instructions.

As an example, consider the following C program:
1: int x = ...;
2: if(...) {
3: ... a bunch of code that does not use x
4: } else {
5: int y = x + 1;
6: ... code that doesn't use x
7: }
8: ... more code that doesn't use x

The value x is live at position 1, and also at position 5. The last usage of x is at line 5, so it is not live at any position after 5, but also, it is not live at position 3 either, since if the true branch is taken, then x is never used at all. We say that x has a lifetime hole at position 3.

Another case where lifetime holes come up is virtual registers with multiple definitions. The very last step in the low level optimizer, right before register allocation, is phi node elimination. Essentially, phi node elimination turns this,
if(informal) {
x1 = "hi";
} else {
x2 = "hello";
}
x = phi(x1,x2);

into the following:
if(informal) {
x1 = "hi";
x = x1;
} else {
x2 = "hello";
x = x2;
}

Now, the value x is not live at the line x2 = "hello";, because it hasn't been defined yet, and the previous definition is not available either. So the lifetime interval for x has a hole at this location.

To represent such complex live intervals, the live-interval data type now contains a list of live ranges. Since basic blocks cannot contain control flow or multiple definitions, there can be at most one live range per basic block per live interval. (The only time multiple definitions come up is as a result of Phi node elimination, and since the input is in SSA form this cannot introduce multiple assignments in the same block).

Information from liveness analysis is used to construct live ranges. A basic block's live range contains the first definition of the register as well as the last use. If the register is in the "live in" set of the basic block, then the range begins at the block's first instruction, and if the register is in the "live out" set then the range ends at the block's last instruction. Note that even if a basic block has no usages of a register, it may appear in both the live in and live out sets; for example, the register might be defined in a predecessor block, and used in a successor block.

Allocating physical registers to live intervals


Once live intervals have been built, the next step is to assign physical registers to them. During this process, live intervals may be split, with register to memory, memory to register and register to register moves inserted in between; a single virtual register may be in different physical registers and memory locations during its lifetime.

In the local case, the allocation algorithm maintains two pieces of state; the "unhandled set", which is a heap of intervals sorted by start position, and an "active set", which is a list of intervals currently assigned to physical registers. The main algorithm removes the smallest element from the unhandled set, and then removes anything from the active set which ends before the new interval begins. The next step depends on whether or not any physical registers are free. If all physical registers are taken up in the active set, then a decision is made to spill either the new interval, or a member of the active set. Spilling will free up at least one physical register. At this point, a physical register is now free and can be assigned to the new interval, which is then added to the active set. Spilling may split intervals into smaller pieces, and add new elements to the unhandled set, but since the new split intervals are strictly smaller than the original interval, the process eventually terminates. Once the unhandled set is empty, allocation is complete.

The global case is more complicated, primarily due to the presence of lifetime holes in live intervals. A third piece of state is maintained by the allocation pass. This new set, the "inactive set", contains live intervals which have not ended yet, but are currently in a lifetime hole. To illustrate, suppose we have two physical registers, R1 and R2, and three virtual registers, V1, V2, and V3, with the following live intervals:

[Program start ....... Program end]
V1: <======> <=========>
V2: <==============>
V3: <================>
^
|
Current position

Immediately before V3's live interval begins, the active set includes V2, and the inactive set contains V1, since V1 is not finished yet, but the current position is in a lifetime hole. Suppose that V1 was assigned to physical register R1, and V2 was assigned to physical register R2. In the local case, since the active set only includes one element, V1, we could conclude that R2 was free, and that V3 could be assigned to R2. However, this is invalid, since at a later point, V1 becomes active again, and V1 uses R2. Two overlapping live intervals cannot use the same physical register. So global linear scan also has to examine the inactive set in order to see if the new live interval can fit in an inactive interval's lifetime hole. In this case, it cannot, so we have to split V3, and insert a copy instruction in the middle:
      [Program start ....... Program end]
V1: <======> <=========>
V2: <==============>
V3_1: <=========>
V3_2: <====>
^
|
Current position

With V3 split into V3_1 and V3_2, a valid register assignment is possible, with V3_1 stored in R1 and V3_2 stored in R2.

In this case, a register assignment without any spilling is possible. However, sometimes, virtual registers have to be stored to memory. This is done by the spilling code which uses a similar algorithm to the local case. The main complication is that in order to free up a physical register for the entire lifetime of the new interval, more than one interval may need to be split; zero or one active intervals, and zero or more inactive intervals.

Assigning physical registers to instructions


After live intervals are split, and physical registers are assigned to live intervals, the assignment pass iterates over all instructions, storing a mapping from virtual registers to physical registers in each instruction. Recall that this cannot be a global mapping, since a virtual register may reside in several different physical registers at different locations in the program.

The algorithm here is essentially unchanged from the local case. Two additions are that at the start and end of each basic block, the assignment pass records which registers are live. This information is used by GC checks. GC checks are performed on basic block boundaries of blocks which allocate memory. Since a register may be defined in one basic block, and used in another, with a GC check in between, GC checks need to store registers which contain pointers to a special location in the call stack and pass a pointer to this location to the GC. Some language implementations use a conservative GC to get around having to do this, but I think recording accurate pointer root information is a superior approach. This liveness information is also used by the resolve pass, coming up next.

The resolve pass


Since linear scan does not take control flow into account, interval splitting will give incorrect results if implemented naively. Consider the following example program:
x = ...;

if(...) {
...
y = x + 1;
... some code with high register pressure here
} else {
z = x + 2;
}

If the live interval of x was spilled immediately after the line y = x + 1; then the register allocator would rewrite it as follows -- it would be spilled after the last usage before the spill point, and reloaded before the first usage after the spill point:
x = ...;

if(...) {
...
y = x + 1;
spill(x);
... some code with high register pressure here
} else {
reload(x);
z = x + 2;
}

In this case however, the spill and reload locations are in different control flow paths. However, linear scan does not take control flow into account during the main algorithm. Instead, a subsequent resolve pass. The resolve pass looks at every edge in the control flow graph, and compares the physical register and spill slot assignments of all virtual registers which are both live across the edge. If they differ, moves, spills and reloads are inserted on the edge. Conceptually, this pass is simple, but it feels hackish -- it seems that a more elegant register allocator would incorporate this into the previous stages of allocation. However, all formulations of linear scan I've seen so far work this way. The only tricky part about the resolve pass is that the moves have to be performed "in parallel"; for example, suppose we have two virtual registers V1 and V2, and two physical registers R1 and R2. Block A and block B are joined by an edge. At the end of block A, V1 is in R1, and V2 is in R2. At the start of block B, V1 is in R2 and V2 is in R1. Doing the moves in any order would give the wrong result; for correct behavior, the cycle has to be broken with a third temporary location. The logic for doing this in the general case is a bit complex; Doug Coleman helped me implement the mapping code.

General observations


While implementing the register allocator improvements, I relied heavily on unit testing as well as an informal "design by contract"; basically, making extensive use of runtime assertions in the code itself. The compiler.cfg.linear-scan vocabulary weighs in at 3828 lines of code; 2484 lines of this are unit tests. This is certainly a much higher test:code ratio than most other Factor code I've written, and some of the tests are not very useful or redundant. However, they represent a log of features and bugs I've implemented, and at the very least, minimize regressions.

When testing the register allocator, I first got it working on x86-64. Here, there are enough registers that spilling almost never occurs, so I could focus on getting the primary logic correct. Then, I hacked the x86-64 backend to only use 4 physical registers, and debugged spilling.

The whole project took a bit longer than I had hoped, but I managed to start implementing a number of additional optimizations in the meantime. When they are more complete I will blog about them.

4 comments:

Sandro Magi said...
This comment has been removed by the author.
Sandro Magi said...

While industry standard, linear scan has been superceded by puzzle solving IMO.

Anonymous said...

Anythings better than the old ad hock "Regester Allocator" from a year ago.

intensive spanish class said...

How funny this is!!!

I was thinking this blog as literature blog found that post about computer language..hahaha

It's funny. Anyways it's good post. :)