Sunday, August 10, 2008

An algorithm for escape analysis

In compiler implementation, escape analysis is the process of discovering which values defined within a procedure are returned or passed down to another subroutine -- these are escaping values. Compound objects which are allocated inside the procedure, and which do not escape, are candidates for stack allocation. If applied in isolation, the main benefit of stack allocation is reduced GC overhead; however with generational garbage collection, this is not so important. The real payoff comes from combining stack allocation with unboxing -- if all usages of a compound object are replaced by its constituent values, further optimizations can be applied; the constituents might get stored in registers, or become candidates for value numbering, etc. Sometimes this optimization is referred to as scalar replacement.

I implemented escape analysis in the new compiler I'm working on for Factor. Since I couldn't find a description of the algorithm with the properties I wanted in the literature, I came up with my own and documented it here.

To start with, in Factor, almost everything is done with subroutine calls -- words are very short, and object slot access involves calling another word. And most objects pass through shuffle words on their way to their destination. So escape analysis only makes sense after inlining and type inference, and we also need to treat shuffle words specially, not as other subroutine calls. The latter is already taken care of, however; the compiler has special knowledge of shuffle words so that it can perform non-local optimizations.

A more subtle issue is that of writable slots. At present, my escape analysis only considers tuples where all slots are declared read-only as candidates for unboxing. Writing into slots creates difficulties; aliasing must be considered, and mixing the unboxing of mutable objects together with multi-shot continuations doesn't seem to make sense either (capturing a continuation copies the stack but not the heap; so if we restore a multishot continuation several times and read a mutable slot, we can observe whether unboxing took place or not by seeing if restoring the continuation restored its old value). Technically, I could extend it to unbox tuples whose slots are never written into, not just those whose slots are forced to be read-only. I will probably make this change if it avoids complicating the code too much.

Escape analysis is very beneficial in Factor; for example, iteration over a virtual sequence can be rewritten by the compiler to avoid allocating the virtual sequence altogether. It also enables programming idioms such as "cursors": you can represent the iteration state over a collection as an object, and have the object be unboxed at compile time, turning high-level functional code into tight loops in registers. In general, by reducing the cost of allocating temporary objects, better factoring is encouraged and cleaner code can be written.

The idea is basically to calculate which values may be unboxed, and for each such value, we replace its allocation with a no-op, replace each slot selection with a stack shuffle, and "widen" each shuffle that it passes through to permute its slots.

For example, suppose we define a new data type, the academic functional programmer's favorite,
TUPLE: cons { car read-only } { cdr read-only } ;
Now if we have this code,
: foo-0 ( -- a b ) 1 2 cons boa dup car>> swap cdr>> ;

We would expect the compiler to convert it to something like the following:
: foo-0' ( -- a b ) 1 2 2dup drop rot nip ;

The transformation proceeds as follows:
  • the constructor was removed altogether
  • the dup became 2dup because now the cons is two items on the stack
  • the first slot selection (car>>) became drop since this drops the second slot value while retaining the first one
  • the swap became rot since one of the inputs was the tuple, which is now two values on the stack
  • the second slot selection (cdr>>) became a nip since this drops the first slot value while retaining the second one

The codegen would then simplify the shuffling to a no-op, and the word would become nothing more than a push of two literal values on the stack.

To formalize the above, we need to come up with a good definition for a value which "escapes". We formulate this in terms of the inputs and outputs of the various nodes in the high-level SSA IR of the Factor compiler. Since this is SSA, each value is defined only once, so we may identity a value with its definition; we can say that a value is the result of a specific operation, such as a memory allocation.

First, every value has an entry in the allocation map. The entry is either f, meaning the value's definition has not been seen yet, t, meaning the value is allocated outside of this word, or a sequence of values, meaning that this value is a tuple allocation whose slots are from this sequence. When the analysis is done, no value has an entry of f, however while the analysis is in progress, we may encounter such cases, and they have to be handled -- basically, an entry of f means "this entry may change to anything else in the future".

Clearly, a value which is returned from the word in question escapes -- these values are stored as the input to a #return node in the compiler IR. Also, values which are inputs to subroutine calls escape -- these are the #call nodes. One exception is that the slot selection primitive slot, when called with a constant integer argument, does not force its input value to escape; after all, we need to have some way of accessing slots of stack-allocated objects. Calls to this primitive appear after type inference allows slot accessors to be inlined. Finally, values which are inputs to a call of the tuple construction primitive <tuple-boa> are not considered to escape either; we would like to unbox nested (but not circular) structure.

In the following two, the allocation escapes:
: foo-1 ( -- cons ) 1 2 cons boa ;

: foo-2 ( -- ) 1 2 cons boa . ;

Here it doesn't -- after inlining, car>> becomes 3 slot:
: foo-3 ( -- obj ) 1 2 cons boa car>> ;

In the following, neither of the two allocations escape:
: foo-4 ( -- obj ) 1 2 cons boa 3 cons boa car>> car>> ;

Now we run into our first complication. Consider the following code:
: foo-5 ( -- cons ) 1 2 cons boa 3 cons boa car>> ;

The first allocation escapes, the second one does not. Now clearly the result of car>> (3 slot after inlining) escapes by our above criteria; how do we link it with the output of 1 2 cons boa? Well, we can construct an undirected graph where the vertices are values, and two vertices are linked by an edge if a sufficient and necessary condition for one of them to escape is for the other to escape. We can thus look at connected components of this graph instead of individual values -- if any member of the connected component escapes, they all do. I call this graph the "escaping values graph".

So in the above example foo-5, we add an edge between the first input to the second cons boa call, and the output of car>>, and when we encounter the #return node and see that the output of car>> escapes, we will also know that its input escapes; hence the first allocation escapes, but the second one does not.

A number of nodes in the IR are created when shuffle words are converted -- these are #>r, #r>, and #shuffle. They do not make values escape, but they do define new values (this is SSA), and whether those values escape or not depends on whether their inputs escape, and vice versa.

This allows us to catch that the following allocation escapes:
: foo-6 ( -- cons ) 1 2 cons boa >r "Hello world" print r> ;

but the following doesn't:
: foo-7 ( -- obj ) 1 2 cons boa >r "Hello world" print r> cdr>> ;

A related type of node is the #copy node, which arises as a result of other optimizations. We also add edges to the escaping values graph mapping its inputs to its outputs.

This pretty much sums it up for straight-line code without branches or recursion. You build a value graph to track value equivalence, mark certain values as escaping when you encounter call and return nodes, then when you're done, you know that the allocations whose connected components in the graph do not contain escaping values can be safely unboxed.

The situation with conditionals is trickier. Suppose you have a word bar which outputs a cons, but cannot be inlined for whatever reason, and you use the word as follows:
: foo-8 ( ? -- obj ) [ 1 2 cons boa ] [ bar ] if car>> ;

Here, the first allocation does not escape -- the only place it can ever be used is the car>>; but if we replace it with two items on the stack and change the call to car>> into a stack shuffle, then in the event of the second branch being taken, we would get incorrect behavior; here we certainly do want the accessor to be called, and we don't expect bar to return two values on the stack either.

So clearly, in the above, we have to somehow infer that the result of the allocation escapes, even though it is never an input to a call or a return. How do we do this? Well, here is the SSA form for the above, in somewhat stylized, C-like form, with x the top of the stack:
if(x)
y1 = cons(1,2);
else
y2 = bar();
y = phi(y1,y2);
z = y.car;
return (z);

The problem here is the #phi node has two inputs, one of which has an entry in the allocation map {y1,y2}, and the other one has a value of t, since it is allocated outside of this word. Therefore, we cannot unbox y1, since that would mess up the #phi node. So we add a check for this; if the inputs to a #phi nodes have incompatible allocation map entries, either because of them is a sequence of values and the other is t, or because they are both sequences of values of different length, we mark all values as escaping, ruling out any unboxing. On the other hand, if all entries are compatible, we have to add new #phi nodes to unify the slots. Even if unboxing does not take place for other reasons, these #phi nodes are important to track relations between tuple slots.

For example, consider this program:
: foo-9 ( ? -- obj ) [ 1 2 cons boa ] [ 3 4 cons boa ] if car>> ;

The two allocations are compatible and the compiler should rewrite the above into roughly the following:
: foo-9' ( ? -- obj ) [ 1 ] [ 3 ] if ;

Here is the SSA form of foo-9:
if(x)
{
a1 = 1;
b1 = 2;
y1 = cons(a1,b1);
}
else
{
a2 = 3;
b2 = 4;
y2 = cons(a2,b2);
}
y = phi(y1,y2);
z = y.car;
return (z);

Escape analysis would add the following entries to the allocation map:
a1 => t
a1 => t
y1 => (a1,b1)
a2 => t
a3 => t
y2 => (a2,b2)

a3 => t
b3 => t
y => (a3,b3)

The last three entries are added as a result of the #phi node being traversed.

As for the value graph, the slot access would link z with a3 as soon as we encounter the statement z = y.car;.

However, we actually need to add more edges to the value graph than just this one. Consider the following program, almost the same as foo-9 except we return the cons cells on the stack:
: foo-10 ( ? -- cons ) [ 1 2 cons boa ] [ 3 4 cons boa ] if ;

Here, we cannot unbox either value and both allocations escape. However, while we would end up marking the output of the #phi as an escaping value, nothing would be done to the outputs of cons boa, resulting in incorrect code. So what we need to do is link the inputs and outputs of a phi in the escaping values graph, thus if any one of them escapes, they all escape. Here is the SSA form for foo-10:
if(x)
{
a1 = 1;
b1 = 2;
y1 = cons(a1,b1);
}
else
{
a2 = 3;
b2 = 4;
y2 = cons(a2,b2);
}
y = phi(y1,y2);

/* Inserted by escape analysis */
a3 = phi(a1,a2);
b3 = phi(b1,b2);

return (y);

Here is the allocation map:
a1 => t
a1 => t
y1 => (a1,b1)
a2 => t
a3 => t
y2 => (a2,b2)
a3 => t
b3 => t
y => (a3,b3)

And here are the edges of the escaping values graph for foo-10:
a3 -- a1
a3 -- a2
b3 -- b1
b3 -- b2
y -- y1
y -- y2

Because y escapes, none of the values in its connected component -- namely, y, y1, and y2 -- are candidates for unboxing.

This approach handles all other corner cases involving branches also:
: foo-11 ( ? -- cons ) [ 1 2 cons boa dup . ] [ 3 4 cons boa ] if car>> ;

Here, the first allocation cannot be unboxed, and this prevents the second allocation from being unboxed since they meet at a #phi node.

Here is another example:
: foo-12 ( ? -- cons ) [ 1 2 cons 3 cons boa ] [ bar 4 cons boa ] if car>> car>> ;

What happens here? Well, let's look at the SSA form:
if(x)
{
a1 = 1;
b1 = 2;
y1 = cons(a1,b1);
c1 = 3;
z1 = cons(y1,c1);
}
else
{
y2 = bar();
c2 = 4;
z2 = cons(y2,c2);
}

z = phi(z1,z2);

/* Inserted by escape analysis */
y3 = phi(y1,y2);
c3 = phi(c1,c3);

u = z.car;
v = u.car;

return(u);

Here is the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
c1 => t
z1 => {y1,c1}
y2 => t
c2 => t
z2 => {y2,c2}
y3 => t
c3 => t
z => {y3,c3}
u => t
v => t

Here are the edges of the value graph:
y3 -- y1
y3 -- y2
z -- z1
z -- z2
v -- y3

Note that while y1 does not escape, it is in the same connected component as y2, which does escape, therefore y1 cannot be unboxed. However, z1 and z2 can both be unboxed.

Recursion and loops arising from usage of combinators such as each create #phi nodes where some of the inputs are defined after the #phi node in a post-order traversal of the dominator graph.

For example,
10 [ . ] each

yields the following SSA form:
int x0 = 0;
int x1 = 10;

L: x2 = phi(x0,x3);
if(x2 < x1)
{
.(x2);
x3 = x2 + 1;
goto L;
}

Note that x3 is defined after the #phi node referencing it.

When the escape allocation encounters, some of the inputs have entries of f in the allocation map. These values are ignored when checking for compatibility; we optimistically assume that they will later be defined in a way which does not force the value to be escaping. After the first iteration of the analysis, all values should now have non-f entries in the allocation map. We repeat the analysis until the allocation map entries of the outputs of #phi nodes stop changing; while it is possible to construct code where this can take any number of iterations, in real examples it should terminate after one or two passes.

For example, in the following, the allocations can be unboxed:
: foo-13 ( n -- obj ) 1 2 cons boa swap [ drop 3 4 cons boa ] times car>> ;

We'd expect the compiler to convert the code into the following:
: foo-13' ( n -- obj ) 1 2 rot [ 2drop 3 4 ] times drop ;

Here is the SSA form:
int a1 = 1;
int b1 = 2;
int y1 = cons(a1,b1);

int x0 = 0;

L: x2 = phi(x1,x3);

/* Inserted by escape analysis */
a3 = phi(a1,a2);
b3 = phi(b1,b2);

y2 = phi(y1,y3);
if(x2 < n)
{
x3 = x2 + 1;
a2 = 3;
b2 = 4;
y3 = cons(a2,b2);
goto L;
}

z = y2.car;
return(z);

Here is the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
x0 => t
x2 => t
a3 => t
b3 => t
y2 => {a3,b3}
x3 => t
a2 => t
b2 => t
y3 => {a2,b2}
z => t

And the edges of the escaping values graph:
a3 -- a1
a3 -- a2
x2 -- x1
x2 -- x3
b3 -- b1
b3 -- b2
y2 -- y1
y2 -- y3

The only value marked as escaping is z, and its connected component only contains itself, so the two allocations may be unboxed.

On the other hand, if we have
: foo-14 ( n -- obj ) 1 2 cons boa [ 3 cons boa ] times car>> ;

Unboxing cannot take place. We cannot unbox the second allocation, because it builds recursive structure, and we cannot unbox the first because its result is stored inside the second allocation, which is not unboxed. Here is the SSA form:
int a1 = 1;
int b1 = 2;
int y1 = cons(a1,b1);

int x0 = 0;

L: x2 = phi(x1,x3);
y2 = phi(y1,y3);

/* Inserted by escape analysis */
a3 = phi(y2,a1);
b3 = phi(b1,a2);

if(x2 < n)
{
x3 = x2 + 1;
a2 = 3;
y3 = cons(y2,a2);
goto L;
}

And the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
x0 => t
x2 => t
a3 => t
b3 => t
y2 => {a3,b3}
x3 => t
a2 => t
y3 => {y2,a2}
z => t

And the edges of the escaping values graph:
a3 -- a1
a3 -- y2
x2 -- x1
x2 -- x3
b3 -- b1
b3 -- a2
y2 -- y1
y2 -- y3

We see that y2 and y3 are in the same connected component as a3, which has an allocation map entry of t, therefore neither one may be unboxed.

As far as the implementation goes, the above gives a high-level overview. The only tricky part is the implementation of the escaping values graph. I used a disjoint set (union find); adding an edge adds equivalences, and marking a value as an escaping value makes it equivalent to a special "escaping" object. Checking if a value is in the same connected component as an escaping value then is just a check to see if the value is equivalent to "escaping". A disjoint set is a slightly exotic data structure which perform this check in what is essentially constant time.

I implemented this algorithm; for now, you can find it in the unfinished/compiler/tree/escape-analysis directory. The pass which actually performs the unboxing using the information collected during escape analysis is in unfinished/compiler/tree/tuple-unboxing. In comparison to the analysis, the rewrite stage is pretty simple, so I won't describe the details here.

1 comment:

Unknown said...

When the cons example is introduced, I think rot should be -rot. 2dup drop rot nip is equivalent to swap, but 2dup drop -rot nip is equivalent to the identity.