Sunday, August 24, 2008

New optimizer

Roughly speaking, Factor's compiler has three main components:
  • Front-end -- convert quotations to compiler IR
  • Optimizer -- remove redundancies and unnecessary dynamism from IR
  • Code generator -- generate machine from IR

I'm finishing up with a rewrite of the first two components; this took just over a month and comprises 8200 lines of new code. I haven't pushed it yet, but I will in the next few days. I just want to test it a bit more before unleashing it on the world. A rewrite of the code generator will materialize before the year ends.

The new front-end

The new front-end is faster and emits a redesigned compiler IR. The main change to the IR is that it is now a static single assignment IR, and more control flow information is recorded via "phi nodes". The additional information enables more optimizations to be performed.

As before, the compiler IR is essentially stack code which is annotated with values, and with symbolic shuffles instead of shuffle words (for example, over is represented as something like a b -> a b a). A key property is that if we strip out all the value annotations, we again end up with stack code; I exploit this with a debugging tool which converts a quotation to IR, optimizes it, and converts it back to a quotation. This helps spot missed opportunities for optimization, as well as compiler bugs.

Even the code generator ignores the value annotations and only looks at the stack code; so the value annotations are only used by the optimizer itself, as it tracks the stack flow of values between words; and even the optimizer could dispense of them, at the cost of some algorithmic complexity. Because of this, one can think of the optimizer as a source-to-source transform; the sophistication and accuracy of the optimizations performed is unusual for such an approach, and this is surely a testament to the interesting properties and versatility of concatenative languages.

The new optimizer

The old optimizer was structured as a loop. Each iteration of the loop began by performing a series of analysis passes on the IR without actually changing the IR. Then, the actual optimization pass would run, iterating over every node in the IR. For each node, a series of optimization rules would be applied. If any nodes were changed, the main optimizer loop would run again. This way, it would iterate until all optimizations converge. This approach had several problems. First, it was inefficient; even if all optimizations were performed on the first pass, all analysis would run at least twice, and some of them were expensive. Second, it would miss optimization opportunities because it did not perform analysis concurrently with optimizations. The new optimizer is structured differently. It consists of 8 passes where each pass runs exactly once. Some passes only perform analysis, others only modify the IR based on prior analysis, others do both. Even though no pass runs more than once, the new optimizer can pick up optimization opportunities that the old one could not. It is also simpler, because each pass performs a single transformation, and faster.

First pass: normalization

This pass cleans up the IR output by the front-end to make it more amenable to subsequent analysis. This pass itself consists of several smaller passes, collecting information and making changes to the IR. Here is a description of two of the things it performs.

Suppose we are compiling this quotation:
[ 2 + * ]

Upon encountering the +, the front-end notices that a value must be popped from the stack because + needs 2 inputs, and again, with *. The front-end records this as #introduce nodes in the IR. For subsequent optimizations, it is advantageous to know how many inputs are expected on the stack at the very beginning of the quotation, so the normalize pass moves all #introduce nodes to the beginning of the IR and combines them into a single #introduce node.

Now, consider this quotation:
[ 0 swap [ + ] each ]

The each word is declared to have stack effect ( seq quot -- ), however in this case the quotation accesses one additional stack item -- we're using the stack as an implicit accumulator -- so this usage of each is really ( accum seq quot -- accum ). Being able to use the stack as an accumulator and thus turn each into a fold without any extra work is a great feature of Factor, however it requires some extra work on the compiler's part. The normalize pass analyzes inline recursive combinators and rewrites them as if the accumulator was explicit.

Second pass: propagation

This pass combines sparse conditional constant propagation, type inference, numerical range inference, and method dispatch elimination. All of this is combined into a single pass because the result is stronger than any order or any repetition of these passes, even if iterated until fixed point. This pass uses the "optimistic assumption" methodology: it begins by assuming that all code is unreachable, and all values have the "bottom" type (the type which no value can inhabit). As evidence contradicting these assumptions is collected, they are gradually relaxed; relaxing assumptions uncovers more evidence, and so on.

The old type inference was "pessimistic": it assumed all values had the "top type" (in Factor, object, meaning the type of all values) and that all code is reachable. The optimistic assumption derives stronger results than the pessimistic assumption.

The classical optimization which demonstrates the power of combining several analysis passes into one, is SCCP. We can construct a loop where a value is only seen to be a constant if unreachable code elimination is performed, but the code in question only becomes unreachable if a certain value is known to be a constant. Here is the code in Factor:
0 10 [ dup 0 > [ 1 + ] when ] times

After a quick examination of the above, it should be convincing that the value on the top of the stack will always remain 0, and the quotation [ 1 + ] is unreachable.

Suppose we perform unreachable code elimination and constant propagation separately, and furthermore, suppose we assume all code is reachable initially. Then we must conclude that the value on the top of the stack after an iteration of the while loop could potentially be any positive integer, because we either return the input or we add one to the previous value. Because of this, the test 0 > might be true or false, and the compiler would not be able to delete this branch.

On the other hand, SCCP will detect that on the first iteration, the test 0 > evaluates to false because the value on the stack is 0, and thus the branch is not taken, and the loop exits with 0 on the stack; thus the value is always zero, and the test always evaluates to false, and the branch is never taken.

Combining optimizations works nicely with method dispatch elimination also. For example, consider this contrived program:
GENERIC: my-generic ( obj -- obj' )

M: integer my-generic 1+ ;
M: string my-generic "!" append ;

: my-program ( -- ) 0 10 [ my-generic ] times . ;

Clearly, this is the same as:
: my-program ( -- ) 0 10 [ 1+ ] times . ;

Factor's old optimizer did not detect this: after all, the generic word could output potentially any type of object, and so the value at the top of the stack at the beginning of a loop iteration could also be any type, and the subsequent method inlining pass thus had no information to work with.

By combining type inference with optimistic method inlining, we see that on the first iteration, we are calling my-generic with an integer on the stack, and thus we can inline the method; then we know that 1+ applied to an integer yields an integer, thus the value on the stack is always an integer, thus the inlining decision was valid.

Of course, not all inlining decisions are valid; because the propagation pass operates under the optimistic assumption where inferred types of values only ever widen, sometimes the widening of a value's type invalidates prior inlining decisions. For example, suppose we had this instead:
GENERIC: my-generic ( obj -- obj' )

M: integer my-generic number>string ;
M: string my-generic "!" append ;

: my-program ( -- ) 0 10 [ my-generic ] times . ;

Here it is no longer valid to inline either method inside the loop, so when the propagation pass finishes, there will be no inlining in the IR.

This underscores an important difference between optimistic and pessimistic analysis; if we stop a pessimistic analysis half-way through, all inferred type are still valid, since they are, at worst, an over-estimate. An optimistic analysis must be carried out to completion, since stopping early will possibly leave us with assumptions which are too narrow. This is not a problem with Factor's optimizer, but it might be an issue in JITs where compile time is of the utmost importance and the compiler stops certain optimization passes half-way if it detects that they have run for too long.

The propagation pass also tracks types of immutable tuple slot values, and types of array lengths. The latter is very useful indeed, because as far as propagation is concerned, a "type" is a very rich notion, encompassing a value's class, but also a potential literal value, and if the type is a subtype of the real numbers, a potential range of values. Through a combination of type inference, method inlining, and unreachable code elimination, the propagation pass is able to eliminate the bounds check in the call to first in the following code, because it knows we are getting the first element of an array which is either 2 or 3 elements in length:
[ 2array ] [ 0 3array ] if first

Inference of tuple slot value types is a very important enabling optimization because it enables the subsequent escape analysis and tuple unboxing optimizations to operate on nested structure.

A more minor improvement is that while the old optimizer did not perform constant folding in this case, the new propagation pass does:
[ 2 2 + ] [ 4 ] if 4 *

I believe this was an oversight on my part, since nothing in the old design prevented it from exploiting this opportunity for optimization (it would have taken several iterations until fixed point was reached though, whereas the new optimizer only takes one.)

Third pass: cleanup

The propagation pass annotates IR nodes with type information, but it also modifies the IR by inlining methods and removing unreachable code. However, it cannot actually make destructive changes to the IR as it runs, because of the optimistic assumption; it may begin by assuming a certain branch of a conditional is never reached, however a subsequent discovery may lead it to reverse this assumption. Because of this, it does not make destructive changes to the IR, it only annotates the IR with assumptions which so far, have not been disproven. After propagation finishes, the assumptions which still remain are those which were shown to be valid. It is the job of the cleanup pass to finalize them; methods which are marked for inlining are actually inlined, branches which are still marked unreachable are deleted, and calls to words where the output is annotated as a constant are deleted and replaced with literals.

Fourth pass: loop detection

This is a very simple pass intended to help the compiler generate better code for inline recursive combinators by finding those which can share a call stack frame with their parent word. The general strategy is the same as previously, and I blogged about it when I first implemented it, however the algorithm itself is more efficient.

Fifth pass: escape analysis

The input to this pass now hopefully has as much generic dispatch as possible removed, and as many methods as possible inlined. There are a lot more calls to primitive words, and unsafe words, than there are in ordinary Factor code written by the user. Perhaps some tuples are now allocated in the same word where they are used, without ever being returned from the word. It is the job of the escape analysis pass to detect these tuples. I have written about escape analysis recently in great detail, so I won't describe the algorithm here, only how it fits in with the optimizer. Unlike the other passes, it does not modify the IR, it only collects information. Right now, there isn't a lot of code out there that is written with tuple unboxing in mind, so for the most part, escape analysis will be detecting curry and compose instances which don't really have to be constructed at runtime. Previously, this was done by the front-end, but it was fragile, messy, and not general enough.

Sixth pass: tuple unboxing

This pass uses information collected by escape analysis to eliminate unnecessary tuple allocations. If a tuple is determined to not escape, we delete its allocation from the IR, and replace all occurrences of the tuple's value in stack shuffle nodes with a series of slot values. Slot accesses become stack shuffles where the input is the set of slot values, and the output is the specific value being selected.

There is a parallel here between the propagation and cleanup passes, and the escape analysis and tuple unboxing passes. The first pass collects information and either annotates the IR (propagation) or stores it in an auxiliary structure (escape analysis). The second pass uses the collected information to rewrite the IR into something more efficient. The next two passes follow the same pattern; first we collect some information, then we rewrite the IR using this information.

Seventh pass: def-use chain computation

This pass builds a hashtable where the keys are values defined in the IR, and the values are pairs; the first item in the pair is the IR node which defines this value, and the second item in the pair is a sequence of IR nodes which use this value. In the Factor IR, each value is used at least once; even if a value is computed and never needed again, it will at least be dropped from the stack and thus used by a #shuffle node; if a value is computed and immediately returned from the word, it is used by a #return node.

Eighth pass: dead code elimination

This is the final pass; it uses def-use information to find values which are computed then never needed, and rewrites the code to delete them. Because each value is used at least once, this is not dead code elimination in the classical SSA sense; it is a bit more complex than that. In addition to deleting computations of unneeded values, it also has to modify stack shuffles and insert drops where appropriate. It also has to detect values which are computed, shuffled around on the stack, and then dropped far from the location where they're defined.

Just like propagation, dead code elimination uses the optimistic assumption. It begins by assuming that all values are dead, then iterates the IR, marking values as live once it detects that they escape from the word, continuing the analysis until no more values can be shown to be live.

The old optimizer's dead code elimination was much more simplistic (and pessimistic).

The new optimizer can rewrite
[ 4 + ] [ 5 ] if drop

[ ] [ ] if

(Empty conditionals are not deleted right now; maybe the code generator will do this later). The old optimizer did not see this opportunity.

The new optimizer can also rewrite
pop >r ... r> drop

pop drop ...

The old optimizer did nothing here, because pop has a side effect (it deletes the last element of the sequence, then pushes it on the stack), so it could not touch it or the value it produces. The new optimizer recognizes that the value does not need to travel around on the stack; since it is never needed again, we can just drop it immediately and remove it from all shuffles that it participates in, even if we cannot delete the call to the word which produces it.

The new optimizer also does a better job of eliminating unnecessary stack shuffles in general. Consider the following code:
[ + ] [ drop ] 2bi /

This type of thing is pretty common; calling cleave combinators and dropping some of the values in some of the quotations. After inlining, this gives us:
over over >r >r + r> r> drop /

There is some unnecessary shuffling here, because a value is placed on the retain stack only to be dropped later. However, this value is a copy of a value which is passed to + and so is live; the old optimizer gave up at this point, and did not differentiate between the different copies of a value. The new optimizer does, and even if one copy of a value is live, the other copies might still be dead and optimized away, and so the above code is optimized down to the following:
over >r + r> /

The code generator subsequently combines over >r into a single optimal instruction sequence which loads the second stack value and saves it on the retain stack.

In the event that type inference and inlining had replaced the call to + with a call to an fixnum+ or float+, there would be no stack traffic at all, and whether or not dead code elimination had eliminated that unnecessary retain stack shuffling would not make a difference. However, in the general case where the cleave combinator is calling non-inline words and not everything ends up in the same basic block, the optimization does make a difference.

Another situation where the new dead code elimination helps and the old one did nothing is the following common pattern; suppose we have a word do-it with stack effect ( a b -- ). Sometimes, people write
[ 2dup do-it ] when 2drop

Instead of
[ do-it ] [ 2drop ] if

After inlining, the former becomes
[ 2dup do-it ] [ ] if 2drop

Dead code elimination will detect that two of the four outputs of 2dup are dead, because they're subsequently dropped; it will then delete the call to 2dup and the 2drop it is paired with. To make the other branch correct, it inserts a call to 2drop in the other branch, and so we end up with:
[ do-it ] [ 2drop ] if

So whichever idiom you choose to use how has absolutely no effect on performance (admittedly, the difference was still negligible before, but every improvement counts.)

How I did it

I developed the new front-end and optimizer over a period of one month, without replacing the old optimizer until last week when I merged the changes in. I went with this approach because I wanted to have the full power of Factor's UI tools available while working on the compiler, and because I did not want to put up with random crashes or odd bugs because the compiler compiled itself incorrectly, and so on. However, with this development methodology, good unit tests were crucial. The old compiler only had 'live fire exercise' tests; it would compile a quotation, call it, and test that the result was as we would expect. The old tests were not immediately useful to me while I was working on the new optimizer, so I wrote a whole new set of tests which instead run individual compiler passes then perform various checks on the IR. I also developed a general IR checker which checked invariants; I caught many bugs this way before I even compiled a single line of code.

The future

For now, I will put the compiler aside to work on other things. I plan on revisiting the web framework, polishing the documentation, and fixing bugs for a little while. When I get bored of that, I will return to compiler hacking. The optimizer needs a little bit of extra polish; once this is done, I plan on doing some benchmarking and some tuning, and I expect some interesting results. The code generator rewrite will improve performance further. So far, I'm happy with the improvement in compile time alone; bootstrap went from 8 minutes to 6 minutes on my MacBook Pro.

1 comment:

Glenn Tarcea said...

Very interesting. I can't wait to try this out and to look at the code.