Sunday, August 31, 2008

Factor is now five years old

Time flies! It is hard to believe that another year has gone by and it is now five years since I began to work on the project that would become Factor.

Community


Really, the most important thing that has happened in the last year is that there are some new contributors around. Big shout-out to James Cash, William Schlieper, Joe Groff, Phil Dawes, Aaron Schaefer, Alfredo Beaumont and anybody else who joined the project within the last year. Along with the long-time contributors who were there from the very beginning -- Doug Coleman, Eduardo Cavazos, Chris Double, Daniel Ehrenberg, and everyone else who has helped out over the years -- you people make Factor fun. I'm honored to be working with such high-caliber programmers as the ones in the Factor community: they are the most interesting, intelligent and creative people I know in the software trade. I never anticipated that many people would show any interest in Factor at all, but nowadays it is common to see more than 50 people in the #concatenative channel of irc.freenode.net.

Switch to Git


At the end of last year, we switched to git from darcs for revision control. I'd like to think the changeover was a complete success -- git has many powerful features and the five-minute-long "darcs push" and "darcs pull" waits are a thing of the past.

Continuous integration


Instead of having manual releases once in a while and keeping the repository in various states of breakage between releases, we've switched to a model of continuous builds and automatic binary packages. We have binary packages for 11 platforms. For those who prefer to build from source, known-good, automatically-updated clean branches are available. Stability has improved with continuous integration, since unit tests are being run all the time, and the days of checking out Factor from the repository only to have bootstrap fail are long gone.

Soon we will push ahead and add even more platforms; in the near future we will be adding the missing 64-bit and PowerPC systems and the total number of supported platforms with binary packages will increase to 17. Write once, run anywhere.

Language changes


Some of the language changes over the last year have helped make Factor more expressive and maintainable; idiomatic Factor from today is a huge improvement over the state of affairs a year ago.

While language evolution was a big focus of mine for the first few months of the year, I have taken a break from major language changes over the last six months to focus on libraries, the compiler, and stability. This focus paid off, as many bugs and performance problems were fixed; very soon I will start moving the language forward again, with multiple dispatch at the top of the list.

Library improvements


Many notable new libraries were added over the last year, and the I/O system alone saw some major progress.

The implementation


Along with improvements to the language and library, we've had major progress on the low-level plumbing. Since the goal is to make Factor ready for high-load production applications, I've had to learn a lot about garbage collection and compilers over the last five years, and especially over the last year.

The biggest change in the last year is that Factor is no longer a dual interpreted/compiled implementation. The interpreter has been replaced with a non-optimizing compiler, and the optimizing compiler saw some major improvements. There have also been countless improvements to the GC, and algorithmic tuning of the core libraries.

The web site


We now have a community blog aggregator, and a pastebin for the IRC channel. Since we eat our own dogfood, both of those are, naturally, written in Factor and running on the Factor HTTP server, along with the main Factor website.

Line counts


Every year, I've given some line number counts for the Factor codebase. Here are the latest stats:
  • Factor VM: 12050 lines of C, 391 lines of assembly
  • Factor core: 9077 lines of Factor
  • Factor basis library: 64311 lines of Factor
  • Factor extra library: 50825 lines of Factor
  • Unit tests: 24308 lines of Factor
  • Documentation: 242426 words of help markup

Compared to last year,
  • The VM has grown by about 2000 lines of code; this isn't much considering there have been many improvements made, and the interpreter was replaced by a simple JIT
  • Parts of core and extra were split off and moved to a new basis library which includes useful libraries, tools, the UI, and the compiler; the code that remains in the core is really the fundamentals required to parse and execute Factor code and not much else
  • The amount of library code in basis and extra has increased considerably

Line counts are often meaningless, but in this case we see a very good trend: the lowest levels of Factor, that is, the VM and the core library, are certainly not bloating at all; on the other hand, many more libraries are being added, and Factor is starting to compete with more established languages in terms of library coverage.

Last year I made the claim that Factor 1.0 would be released this year. This was perhaps too optimistic. While slapping a "1.0" sticker onto the tree could theoretically be done at any time, I want to release a 1.0 that I and everyone else in the community are truly happy with. I remember how underwhelming Java 1.0 was; I don't want to make that mistake. With continuous builds, anyone can try out and test Factor, and stability is improving all the time, so there is no need to rush forward and declare my job done prematurely. Having said that, I tentatively give myself another year until 1.0.

The rate of progress over the last year has been amazing. I'm looking forward to yet another year of Factor.

As a side note, I began jEdit development some time in August 1998. So this month jEdit turns 10 years old! I stopped maintaining jEdit in late 2005 and passed the torch to a new development team who remain active to this day; a new release was made a few weeks ago. With 2,000 downloads daily, and more than 4.5 million downloads in all, jEdit's success continues to amaze me. It is still my preferred text editor and I learned almost everything I know about software engineering from working on jEdit, with much of the rest coming from my experience working on Factor.

Sunday, August 24, 2008

New optimizer

Roughly speaking, Factor's compiler has three main components:
  • Front-end -- convert quotations to compiler IR
  • Optimizer -- remove redundancies and unnecessary dynamism from IR
  • Code generator -- generate machine code from IR

I'm finishing up with a rewrite of the first two components; this took just over a month and comprises 8200 lines of new code. I haven't pushed it yet, but I will in the next few days. I just want to test it a bit more before unleashing it on the world. A rewrite of the code generator will materialize before the year ends.

The new front-end


The new front-end is faster and emits a redesigned compiler IR. The main change to the IR is that it is now a static single assignment IR, and more control flow information is recorded via "phi nodes". The additional information enables more optimizations to be performed.

As before, the compiler IR is essentially stack code which is annotated with values, and with symbolic shuffles instead of shuffle words (for example, over is represented as something like a b -> a b a). A key property is that if we strip out all the value annotations, we again end up with stack code; I exploit this with a debugging tool which converts a quotation to IR, optimizes it, and converts it back to a quotation. This helps spot missed opportunities for optimization, as well as compiler bugs.

Even the code generator ignores the value annotations and only looks at the stack code; so the value annotations are only used by the optimizer itself, as it tracks the stack flow of values between words, and even the optimizer could dispense with them, at the cost of some algorithmic complexity. Because of this, one can think of the optimizer as a source-to-source transform; the sophistication and accuracy of the optimizations performed is unusual for such an approach, and this is surely a testament to the interesting properties and versatility of concatenative languages.

The new optimizer


The old optimizer was structured as a loop. Each iteration began by performing a series of analysis passes on the IR without actually changing it. Then the actual optimization pass would run, iterating over every node in the IR and applying a series of optimization rules to each one. If any nodes were changed, the main optimizer loop would run again; this way, it would iterate until all optimizations converged. This approach had several problems. First, it was inefficient: even if all optimizations were performed on the first pass, every analysis would run at least twice, and some of them were expensive. Second, it would miss optimization opportunities, because it did not perform analysis concurrently with optimizations.

The new optimizer is structured differently. It consists of 8 passes, where each pass runs exactly once. Some passes only perform analysis, others only modify the IR based on prior analysis, and others do both. Even though no pass runs more than once, the new optimizer can pick up optimization opportunities that the old one could not. It is also simpler, because each pass performs a single transformation, and it is faster.
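To make the contrast concrete, here is a rough sketch of the two architectures in Python pseudocode; this is not Factor's actual code, and all of the names are illustrative:

def old_optimizer(ir, analyses, rules):
    # Iterate analysis + rewriting until a fixed point is reached.
    changed = True
    while changed:
        changed = False
        for analysis in analyses:       # every analysis re-runs on each
            analysis(ir)                # iteration, even when nothing changed
        for node in ir:
            for rule in rules:
                changed |= rule(node)   # any change forces another full pass

def new_optimizer(ir, passes):
    # Each pass runs exactly once, in a fixed order: normalization,
    # propagation, cleanup, ..., dead code elimination.
    for optimizer_pass in passes:
        ir = optimizer_pass(ir)
    return ir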

First pass: normalization


This pass cleans up the IR output by the front-end to make it more amenable to subsequent analysis. This pass itself consists of several smaller passes, collecting information and making changes to the IR. Here is a description of two of the things it does.

Suppose we are compiling this quotation:
[ 2 + * ]

Upon encountering the +, the front-end notices that a value must be popped from the stack, because + needs two inputs and only the literal 2 is on the stack so far; the same thing happens again with *. The front-end records this as #introduce nodes in the IR. For subsequent optimizations, it is advantageous to know how many inputs are expected on the stack at the very beginning of the quotation, so the normalize pass moves all #introduce nodes to the beginning of the IR and combines them into a single #introduce node.
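As a sketch of what this normalization does, here is some illustrative Python; the node representation is an assumption on my part, not the compiler's real data structures:

from dataclasses import dataclass

@dataclass
class Introduce:
    count: int   # number of stack items expected from the caller

def merge_introduces(nodes):
    # Hoist all #introduce nodes to the front, combined into a single node.
    total = sum(n.count for n in nodes if isinstance(n, Introduce))
    rest = [n for n in nodes if not isinstance(n, Introduce)]
    return ([Introduce(total)] if total else []) + rest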

Now, consider this quotation:
[ 0 swap [ + ] each ]

The each word is declared to have stack effect ( seq quot -- ), however in this case the quotation accesses one additional stack item -- we're using the stack as an implicit accumulator -- so this usage of each is really ( accum seq quot -- accum ). Being able to use the stack as an accumulator and thus turn each into a fold without any extra work is a great feature of Factor, however it requires some extra work on the compiler's part. The normalize pass analyzes inline recursive combinators and rewrites them as if the accumulator was explicit.

Second pass: propagation


This pass combines sparse conditional constant propagation, type inference, numerical range inference, and method dispatch elimination. All of this is combined into a single pass because the result is stronger than any order or any repetition of these passes, even if iterated until fixed point. This pass uses the "optimistic assumption" methodology: it begins by assuming that all code is unreachable, and all values have the "bottom" type (the type which no value can inhabit). As evidence contradicting these assumptions is collected, they are gradually relaxed; relaxing assumptions uncovers more evidence, and so on.

The old type inference was "pessimistic": it assumed all values had the "top type" (in Factor, object, meaning the type of all values) and that all code is reachable. The optimistic assumption derives stronger results than the pessimistic assumption.

The classical optimization which demonstrates the power of combining several analysis passes into one is SCCP. We can construct a loop where a value is only seen to be a constant if unreachable code elimination is performed, but the code in question only becomes unreachable if a certain value is known to be a constant. Here is the code in Factor:
0 10 [ dup 0 > [ 1 + ] when ] times

After a quick examination of the above, you should be able to convince yourself that the value on top of the stack will always remain 0, and that the quotation [ 1 + ] is unreachable.

Suppose we perform unreachable code elimination and constant propagation separately, and furthermore, suppose we assume all code is reachable initially. Then we must conclude that the value on the top of the stack after an iteration of the loop could potentially be any non-negative integer, because we either return the input or we add one to the previous value. Because of this, the test 0 > might be true or false, and the compiler would not be able to delete this branch.

On the other hand, SCCP will detect that on the first iteration, the test 0 > evaluates to false because the value on the stack is 0, and thus the branch is not taken, and the loop exits with 0 on the stack; thus the value is always zero, and the test always evaluates to false, and the branch is never taken.
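Here is a toy rendition of this optimistic reasoning in Python, specialized to the example above; the lattice representation is my own simplification, not the compiler's:

BOTTOM, TOP = "bottom", "top"

def join(a, b):
    # Lattice join: BOTTOM is the identity, unequal constants widen to TOP.
    if a == BOTTOM: return b
    if b == BOTTOM: return a
    return a if a == b else TOP

def loop_body(x):
    # Abstractly evaluate [ dup 0 > [ 1 + ] when ] on a lattice value.
    if x in (BOTTOM, TOP):
        return x
    return x + 1 if x > 0 else x   # the branch only fires when 0 > holds

x = join(BOTTOM, 0)                # value flowing in from before the loop
while True:
    new_x = join(x, loop_body(x))  # join in the value from the back edge
    if new_x == x:
        break
    x = new_x

print(x)   # => 0: the counter is proven constant, the branch proven dead

A pessimistic version of the same analysis would start the loop variable at TOP instead of BOTTOM, and conclude nothing.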

Combining optimizations works nicely with method dispatch elimination also. For example, consider this contrived program:
GENERIC: my-generic ( obj -- obj' )

M: integer my-generic 1+ ;
M: string my-generic "!" append ;

: my-program ( -- ) 0 10 [ my-generic ] times . ;

Clearly, this is the same as:
: my-program ( -- ) 0 10 [ 1+ ] times . ;

Factor's old optimizer did not detect this: after all, the generic word could output potentially any type of object, and so the value at the top of the stack at the beginning of a loop iteration could also be any type, and the subsequent method inlining pass thus had no information to work with.

By combining type inference with optimistic method inlining, we see that on the first iteration, we are calling my-generic with an integer on the stack, and thus we can inline the method; then we know that 1+ applied to an integer yields an integer, thus the value on the stack is always an integer, thus the inlining decision was valid.

Of course, not all inlining decisions are valid; because the propagation pass operates under the optimistic assumption where inferred types of values only ever widen, sometimes the widening of a value's type invalidates prior inlining decisions. For example, suppose we had this instead:
GENERIC: my-generic ( obj -- obj' )

M: integer my-generic number>string ;
M: string my-generic "!" append ;

: my-program ( -- ) 0 10 [ my-generic ] times . ;

Here it is no longer valid to inline either method inside the loop, so when the propagation pass finishes, there will be no inlining in the IR.

This underscores an important difference between optimistic and pessimistic analysis; if we stop a pessimistic analysis half-way through, all inferred types are still valid, since they are, at worst, an over-estimate. An optimistic analysis must be carried out to completion, since stopping early will possibly leave us with assumptions which are too narrow. This is not a problem with Factor's optimizer, but it might be an issue in JITs where compile time is of the utmost importance and the compiler stops certain optimization passes half-way if it detects that they have run for too long.

The propagation pass also tracks types of immutable tuple slot values, and types of array lengths. The latter is very useful indeed, because as far as propagation is concerned, a "type" is a very rich notion, encompassing a value's class, but also a potential literal value, and if the type is a subtype of the real numbers, a potential range of values. Through a combination of type inference, method inlining, and unreachable code elimination, the propagation pass is able to eliminate the bounds check in the call to first in the following code, because it knows we are getting the first element of an array which is either 2 or 3 elements in length:
[ 2array ] [ 0 3array ] if first
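Here is a hedged sketch, in Python, of the kind of interval reasoning involved; the representation is illustrative only:

def join_intervals(a, b):
    # Join the length intervals flowing out of the two branches.
    return (min(a[0], b[0]), max(a[1], b[1]))

length = join_intervals((2, 2), (3, 3))   # 2array vs. 0 3array => (2, 3)
index = (0, 0)                            # first accesses index 0

# The bounds check is redundant when the index interval lies strictly
# below the length interval:
if index[1] < length[0]:
    print("bounds check eliminated")      # 0 < 2, so no check is needed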

Inference of tuple slot value types is an important enabling optimization, because it allows the subsequent escape analysis and tuple unboxing passes to operate on nested structure.

A more minor improvement is that while the old optimizer did not perform constant folding in this case, the new propagation pass does:
[ 2 2 + ] [ 4 ] if 4 *

I believe this was an oversight on my part, since nothing in the old design prevented it from exploiting this opportunity for optimization (it would have taken several iterations until fixed point was reached though, whereas the new optimizer only takes one.)

Third pass: cleanup


The propagation pass annotates IR nodes with type information, and it also makes tentative inlining and reachability decisions. However, because of the optimistic assumption, it cannot make destructive changes to the IR as it runs: it may begin by assuming a certain branch of a conditional is never reached, only for a subsequent discovery to reverse this assumption. So instead of rewriting the IR, it annotates it with assumptions which, so far, have not been disproven; after propagation finishes, the assumptions which still remain are those which were shown to be valid. It is the job of the cleanup pass to finalize them: methods which are marked for inlining are actually inlined, branches which are still marked unreachable are deleted, and calls to words whose outputs are annotated as constants are deleted and replaced with literals.

Fourth pass: loop detection


This is a very simple pass intended to help the compiler generate better code for inline recursive combinators, by finding those which can share a call stack frame with their parent word. The general strategy is the same as before, and I blogged about it when I first implemented it; however, the algorithm itself is more efficient.

Fifth pass: escape analysis


The input to this pass now has, hopefully, as much generic dispatch as possible removed, and as many methods as possible inlined. There are a lot more calls to primitive words and unsafe words than in ordinary Factor code written by the user. Perhaps some tuples are now allocated in the same word where they are used, without ever being returned from the word; it is the job of the escape analysis pass to detect these tuples. I have written about escape analysis recently in great detail, so I won't describe the algorithm here, only how it fits in with the optimizer. Unlike the other passes, it does not modify the IR; it only collects information. Right now, there isn't a lot of code out there written with tuple unboxing in mind, so for the most part, escape analysis will be detecting curry and compose instances which don't really have to be constructed at runtime. Previously, this was done by the front-end, but that approach was fragile, messy, and not general enough.

Sixth pass: tuple unboxing


This pass uses information collected by escape analysis to eliminate unnecessary tuple allocations. If a tuple is determined to not escape, we delete its allocation from the IR, and replace all occurrences of the tuple's value in stack shuffle nodes with a series of slot values. Slot accesses become stack shuffles where the input is the set of slot values, and the output is the specific value being selected.

There is a parallel here between the propagation and cleanup passes, and the escape analysis and tuple unboxing passes. The first pass collects information and either annotates the IR (propagation) or stores it in an auxiliary structure (escape analysis). The second pass uses the collected information to rewrite the IR into something more efficient. The next two passes follow the same pattern; first we collect some information, then we rewrite the IR using this information.

Seventh pass: def-use chain computation


This pass builds a hashtable where the keys are values defined in the IR, and the values are pairs; the first item in the pair is the IR node which defines this value, and the second item in the pair is a sequence of IR nodes which use this value. In the Factor IR, each value is used at least once; even if a value is computed and never needed again, it will at least be dropped from the stack and thus used by a #shuffle node; if a value is computed and immediately returned from the word, it is used by a #return node.
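A minimal sketch of this construction in Python, assuming a node representation (kind, inputs, outputs) that is mine rather than the compiler's:

def compute_def_use(nodes):
    # Map each value to a pair: (defining node, list of using nodes).
    def_use = {}
    for node in nodes:              # nodes are visited in definition order
        for value in node.outputs:  # SSA: each value is defined exactly once
            def_use[value] = (node, [])
        for value in node.inputs:   # every use is recorded, including the
            def_use[value][1].append(node)  # #shuffle that finally drops it
    return def_use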

Eighth pass: dead code elimination


This is the final pass; it uses def-use information to find values which are computed then never needed, and rewrites the code to delete them. Because each value is used at least once, this is not dead code elimination in the classical SSA sense; it is a bit more complex than that. In addition to deleting computations of unneeded values, it also has to modify stack shuffles and insert drops where appropriate. It also has to detect values which are computed, shuffled around on the stack, and then dropped far from the location where they're defined.

Just like propagation, dead code elimination uses the optimistic assumption. It begins by assuming that all values are dead, then iterates the IR, marking values as live once it detects that they escape from the word, continuing the analysis until no more values can be shown to be live.
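In Python-flavored pseudocode (again with an assumed node representation), the liveness computation looks something like this:

def find_live_values(nodes, def_use):
    # Optimistic assumption: everything is dead until proven live.
    live = set()
    work = [value for node in nodes if node.kind in ("#return", "#call")
                  for value in node.inputs]  # escaping values seed the worklist
    while work:
        value = work.pop()
        if value in live:
            continue
        live.add(value)
        defining_node, _ = def_use[value]
        work.extend(defining_node.inputs)    # a live value keeps its inputs live
    return live   # values outside this set are deleted, with drops inserted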

The old optimizer's dead code elimination was much more simplistic (and pessimistic).

The new optimizer can rewrite
[ 4 + ] [ 5 ] if drop

into
[ ] [ ] if

(Empty conditionals are not deleted right now; maybe the code generator will do this later). The old optimizer did not see this opportunity.

The new optimizer can also rewrite
pop >r ... r> drop

into
pop drop ...

The old optimizer did nothing here, because pop has a side effect (it deletes the last element of the sequence, then pushes it on the stack), so it could not touch it or the value it produces. The new optimizer recognizes that the value does not need to travel around on the stack; since it is never needed again, we can just drop it immediately and remove it from all shuffles that it participates in, even if we cannot delete the call to the word which produces it.

The new optimizer also does a better job of eliminating unnecessary stack shuffles in general. Consider the following code:
[ + ] [ drop ] 2bi /

This type of thing is pretty common; calling cleave combinators and dropping some of the values in some of the quotations. After inlining, this gives us:
over over >r >r + r> r> drop /

There is some unnecessary shuffling here, because a value is placed on the retain stack only to be dropped later. However, this value is a copy of a value which is passed to + and so is live; the old optimizer gave up at this point, and did not differentiate between the different copies of a value. The new optimizer does, and even if one copy of a value is live, the other copies might still be dead and optimized away, and so the above code is optimized down to the following:
over >r + r> /

The code generator subsequently combines over >r into a single optimal instruction sequence which loads the second stack value and saves it on the retain stack.

In the event that type inference and inlining had replaced the call to + with a call to fixnum+ or float+, there would be no stack traffic at all, and whether or not dead code elimination had eliminated that unnecessary retain stack shuffling would not make a difference. However, in the general case, where the cleave combinator is calling non-inline words and not everything ends up in the same basic block, the optimization does make a difference.

Another situation where the new dead code elimination helps and the old one did nothing is the following common pattern; suppose we have a word do-it with stack effect ( a b -- ). Sometimes, people write
[ 2dup do-it ] when 2drop

Instead of
[ do-it ] [ 2drop ] if

After inlining, the former becomes
[ 2dup do-it ] [ ] if 2drop

Dead code elimination will detect that two of the four outputs of 2dup are dead, because they're subsequently dropped; it will then delete the call to 2dup and the 2drop it is paired with. To make the other branch correct, it inserts a call to 2drop in the other branch, and so we end up with:
[ do-it ] [ 2drop ] if

So whichever idiom you choose to use now has absolutely no effect on performance (admittedly, the difference was still negligible before, but every improvement counts.)

How I did it


I developed the new front-end and optimizer over a period of one month, without replacing the old optimizer until last week, when I merged the changes in. I went with this approach because I wanted to have the full power of Factor's UI tools available while working on the compiler, and because I did not want to put up with random crashes or odd bugs caused by the compiler compiling itself incorrectly, and so on. However, with this development methodology, good unit tests were crucial. The old compiler only had 'live fire exercise' tests; it would compile a quotation, call it, and test that the result was as we would expect. The old tests were not immediately useful to me while I was working on the new optimizer, so I wrote a whole new set of tests which instead run individual compiler passes, then perform various checks on the IR. I also developed a general IR checker which checks invariants; I caught many bugs this way before I even compiled a single line of code.

The future


For now, I will put the compiler aside to work on other things. I plan on revisiting the web framework, polishing the documentation, and fixing bugs for a little while. When I get bored of that, I will return to compiler hacking. The optimizer needs a little bit of extra polish; once this is done, I plan on doing some benchmarking and some tuning, and I expect some interesting results. The code generator rewrite will improve performance further. So far, I'm happy with the improvement in compile time alone; bootstrap went from 8 minutes to 6 minutes on my MacBook Pro.

Three language changes

There are three language changes coming up as a result of the new high-level optimizer being merged in. All are minor, and I have already updated most code in the repository, however they're worth pointing out.

Recursive declarations


Words (combinators) which are declared inline and which call themselves recursively must now also be declared recursive. Most combinators are not recursive, but are instead built in terms of other combinators, such as curry, map, and so on. However, the fundamental ones are recursive, and they need to be declared as such. Furthermore, input parameters which are quotations must have their stack effect annotated with a nested stack effect declaration. For now, this is only a requirement for inline recursive words; in the future, Factor might require such annotations on all quotation parameters. Here is an example; the loop combinator calls a quotation until it returns false:
: loop ( pred: ( -- ? ) -- )
    dup slip swap [ loop ] [ drop ] if ; inline recursive

Note the nested stack effect, and the recursive declaration. If you find this too verbose, then consider building your combinator from other combinators, such as loop and while, instead of using explicit recursion.

The rationale for this change is that it simplifies the compiler front-end. Inline words which are recursive and those which are not are constructed differently in the compiler's IR, but the decision must be made before the word definition is inspected for recursive calls. The old compiler front-end would back-track and throw away the partially constructed IR upon encountering a recursive call, but this led to poor performance for code with deeply nested calls to recursive combinators. The new approach simplifies the front-end and only requires a little bit of effort from the programmer.

Stricter retain stack operation checking


Before, the compiler would tolerate code such as the following:
: foo ( a b -- c ) >r [ r> 1 + ] [ r> drop 0 ] if ;

Here the retain stack usage was balanced within the entire word, but not in individual quotations. Again, this led to complications; the compiler IR had to express not only data stack but retain stack values at control flow merge points, and with my move toward an SSA IR, the bookkeeping became a pain. Since this type of code is harder to understand and doesn't come up very often anyway, I decided to simplify the compiler and prohibit it. Now, usages of >r and r> must be balanced within every quotation, not just the entire word.

Old-style delegation has been removed


I implemented inheritance in April this year, so delegation has been deprecated for four months now, and it's finally out. Most libraries have been updated to use either inheritance, or the more powerful delegation library by Daniel Ehrenberg.

Sunday, August 10, 2008

An algorithm for escape analysis

In compiler implementation, escape analysis is the process of discovering which values defined within a procedure are returned or passed down to another subroutine -- these are escaping values. Compound objects which are allocated inside the procedure, and which do not escape, are candidates for stack allocation. If applied in isolation, the main benefit of stack allocation is reduced GC overhead; however with generational garbage collection, this is not so important. The real payoff comes from combining stack allocation with unboxing -- if all usages of a compound object are replaced by its constituent values, further optimizations can be applied; the constituents might get stored in registers, or become candidates for value numbering, etc. Sometimes this optimization is referred to as scalar replacement.

I implemented escape analysis in the new compiler I'm working on for Factor. Since I couldn't find a description of the algorithm with the properties I wanted in the literature, I came up with my own and documented it here.

To start with, in Factor, almost everything is done with subroutine calls -- words are very short, and object slot access involves calling another word. And most objects pass through shuffle words on their way to their destination. So escape analysis only makes sense after inlining and type inference, and we also need to treat shuffle words specially, not as other subroutine calls. The latter is already taken care of, however; the compiler has special knowledge of shuffle words so that it can perform non-local optimizations.

A more subtle issue is that of writable slots. At present, my escape analysis only considers tuples where all slots are declared read-only as candidates for unboxing. Writing into slots creates difficulties; aliasing must be considered, and mixing the unboxing of mutable objects together with multi-shot continuations doesn't seem to make sense either (capturing a continuation copies the stack but not the heap; so if we restore a multi-shot continuation several times and read a mutable slot, we can observe whether unboxing took place or not by seeing if restoring the continuation restored its old value). Technically, I could extend it to unbox tuples whose slots are never written into, not just those whose slots are forced to be read-only. I will probably make this change if it avoids complicating the code too much.

Escape analysis is very beneficial in Factor; for example, iteration over a virtual sequence can be rewritten by the compiler to avoid allocating the virtual sequence altogether. It also enables programming idioms such as "cursors": you can represent the iteration state over a collection as an object, and have the object be unboxed at compile time, turning high-level functional code into tight loops in registers. In general, by reducing the cost of allocating temporary objects, better factoring is encouraged and cleaner code can be written.

The idea is basically to calculate which values may be unboxed, and for each such value, we replace its allocation with a no-op, replace each slot selection with a stack shuffle, and "widen" each shuffle that it passes through to permute its slots.

For example, suppose we define a new data type, the academic functional programmer's favorite,
TUPLE: cons { car read-only } { cdr read-only } ;
Now if we have this code,
: foo-0 ( -- a b ) 1 2 cons boa dup car>> swap cdr>> ;

We would expect the compiler to convert it to something like the following:
: foo-0' ( -- a b ) 1 2 2dup drop -rot nip ;

The transformation proceeds as follows:
  • the constructor was removed altogether
  • the dup became 2dup because now the cons is two items on the stack
  • the first slot selection (car>>) became drop since this drops the second slot value while retaining the first one
  • the swap became -rot since one of the inputs was the tuple, which is now two values on the stack
  • the second slot selection (cdr>>) became a nip since this drops the first slot value while retaining the second one

The codegen would then simplify the shuffling to a no-op, and the word would become nothing more than a push of two literal values on the stack.

To formalize the above, we need to come up with a good definition for a value which "escapes". We formulate this in terms of the inputs and outputs of the various nodes in the high-level SSA IR of the Factor compiler. Since this is SSA, each value is defined only once, so we may identify a value with its definition; we can say that a value is the result of a specific operation, such as a memory allocation.

First, every value has an entry in the allocation map. The entry is either f, meaning the value's definition has not been seen yet, t, meaning the value is allocated outside of this word, or a sequence of values, meaning that this value is a tuple allocation whose slots are from this sequence. When the analysis is done, no value has an entry of f, however while the analysis is in progress, we may encounter such cases, and they have to be handled -- basically, an entry of f means "this entry may change to anything else in the future".
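Rendered as Python for concreteness (None, True and tuples standing in for f, t and a sequence of slot values; the node shapes are assumptions of mine):

allocation_map = {}

def record_definition(value, node):
    if node.kind == "#call" and node.word == "<tuple-boa>":
        allocation_map[value] = tuple(node.inputs)   # locally allocated tuple
    else:
        allocation_map[value] = True                 # allocated outside the word

def entry(value):
    # None means "definition not seen yet" -- it may change to anything later.
    return allocation_map.get(value)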

Clearly, a value which is returned from the word in question escapes -- these values are stored as the input to a #return node in the compiler IR. Also, values which are inputs to subroutine calls escape -- these are the #call nodes. One exception is that the slot selection primitive slot, when called with a constant integer argument, does not force its input value to escape; after all, we need to have some way of accessing slots of stack-allocated objects. Calls to this primitive appear after type inference allows slot accessors to be inlined. Finally, values which are inputs to a call of the tuple construction primitive <tuple-boa> are not considered to escape either; we would like to unbox nested (but not circular) structure.

In the following two, the allocation escapes:
: foo-1 ( -- cons ) 1 2 cons boa ;

: foo-2 ( -- ) 1 2 cons boa . ;

Here it doesn't -- after inlining, car>> becomes 3 slot:
: foo-3 ( -- obj ) 1 2 cons boa car>> ;

In the following, neither of the two allocations escape:
: foo-4 ( -- obj ) 1 2 cons boa 3 cons boa car>> car>> ;

Now we run into our first complication. Consider the following code:
: foo-5 ( -- cons ) 1 2 cons boa 3 cons boa car>> ;

The first allocation escapes, the second one does not. Now clearly the result of car>> (3 slot after inlining) escapes by our above criteria; how do we link it with the output of 1 2 cons boa? Well, we can construct an undirected graph where the vertices are values, and two vertices are linked by an edge if a sufficient and necessary condition for one of them to escape is for the other to escape. We can thus look at connected components of this graph instead of individual values -- if any member of the connected component escapes, they all do. I call this graph the "escaping values graph".

So in the above example foo-5, we add an edge between the first input to the second cons boa call, and the output of car>>, and when we encounter the #return node and see that the output of car>> escapes, we will also know that its input escapes; hence the first allocation escapes, but the second one does not.

A number of nodes in the IR are created when shuffle words are converted -- these are #>r, #r>, and #shuffle. They do not make values escape, but they do define new values (this is SSA), and whether those values escape or not depends on whether their inputs escape, and vice versa.

This allows us to catch that the following allocation escapes:
: foo-6 ( -- cons ) 1 2 cons boa >r "Hello world" print r> ;

but the following doesn't:
: foo-7 ( -- obj ) 1 2 cons boa >r "Hello world" print r> cdr>> ;

A related type of node is the #copy node, which arises as a result of other optimizations. We also add edges to the escaping values graph mapping its inputs to its outputs.

This pretty much sums it up for straight-line code without branches or recursion. You build a value graph to track value equivalence, mark certain values as escaping when you encounter call and return nodes, then when you're done, you know that the allocations whose connected components in the graph do not contain escaping values can be safely unboxed.

The situation with conditionals is trickier. Suppose you have a word bar which outputs a cons, but cannot be inlined for whatever reason, and you use the word as follows:
: foo-8 ( ? -- obj ) [ 1 2 cons boa ] [ bar ] if car>> ;

Here, the first allocation does not escape -- the only place it can ever be used is the car>>; but if we replace it with two items on the stack and change the call to car>> into a stack shuffle, then in the event of the second branch being taken, we would get incorrect behavior; here we certainly do want the accessor to be called, and we don't expect bar to return two values on the stack either.

So clearly, in the above, we have to somehow infer that the result of the allocation escapes, even though it is never an input to a call or a return. How do we do this? Well, here is the SSA form for the above, in somewhat stylized, C-like form, with x the top of the stack:
if(x)
    y1 = cons(1,2);
else
    y2 = bar();
y = phi(y1,y2);
z = y.car;
return (z);

The problem here is that the #phi node has two inputs, y1 and y2; the first has a sequence of slot values as its entry in the allocation map, while the second has an entry of t, since it is allocated outside of this word. Therefore, we cannot unbox y1, since that would mess up the #phi node. So we add a check for this; if the inputs to a #phi node have incompatible allocation map entries, either because one of them is a sequence of values and the other is t, or because they are both sequences of values of different lengths, we mark all of the values involved as escaping, ruling out any unboxing. On the other hand, if all entries are compatible, we have to add new #phi nodes to unify the slots. Even if unboxing does not take place for other reasons, these #phi nodes are important to track relations between tuple slots.

For example, consider this program:
: foo-9 ( ? -- obj ) [ 1 2 cons boa ] [ 3 4 cons boa ] if car>> ;

The two allocations are compatible and the compiler should rewrite the above into roughly the following:
: foo-9' ( ? -- obj ) [ 1 ] [ 3 ] if ;

Here is the SSA form of foo-9:
if(x)
{
    a1 = 1;
    b1 = 2;
    y1 = cons(a1,b1);
}
else
{
    a2 = 3;
    b2 = 4;
    y2 = cons(a2,b2);
}
y = phi(y1,y2);
z = y.car;
return (z);

Escape analysis would add the following entries to the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
a2 => t
b2 => t
y2 => {a2,b2}

a3 => t
b3 => t
y => {a3,b3}

The last three entries are added as a result of the #phi node being traversed.

As for the value graph, the slot access would link z with a3 as soon as we encounter the statement z = y.car;.

However, we actually need to add more edges to the value graph than just this one. Consider the following program, almost the same as foo-9 except we return the cons cells on the stack:
: foo-10 ( ? -- cons ) [ 1 2 cons boa ] [ 3 4 cons boa ] if ;

Here, we cannot unbox either value and both allocations escape. However, while we would end up marking the output of the #phi as an escaping value, nothing would be done to the outputs of cons boa, resulting in incorrect code. So what we need to do is link the inputs and outputs of a phi in the escaping values graph, thus if any one of them escapes, they all escape. Here is the SSA form for foo-10:
if(x)
{
    a1 = 1;
    b1 = 2;
    y1 = cons(a1,b1);
}
else
{
    a2 = 3;
    b2 = 4;
    y2 = cons(a2,b2);
}
y = phi(y1,y2);

/* Inserted by escape analysis */
a3 = phi(a1,a2);
b3 = phi(b1,b2);

return (y);

Here is the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
a2 => t
b2 => t
y2 => {a2,b2}
a3 => t
b3 => t
y => {a3,b3}

And here are the edges of the escaping values graph for foo-10:
a3 -- a1
a3 -- a2
b3 -- b1
b3 -- b2
y -- y1
y -- y2

Because y escapes, none of the values in its connected component -- namely, y, y1, and y2 -- are candidates for unboxing.

This approach handles all other corner cases involving branches also:
: foo-11 ( ? -- obj ) [ 1 2 cons boa dup . ] [ 3 4 cons boa ] if car>> ;

Here, the first allocation cannot be unboxed, and this prevents the second allocation from being unboxed since they meet at a #phi node.

Here is another example:
: foo-12 ( ? -- obj ) [ 1 2 cons boa 3 cons boa ] [ bar 4 cons boa ] if car>> car>> ;

What happens here? Well, let's look at the SSA form:
if(x)
{
    a1 = 1;
    b1 = 2;
    y1 = cons(a1,b1);
    c1 = 3;
    z1 = cons(y1,c1);
}
else
{
    y2 = bar();
    c2 = 4;
    z2 = cons(y2,c2);
}

z = phi(z1,z2);

/* Inserted by escape analysis */
y3 = phi(y1,y2);
c3 = phi(c1,c2);

u = z.car;
v = u.car;

return(v);

Here is the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
c1 => t
z1 => {y1,c1}
y2 => t
c2 => t
z2 => {y2,c2}
y3 => t
c3 => t
z => {y3,c3}
u => t
v => t

Here are the edges of the value graph:
y3 -- y1
y3 -- y2
c3 -- c1
c3 -- c2
z -- z1
z -- z2
u -- y3

Note that while y1 does not escape, it is in the same connected component as y2, which does escape, therefore y1 cannot be unboxed. However, z1 and z2 can both be unboxed.

Recursion and loops arising from usage of combinators such as each create #phi nodes where some of the inputs are defined after the #phi node in a post-order traversal of the dominator graph.

For example,
10 [ . ] each

yields the following SSA form:
int x0 = 0;
int x1 = 10;

L: x2 = phi(x0,x3);
if(x2 < x1)
{
    .(x2);
    x3 = x2 + 1;
    goto L;
}

Note that x3 is defined after the #phi node referencing it.

When the escape analysis encounters such a #phi node, some of the inputs still have entries of f in the allocation map. These values are ignored when checking for compatibility; we optimistically assume that they will later be defined in a way which does not force the value to be escaping. After the first iteration of the analysis, all values should now have non-f entries in the allocation map. We repeat the analysis until the allocation map entries of the outputs of #phi nodes stop changing; while it is possible to construct code where this can take any number of iterations, in real examples it should terminate after one or two passes.
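In outline, the iteration looks like this (Python pseudocode, with an assumed analyze_phi that recomputes a single #phi node's allocation map entry):

def analyze_until_stable(phi_nodes, analyze_phi, allocation_map):
    while True:
        changed = False
        for phi in phi_nodes:
            old = allocation_map.get(phi.output)
            analyze_phi(phi)   # inputs whose entries are still f are ignored
            changed |= allocation_map.get(phi.output) != old
        if not changed:
            return             # in practice, after one or two passes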

For example, in the following, the allocations can be unboxed:
: foo-13 ( n -- obj ) 1 2 cons boa swap [ drop 3 4 cons boa ] times car>> ;

We'd expect the compiler to convert the code into the following:
: foo-13' ( n -- obj ) 1 2 rot [ 2drop 3 4 ] times drop ;

Here is the SSA form:
int a1 = 1;
int b1 = 2;
int y1 = cons(a1,b1);

int x0 = 0;

L: x2 = phi(x0,x3);

/* Inserted by escape analysis */
a3 = phi(a1,a2);
b3 = phi(b1,b2);

y2 = phi(y1,y3);
if(x2 < n)
{
    x3 = x2 + 1;
    a2 = 3;
    b2 = 4;
    y3 = cons(a2,b2);
    goto L;
}

z = y2.car;
return(z);

Here is the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
x0 => t
x2 => t
a3 => t
b3 => t
y2 => {a3,b3}
x3 => t
a2 => t
b2 => t
y3 => {a2,b2}
z => t

And the edges of the escaping values graph:
a3 -- a1
a3 -- a2
x2 -- x0
x2 -- x3
b3 -- b1
b3 -- b2
y2 -- y1
y2 -- y3

The only value marked as escaping is z, and its connected component only contains itself, so the two allocations may be unboxed.

On the other hand, if we have
: foo-14 ( n -- obj ) 1 2 cons boa [ 3 cons boa ] times car>> ;

Unboxing cannot take place. We cannot unbox the second allocation, because it builds recursive structure, and we cannot unbox the first because its result is stored inside the second allocation, which is not unboxed. Here is the SSA form:
int a1 = 1;
int b1 = 2;
int y1 = cons(a1,b1);

int x0 = 0;

L: x2 = phi(x0,x3);
y2 = phi(y1,y3);

/* Inserted by escape analysis */
a3 = phi(a1,y2);
b3 = phi(b1,a2);

if(x2 < n)
{
    x3 = x2 + 1;
    a2 = 3;
    y3 = cons(y2,a2);
    goto L;
}

z = y2.car;
return(z);

And the allocation map:
a1 => t
b1 => t
y1 => {a1,b1}
x0 => t
x2 => t
a3 => t
b3 => t
y2 => {a3,b3}
x3 => t
a2 => t
y3 => {y2,a2}
z => t

And the edges of the escaping values graph:
a3 -- a1
a3 -- y2
x2 -- x0
x2 -- x3
b3 -- b1
b3 -- a2
y2 -- y1
y2 -- y3

We see that y2 and y3 are in the same connected component as a3, which has an allocation map entry of t, therefore neither one may be unboxed.

As far as the implementation goes, the above gives a high-level overview. The only tricky part is the implementation of the escaping values graph. I used a disjoint set (union-find); adding an edge adds equivalences, and marking a value as escaping makes it equivalent to a special "escaping" object. Checking if a value is in the same connected component as an escaping value is then just a check to see if the value is equivalent to "escaping". A disjoint set is a slightly exotic data structure which performs this check in what is essentially constant time.
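For illustration, here is a small union-find in Python along those lines; Factor's actual implementation differs, and ESCAPING here plays the role of the special "escaping" object:

ESCAPING = object()
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps chains short
        x = parent[x]
    return x

def union(x, y):
    # Adding an edge x -- y merges the two connected components.
    parent[find(x)] = find(y)

def mark_escaping(x):
    union(x, ESCAPING)

def escapes(x):
    # Essentially constant time: is x in the escaping component?
    return find(x) == find(ESCAPING)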

I implemented this algorithm; for now, you can find it in the unfinished/compiler/tree/escape-analysis directory. The pass which actually performs the unboxing using the information collected during escape analysis is in unfinished/compiler/tree/tuple-unboxing. In comparison to the analysis, the rewrite stage is pretty simple, so I won't describe the details here.

Wednesday, August 06, 2008

Persistent hashtables

Long time no blog! I've been very busy (with Factor), and there has been a lot of activity lately. I've been working flat-out on the new compiler, which I will blog about later.

Yesterday and today, though, I worked on porting Clojure's PersistentHashMap to Factor. We already had persistent vectors; the hashtables are more complex. My implementation is pretty much a direct port of the Java code, and in some parts it uses locals heavily. I plan on revisiting it at some point and using more Factor idioms. Even though the code is not idiomatic, it is still pretty easy to follow, and Factor's direct support for multiple return values eliminated the need for a temporary 'box' object.

My initial tests seem to indicate persistent hashtables are 2-4x slower than normal hashtables for common operations. This isn't so bad, since it's only a constant-factor slowdown, and you gain immutability, but it is still a bit disappointing. Perhaps they can be optimized further.
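For readers unfamiliar with the data structure, here is a toy sketch of the lookup path in Python; this is the general hash array mapped trie idea behind Clojure's PersistentHashMap, not the Factor port itself, and hash-collision nodes are omitted for brevity:

class Node:
    # A bitmap-indexed node: a 32-bit bitmap plus a compact entry array.
    def __init__(self, bitmap=0, entries=()):
        self.bitmap, self.entries = bitmap, tuple(entries)

def lookup(node, key, h=None, shift=0):
    h = hash(key) if h is None else h
    bit = 1 << ((h >> shift) & 0x1f)       # 5 hash bits pick one of 32 slots
    if not (node.bitmap & bit):
        return None                        # slot empty: key is absent
    index = bin(node.bitmap & (bit - 1)).count("1")   # popcount = array index
    entry = node.entries[index]
    if isinstance(entry, Node):
        return lookup(entry, key, h, shift + 5)       # descend one level
    stored_key, value = entry
    return value if stored_key == key else None

Insertion works the same way, except that it copies the nodes along the path from the root, leaving the rest of the tree shared between the old and new versions; that path copying is where the constant-factor overhead over a flat mutable hashtable comes from.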

The reason I wanted to implement persistent hashtables in the first place is to use them in the new Factor compiler. Instead of cloning a large hashtable repeatedly in one place, which I suspected was leading to quadratic algorithmic complexity, I wanted to use persistent hashtables so that I could add entries in constant time while still retaining older versions for backtracking.

It turns out I found another way to get the performance gain I wanted, and so persistent hashtables are unused in the new compiler for now.

However, they were fun to implement, and they're now part of the basis library, an intermediate layer between core and extra, in the persistent.hashtables vocabulary. They sit alongside persistent.vectors and persistent.heaps, the latter of which was implemented by Daniel Ehrenberg.

The new basis vocabulary root will contain libraries which are not fundamentally part of the language core, and are not required for bootstrap, but which anyone would expect a language implementation to provide. Examples include various data structures, I/O, XML parsing, database access, the GUI toolkit, the web framework, and so on. I also moved the optimizing compiler to basis, since it's not required for bootstrap either.

In order for a vocabulary to be moved to basis, it must implement some functionality that's frequently needed, and have high quality documentation, a robust and optimized implementation, and as many unit tests as possible. Over the next few months, I hope that more vocabularies are moved to basis and that the code quality there only goes up.