Monday, November 03, 2008

New low-level optimizer and code generator

I have pushed the new low-level optimizer and code generator. This is the second stage of a big compiler overhaul; the first stage was the new front-end and high-level optimizer. I first started working on the new code generator in July, but after spending a couple of weeks on it, put it aside because I first wanted to replace the front-end and high-level optimizer. I continued working on the code generator in early October.

The new low-level optimizer and code generator replaces the old one-pass code generator. While the old design was relatively simple and generated decent code, it was hard to extend and its limitations were really starting to show, especially on floating point code.

The new design is a complete departure from the old way of doing things. First of all, there is a low-level intermediate representation, consisting of a control flow graph of basic blocks. Each basic block is a list of instructions. Instructions operate on virtual registers, of which there is an unlimited supply, and each virtual register is only assigned once. This invariant, known as SSA, simplifies analysis and code generation.

Building low-level IR


The low-level optimizer receives high-level compiler IR as input. The first step is to convert it to low-level IR. This is done by the compiler.cfg.builder vocabulary. There are two primary differences between high-level IR and low-level IR:
  • High-level IR represents control structures as nested trees. This explains the compiler.tree vocabulary name. So a piece of code like [ [ 1 + ] [ 1 - ] if 2 * ] becomes a list of three nodes, an #if, a #push, and a #call, where the #if has two more nodes. Low-level IR models the control flow graph directly, so a conditional becomes a node with two successors, where both successors have a common successor. Loops create actual loops in the graph.
  • High-level IR is stack-based; stripping out value names gives a valid stack program. Low-level IR makes all stack operations explicit; for example, a call to a primitive such as eq? is a single node in high-level IR, whereas in low-level IR it expands into instructions to read the top two stack items, an instruction to compare the two values in registers, and finally, instructions to store the boolean back on the stack:
    ##peek V int-regs 7 D 1 
    ##peek V int-regs 8 D 0
    ##inc-d -2
    ##compare V int-regs 9 V int-regs 7 V int-regs 8 cc=
    ##inc-d 1
    ##replace V int-regs 9 D 0

  • By breaking down higher-level stack operations into lower-level register operations, redundancy is revealed, and subsequent optimization passes will eliminate it.

The high-level IR to low-level IR conversion is very naive. Other than tail-call optimization, it makes no attempt to eliminate any redundancy in the resulting IR. For example, floating point primitives expand into stack loads, followed by unboxing, followed by the operation itself, and finally a boxing operation. Redundant boxing and unboxing of intermediate values is hidden in high-level IR but becomes explicit in low-level IR.

Useless block elimination


Once the CFG has been built, optimization begins. There are a total of 8 optimization passes; I won't talk about all of them, only the interesting ones. The first optimization pass, compiler.cfg.useless-blocks, deletes basic blocks which consist solely of a jump to their successor, as well as conditional blocks where both branches jump to the same basic block. The former case occurs because the low level IR builder does not perform any optimization. The latter case occurs after some code is optimized with the high-level optimizer. For example, suppose you're using contains? to stop iterating early, and don't actually care for the result.
( scratchpad ) [ [ ] contains? drop ] optimized.
[
dup length swap 0 -rot \ ( gensym ) [
pick pick < [
swap
>r 2dup
>r
>r nth-unsafe r>
r>
r>
swap
>r rot r>
swap
[ 2drop ]
[ >r >r 1 +-integer-fixnum r> r> ( gensym ) ] if
] [ 3drop f ] if
] label dup [ ] [ ] if [ ] [ ] if
]

Note that the two empty conditionals at the end came up because of inlining and dead code elimination. The high level optimizer's dead code elimination doesn't get rid of the empty conditionals, but the low level optimizer does.

The old code generator did not delete empty conditionals.

Height normalization


After all stack operations in a basic block are expanded into low level IR, there might be a lot of redundant changes to the stack height. The compiler.cfg.height pass takes care of these. For example, >r rot r> translates as follows,
##peek V int-regs 63 D 0 
##inc-d -1
##inc-r 1
##replace V int-regs 63 R 0
##peek V int-regs 64 D 2
##peek V int-regs 65 D 1
##peek V int-regs 66 D 0
##inc-d -3
##inc-d 3
##replace V int-regs 64 D 0
##replace V int-regs 66 D 1
##replace V int-regs 65 D 2
##peek V int-regs 67 R 0
##inc-r -1
##inc-d 1
##replace V int-regs 67 D 0

After height normalization,
##peek V int-regs 68 D 0 
##replace V int-regs 68 R -1
##peek V int-regs 69 D 3
##peek V int-regs 70 D 2
##peek V int-regs 71 D 1
##replace V int-regs 69 D 1
##replace V int-regs 71 D 2
##replace V int-regs 70 D 3
##peek V int-regs 72 R -1
##replace V int-regs 72 D 0

After the height normalization pass runs, the only height changes that remain are at the start of basic blocks, and all ##peek and ##replace instructions can be viewed as reads and writes of distinguished arrays.

The old code generator performed the same optimization, except stack height changes were finalized at the end of a basic block, not at the start. This makes no difference in practice.

Alias analysis


I won't attempt to give a description of alias analysis here, I'll only explain how it applies to Factor. Alias analysis is implemented in compiler.cfg.alias-analysis. The low level IR builder expands stack operations naively, into memory loads/stores. Once stack height changes are coalesced at the start of a basic block, all stack reads and writes look just like array accesses with constant indexes. In addition to this, we of course also have reads and writes of real heap-allocated arrays and tuples. Alias analysis eliminates redundant memory operations by identifying the following patterns:
  • If the same location is read twice with no writes in between, the second read is replaced with a copy.
  • If a location is read after being written, the read is replaced with a copy of the value that was written there.
  • If a location is written to twice with no read in between, the first write is eliminated.

After alias analysis runs, each stack location is only read or written at most once, and writes follow reads, per basic block. For array slots this is more complicated. Because two references may in general point to the same array, some optimizations are invalid (for example, if we read from one array, write to another, then read from the first array, the latter read cannot be eliminated because the result depends on whether the two arrays are equal or not). However, a freshly allocated array or tuple is known to be different from any other object, so more aggressive optimizations can be performed in this case. One situation where alias analysis really helps is with inline allocation of arrays. Consider the 2array word. After type inference and inlining, the high-level IR looks like so:
( scratchpad ) \ 2array optimized.
[ 2 f <array> tuck >r r> 3 set-slot tuck >r r> 2 set-slot ]

The semantics of the <array> word stipulate that the new array be filled with the initial element, in this case f. However, the array elements are immediately overwritten with stack items. Instead of making 2array and similar words into primitives, or adding a new unsafe constructor for an array filled with garbage, the low level IR builder expands <array> into several instructions: first a memory allocation, then an initial value fill. This makes the redundancy explicit. We get the following before alias analysis:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 97 16
##replace V int-regs 97 D -2
##load-immediate V int-regs 98 7
##replace V int-regs 98 D -3
##peek V int-regs 99 D -3
##allot V int-regs 100 16 array V int-regs 101
##load-immediate V int-regs 102 16
##set-slot-imm V int-regs 102 V int-regs 100 1 3
##set-slot-imm V int-regs 99 V int-regs 100 2 3
##set-slot-imm V int-regs 99 V int-regs 100 3 3
##replace V int-regs 100 D -2
##peek V int-regs 103 D -1
##peek V int-regs 104 D -2
##replace V int-regs 104 D -3
##replace V int-regs 103 D -2
##replace V int-regs 104 D -1
##peek V int-regs 105 D -3
##replace V int-regs 105 R -1
##peek V int-regs 106 R -1
##replace V int-regs 106 D -3
##load-immediate V int-regs 107 24
##replace V int-regs 107 D -4
##peek V int-regs 108 D -2
##peek V int-regs 109 D -3
##set-slot-imm V int-regs 108 V int-regs 109 3 3
##write-barrier V int-regs 109 V int-regs 110 V int-regs 111
##peek V int-regs 112 D 0
##peek V int-regs 113 D -1
##replace V int-regs 113 D -2
##replace V int-regs 112 D -1
##replace V int-regs 113 D 0
##peek V int-regs 114 D -2
##replace V int-regs 114 R -1
##peek V int-regs 115 R -1
##replace V int-regs 115 D -2
##load-immediate V int-regs 116 16
##replace V int-regs 116 D -3
##peek V int-regs 117 D -1
##peek V int-regs 118 D -2
##set-slot-imm V int-regs 117 V int-regs 118 2 3
##write-barrier V int-regs 118 V int-regs 119 V int-regs 120
##epilogue
##return

Note the large number of ##peek and ##replace instructions, most of which is redundant stack traffic, as well as the multiple calls to ##set-slot-imm, some of which are redundant. After alias analysis eliminates the unnecessary stack and array operations, we end up with:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 121 16
##load-immediate V int-regs 122 7
##copy V int-regs 123 V int-regs 122
##allot V int-regs 124 16 array V int-regs 125
##load-immediate V int-regs 126 16
##set-slot-imm V int-regs 126 V int-regs 124 1 3
##peek V int-regs 127 D -1
##copy V int-regs 128 V int-regs 124
##copy V int-regs 129 V int-regs 124
##copy V int-regs 130 V int-regs 124
##load-immediate V int-regs 131 24
##copy V int-regs 132 V int-regs 127
##copy V int-regs 133 V int-regs 124
##set-slot-imm V int-regs 132 V int-regs 133 3 3
##write-barrier V int-regs 133 V int-regs 134 V int-regs 135
##peek V int-regs 136 D 0
##copy V int-regs 137 V int-regs 124
##replace V int-regs 137 D 0
##copy V int-regs 138 V int-regs 124
##copy V int-regs 139 V int-regs 124
##load-immediate V int-regs 140 16
##copy V int-regs 141 V int-regs 136
##copy V int-regs 142 V int-regs 124
##set-slot-imm V int-regs 141 V int-regs 142 2 3
##write-barrier V int-regs 142 V int-regs 143 V int-regs 144
##epilogue
##return

Note that the redundant memory operations were eliminated, but new redundancies have appeared in the form of ##copy instructions. The next pass eliminates these.

Value numbering


Value numbering is implemented in compiler.cfg.value-numbering. The value numbering pass identifies sets of virtual registers which always contain the same value. In each congruence class, the first virtual register computing the value is known as the leader. References to virtual registers other than the leader are replaced with the leader in subsequent instructions.

To figure out which virtual registers always contain the same value, value numbering makes a pass over instructions and builds a value graph. This is a directed acyclic graph where the vertices are called expressions. Every instruction is considered in turn. The instruction's input virtual registers are replaced with the leaders in their value class. Next, an expression is built from the instruction. The expression is simplified using algebraic identities. If the expression does not have an associated value number, the instruction's output register becomes the leader of a new value class.

Copy propagation falls out as a special case of value numbering. Because the destination of a copy always has the same value as the source, subsequent references to the destination are replaced with the source register.

Floating point unboxing is also a special case. Consider the following low level IR, generated by the Factor code [ float+ 3.0 float* ]:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##peek V int-regs 160 D 0
##peek V int-regs 161 D -1
##unbox-float V double-float-regs 162 V int-regs 160
##unbox-float V double-float-regs 163 V int-regs 161
##add-float V double-float-regs 164 V double-float-regs 162 V double-float-regs 163
##box-float V int-regs 165 V double-float-regs 164 V int-regs 166
##replace V int-regs 165 D 0
##load-indirect V int-regs 167 3.0
##replace V int-regs 167 D -1
##peek V int-regs 168 D 0
##peek V int-regs 169 D -1
##unbox-float V double-float-regs 170 V int-regs 168
##unbox-float V double-float-regs 171 V int-regs 169
##mul-float V double-float-regs 172 V double-float-regs 170 V double-float-regs 171
##box-float V int-regs 173 V double-float-regs 172 V int-regs 174
##replace V int-regs 173 D 0
##epilogue
##return

The result of float+ is boxed, written to the stack, then loaded from the stack and unboxed. Clearly, this is inefficient.

After alias analysis, the redundant stack store/load is eliminated, but there is still a useless boxing and unboxing, and some useless copies:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##peek V int-regs 190 D 0
##peek V int-regs 191 D -1
##unbox-float V double-float-regs 192 V int-regs 190
##unbox-float V double-float-regs 193 V int-regs 191
##add-float V double-float-regs 194 V double-float-regs 192 V double-float-regs 193
##box-float V int-regs 195 V double-float-regs 194 V int-regs 196
##load-indirect V int-regs 197 3.0
##copy V int-regs 198 V int-regs 195
##copy V int-regs 199 V int-regs 197
##unbox-float V double-float-regs 200 V int-regs 198
##unbox-float V double-float-regs 201 V int-regs 199
##mul-float V double-float-regs 202 V double-float-regs 200 V double-float-regs 201
##box-float V int-regs 203 V double-float-regs 202 V int-regs 204
##replace V int-regs 203 D 0
##epilogue
##return

Note that register 194 is boxed into register 195, which is then copied into register 198; finally, register 198 is unboxed into register 200. The value graph contains this path. The expression simplifier then realizes that register 200 always contains the same value as register 194, and thus subsequent uses of register 200 are replaced with register 194.

In addition to eliminating unnecessary boxing and unboxing of floating point values, value numbering also optimizes conditionals. For example, [ 100 [ ] times ] generates the following high-level IR:
( scratchpad ) [ 100 [ ] times ] optimized.
[
100 0 swap \ ( gensym ) [
over over fixnum<
[ >r >r r> r> >r 1 fixnum+fast r> ( gensym ) ]
[ 2drop ] if
] label
]

Implemented naively, fixnum< would compare two values and push a boolean, then if would check if the boolean is true or false. The redundancy here stems from the fact that on a real CPU, you can compare two values and then branch if the result of the comparison is a less-than; there is no need to make a boolean (which itself involves a conditional). The expression simplification machinery used by value numbering converts calls of if where the boolean was produced by a known primitive into more efficient code.

Another neat simplification concerns type checks. The dispatch code generated for generic words often contains code like the following,
[ dup tag 2 eq? [ ... ] [ ... ] if ]

The tag primitive outputs the low 3 bits of a pointer, as a fixnum. Tag #2 is used for tuple objects. Here is the low level IR generated for a call to tag:
##peek V int-regs 208 D 0 
##and-imm V int-regs 209 V int-regs 208 7
##shl-imm V int-regs 210 V int-regs 209 3
##replace V int-regs 210 D 0

Note that the low 3 bits of the pointer are masked, then the result is shifted to the left to make it into a fixnum. Of course, if the result of the shift is being compared against a literal fixnum, the shift is redundant; instead, we can shift the literal to the right by 3. Indeed, after value numbering, tag 2 eq? generates this low-level IR:
##peek V int-regs 211 D 0 
##and-imm V int-regs 212 V int-regs 211 7
##shl-imm V int-regs 213 V int-regs 212 3
##load-immediate V int-regs 214 16
##copy V int-regs 215 V int-regs 213
##compare-imm V int-regs 216 V int-regs 213 16 cc=
##replace V int-regs 216 D 0

Note that in the above, the ##shl-imm instruction still remains even though its result is not used, and there is a ##load-immediate whose result is not used either. Just like alias analysis inserts redundant copy instructions, value numbering leaves behind instructions which compute values that are never used; if an instruction computes a value that is already available in another register, subsequent uses of this instruction's output register are replaced, but the instruction still remains. The next pass eliminates such useless instructions.

The old code generator performed some optimizations that value numbering can do, such as float unboxing and efficient generation of integer and float conditionals, but it was not as general and there were many redundancies it did not catch.

Dead code elimination


Dead code elimination is an essential part of any optimizer. It is implemented in compiler.cfg.dead-code. Note that the high-level optimizer performs dead code elimination as well, in compiler.tree.dead-code. The rationale is that many optimizations are easier to implement if they can leave behind code which computes values that are never used, instead of trying to eliminate all redundancy in one pass. In the low-level optimizer, both alias analysis and value numbering can leave behind many instructions whose results are no longer used. Dead code elimination builds a graph of def-use chains, then takes the transitive closure starting from the registers used by instructions having side effects observable outside of the basic block, such as ##set-slot and ##replace; any instructions whose destination registers are not part of this closure can be eliminated, since their values are never used. This pass also eliminates ##replace instructions which store a value to the same stack location from which it was read. This can arise if inlining produces an idempotent stack shuffle such as swap swap, or if value numbering reduces an operation to a no-op.

The old code generation took a different approach: since it generated code in one pass, it had to take care not to emit dead code in the first place. This complicated things; for example, constants were not loaded into registers until absolutely necessary, and complicated state had to be maintained to ensure that no stack location was read or written redundantly.

Write barrier elimination


The compiler.cfg.write-barrier pass attempts to speed up slot assignment. Because Factor uses a generational garbage collector, every store into an object slot must potentially mark the card holding the object. However, there are three cases when the write barrier can be omitted:
  • If the slot value is immediate -- that is, a fixnum or f
  • If the object was allocated in the same basic block; new objects go in the nursery, so the object cannot be older than the slot value
  • If the object has already been stored to in this basic block; there is no need to generate more than one write barrier since the GC can only run at the beginning of a basic block

The first case is handled by the low-level IR builder; it uses type information from the high-level IR to omit the write barrier in this case. The latter two cases are handled by a simple analysis that identifies fresh allocations and objects which are written to more than once, deleting ##write-barrier instructions which are seen to be redundant.

Here is the result of applying all the optimizations described so far to the 2array word:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 566340 16
##allot V int-regs 566343 32 array V int-regs 566344
##set-slot-imm V int-regs 566340 V int-regs 566343 1 3
##peek V int-regs 566346 D -1
##set-slot-imm V int-regs 566346 V int-regs 566343 3 3
##peek V int-regs 566355 D 0
##replace V int-regs 566343 D 0
##set-slot-imm V int-regs 566355 V int-regs 566343 2 3
##epilogue
##return

This is as tight as it gets. There are no redundant memory operations at all, and at this stage recoding 2array as a primitive in the VM would offer no benefit at all.

Write barrier elimination is the last pass of the CFG optimizer. The old code generator performed similar optimizations for storing immediate values into slots, and for storing into freshly allocated objects. It did not eliminate the write barrier in the third case, however.

After this point, the control flow graph is no longer needed, and the CFG is converted into another form called the machine representation. This form consists of a flat list of instructions, with branches and labels instead of control flow graph edges. Otherwise, the instructions are the same as in the CFG representation.

Machine representation


The machine representation builder, or linearizer, in compiler.cfg.linearization, traverses the CFG in reverse post order, generating labels and jumps as appropriate. The linearizer inserts GC checks in basic blocks which allocate memory. This is done as after value numbering, since value numbering eliminates many allocations. The linearizer also performs a simple optimization concerning how conditionals are laid out in memory. Sometimes, reversing a conditional results in one less jump. For example, consider the following code: [ [ 1 + ] unless 3 * ]. Without any type information, the old code generator would emit the following assembly:
0x0000000108e577e0: mov    $0x108e577e0,%rbx
0x0000000108e577ea: pushq $0x20
0x0000000108e577ef: push %rbx
0x0000000108e577f0: sub $0x8,%rsp
0x0000000108e577f4: mov (%r14),%rax
0x0000000108e577f7: cmp $0x7,%rax
0x0000000108e577fb: je 0x108e5780a
0x0000000108e57801: sub $0x8,%r14
0x0000000108e57805: jmpq 0x108e5781c
0x0000000108e5780a: mov $0x8,%rbx
0x0000000108e57814: mov %rbx,(%r14)
0x0000000108e57817: callq 0x10894fc20
0x0000000108e5781c: mov $0x18,%rbx
0x0000000108e57826: mov %rbx,0x8(%r14)
0x0000000108e5782a: add $0x8,%r14
0x0000000108e5782e: add $0x18,%rsp
0x0000000108e57832: jmpq 0x1089663a0

Note that if the comparison fails, the je branch falls through, and then we immediately jump again. Instead, the comparison can be reversed, eliminating the double jump. The new code generator produces the following assembly for the above snippet:
0x000000010980a0e0: mov    $0x10980a0e0,%rax
0x000000010980a0ea: pushq $0x20
0x000000010980a0ef: push %rax
0x000000010980a0f0: sub $0x8,%rsp
0x000000010980a0f4: sub $0x8,%r14
0x000000010980a0f8: mov 0x8(%r14),%rax
0x000000010980a0fc: cmp $0x7,%rax
0x000000010980a100: jne 0x10980a11c
0x000000010980a106: add $0x8,%r14
0x000000010980a10a: mov $0x8,%rax
0x000000010980a114: mov %rax,(%r14)
0x000000010980a117: callq 0x1092e9560
0x000000010980a11c: add $0x8,%r14
0x000000010980a120: mov $0x18,%rax
0x000000010980a12a: mov %rax,(%r14)
0x000000010980a12d: add $0x18,%rsp
0x000000010980a131: jmpq 0x1092ed700

Note that the conditional was reversed into a jne, and now there is no control flow redundancy.

Two-operand conversion


Up until this point, the CFG and machine representations are in SSA (single static assignment) form; each register is only defined once. Instructions are in "3-address" form; an addition is conceptually similar to this C code,
int x = y + z;

This is a problem for x86, where all arithmetic instructions store the result in the first input; x86 is a "2-address" architecture:
int y = y + z;

To be able to generate code for the x86, the compiler.cfg.two-operand pass rewrites
int x = y + z;

into
int x = y; x = x + z;

This inserts ##copy instructions into the IR, and the resulting IR is no longer in SSA form. After this pass runs, there are many new copy instructions in the IR. Register allocation eliminates redundant copies (in the above example, the assignment of y to x would be redundant if x is never used again; in this case we could just destructively modify y.) On PowerPC, this pass is not used, since it is a 3-address architecture.

The old code generator uses a complicated system of "intrinsics" and "templates" to select instructions. There was no low level IR, hence no SSA, instead various ad-hoc heuristics were used to ensure that registers did not get clobbered. This was error-prone, and in the more complex cases, the codegen would give up and spill all registers to the stack before reloading them again. The new approach is much better.

Register allocation


The compiler.cfg.linear-scan vocabulary implements a register allocator. This uses the second-chance binpacking variation of linear scan, amended with a simple form of coalescing to eliminate redundant copies inserted by the two operand conversion pass. Linear scan register allocation is very clever and I will not attempt to describe it here. Two good references are:

Note that the original formulation of linear scan register allocation required scratch registers to be reserved for reloading spilled values, and these scratch registers could not be used for anything else. This is unacceptable on x86, which has a serious shortage of registers. The binpacking variant of linear scan uses live interval splitting to ensure that there is always a register available for reloading spill slots.

Linear scan seems to work quite well in practice, with very little spilling, even on x86.

The old code generator used a very ad-hoc register assignment scheme (I don't even want to call it a "register allocator"). Basically, it would spill all values to the stack (and box them, if they were floats!) as soon as it ran out of registers. So for example, if on x86 it ran out of integer registers but not floating point registers, and you had some unboxed floats in the floating point registers, they would get boxed anyway, leading to terrible performance. To make matters worse, one scratch register had to be reserved because of the on-the-fly nature of the code generator. The new register allocator does not suffer from any of these problems. If you run out of registers, only a minimum number of values will be spilled, spilling floats never boxes them, and there is one additional register available for temporaries on x86.

It's pretty embarrassing that until now, Factor's compiler did not have a real register allocator. We got pretty far with the ad-hoc approach but it was time to do things correctly.

Code generation


After register allocation, machine code is generated using CPU-specific hooks. Since low-level IR instructions are a lot more fine-grained than Factor primitives, there is a corresponding reduction in the complexity of a CPU backend. Despite offering improved performance, the new code generator actually requires less CPU-specific code to be written. Other than that, there isn't much new here, except that jump tables used by generic dispatch, and loop entry blocks, are aligned on the correct boundaries for x86.

Some performance numbers


Soon I'm going to do a comprehensive performance review, comparing Factor's progress over the last year. In the meantime, here are some quick performance measurements of the latest code against the Mac OS X binary from the 21st of October. The computer in question is a Mac Book Pro with a 2.4 GHz Intel Core 2 Duo. For each benchmark, I performed 10 trials, with a full GC before each trial. Times are in milliseconds.
BenchmarkOld codegenNew codegen
dawes176 179 175 175 174 175 175 178 174 175145 149 147 148 147 147 148 148 150 148
mandel299 298 300 299 297 297 297 298 296 300157 156 155 156 156 155 154 155 157 154
nsieve996 1003 997 1009 1000 1007 1001 1008 1003 1001996 954 956 966 961 966 961 964 962 958
nsieve-bits2398 2403 2397 2412 2415 2410 2403 2397 2405 24241840 1838 1835 1840 1833 1833 1839 1847 1829 1833
nsieve-bytes353 355 347 345 354 356 355 354 350 355319 327 323 321 318 323 324 322 319 324
partial-sums3217 3209 3232 3209 3227 3237 3277 3231 3210 32092863 2908 2881 2880 2876 2871 2875 2861 2871 2868
raytracer6536 6559 6538 6550 6545 6554 6533 6556 6508 65814918 4918 4936 4906 4907 4941 4935 4931 4929 4898
reverse-complement9376 9229 9252 9247 9483 9224 9280 9266 9559 93479304 9024 9242 9163 9279 9182 9177 9024 9218 8990
recursive8416 8398 8532 8397 8599 8728 8644 8675 8554 88769074 8880 8898 8922 8821 8770 8824 8875 8874 8801
spectral-norm71282 71251 71224 71172 71184 71179 71181 71418 71268 7118329585 30331 29927 29873 29644 29829 29905 29864 29954 29897

Amazing speedups on mandel and spectral-norm! The mandel benchmark is now more than 10 times faster than it was before I rewrote the high-level optimizer, and raytracer is almost 3 times faster than it was a month ago.

Some observations


When I first started writing the new code generator, it was a completely distinct body of code from the old one. I got as far as implementing alias analysis and value numbering, and then I decided this was too drastic. Instead took a different approach, where I incrementally refactored the old code generator and merged in the new pieces. I got the old code generator to produce low level IR instead of machine code; it would still perform all optimization in one pass. Then I got rid of he ad-hoc register allocation that it performs, and implemented linear scan. The next step was to refactor it to produce SSA code; after that, I merged in the optimization passes I had already implemented earlier. The result was that I was able to replace the entire code generator, while testing it incrementally along the way. I managed to avoid a bootstrap time regression, despite the fact that the new code generator does a lot more work than the old one, by ensuring that the code gen "pays for itself" by generating better code, and by optimizing some other things.

What's next


As you can see from the benchmarks, performance has improved, especially on floating point code. There is one regression on the "recursive" benchmark, the PowerPC backend is not done, and there are a couple of bugs. Those should all be easy to fix. Longer-term, the next thing I will work on is global optimization. The new code generator, like the old one, only optimizes in basic blocks. This means that a floating point loop that keeps floats on the stack will box them on every iteration, for example; and in general, loops load and store on the stack. Solving this involves making alias analysis and value numbering smarter, and replacing linear scan register allocation with something more advanced, such as chordal graph coloring.

There are other things I want to do as far as performance goes, other than working on the compiler itself. I'm going to investigate inline caching to speed up generic word call sites which are monomorphic, but the compiler is unable to infer this statically. I will also look into various techniques for speeding up polymorphic dispatch. Finally, I will be optimizing the garbage collector at some point.

No comments: