Consider a word like
+. If compile-time type information is not available, this word has to look at the types of the top two stack items, and take appropriate action depending on whether they are fixnums, bignums, ratios, floats, or complex numbers. Until now, this dispatch was implemented by two indirect jumps. The type of the first object is used to key a dispatch table, where each entry then looks at the type of the second object and keys into a second-level dispatch table; the entries in the latter are the actual methods implementing the arithmetic operation.In Factor, the generic word system generates Factor code for performing method dispatch. If you look at the definition of a word such as
+ in an older build, you'll see this two-level jump table pattern, implemented with Factor's dispatch combinator:( scratchpad ) \ + def>> .
[
over tag {
[
dup tag {
[ fixnum=>+ ]
[ [ >bignum ] dip bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >bignum bignum=>+ ]
[ bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ ratio=>+ ]
[ ratio=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >float float=>+ ]
[ >float float=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ >float float=>+ ]
[ float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ complex=>+ ]
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
} dispatch
]
The
dispatch combinator compiles down to an indirect jump, and indirect jumps are expensive on modern CPUs. Also, note that this dispatch scheme did not favor any single numeric type over another. However, in practice, operations on bignums and ratios will always be slow, and dispatch overhead is neglegible there; whereas code that works with floating point numbers and complex numbers can only be fast if the compiler can infer types and unbox values statically, in which case there is no runtime dispatch at all. So performance of runtime arithmetic dispatch on types other than fixnums is largely irrelevant in Factor. Indeed, the overwhelming majority of calls to generic arithmetic operations involve fixnums, so some kind of trick to avoid two indirect jumps in this common case is needed.There are many solutions, but the one I settled on is pretty much the simplest one. In Factor, all values are tagged pointers (except for unboxed floats and such, which the compiler uses, but they cannot be "observed" in their unboxed state from the outside). A tagged pointer is a value where the low 3 bits encode a type, and the remaining bits encode a value. If the type tag is 0, the value is interpreted as an integer; otherwise, it is a pointer. All values in the heap are aligned on 8 byte boundaries, so if we mask off the 3 tag bits, we get a valid pointer.
So if we have two values, stored in registers
x and y, then both have a tag of 0 if their inclusive bitwise or has a tag of zero. That is, we can check if both are fixnums with a single conditional.Here is the new Factor code generated for the dispatch performed by
+:( scratchpad ) \ + def>> .
[
both-fixnums?
[ { fixnum fixnum } declare fixnum=>+ ] [
over tag {
[
dup tag {
[ { fixnum fixnum } declare fixnum=>+ ]
[
{ fixnum bignum } declare
[ >bignum ] dip bignum=>+
]
[
{ fixnum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { fixnum ratio } declare ratio=>+ ]
[
{ fixnum float } declare [ >float ] dip
float=>+
]
[ { fixnum complex } declare complex=>+ ]
[
{ fixnum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ bignum fixnum } declare
>bignum bignum=>+
]
[ { bignum bignum } declare bignum=>+ ]
[
{ bignum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { bignum ratio } declare ratio=>+ ]
[
{ bignum float } declare [ >float ] dip
float=>+
]
[ { bignum complex } declare complex=>+ ]
[
{ bignum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ { ratio fixnum } declare ratio=>+ ]
[ { ratio bignum } declare ratio=>+ ]
[
{ ratio tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { ratio ratio } declare ratio=>+ ]
[
{ ratio float } declare [ >float ] dip
float=>+
]
[ { ratio complex } declare complex=>+ ]
[
{ ratio POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ float fixnum } declare
>float float=>+
]
[
{ float bignum } declare
>float float=>+
]
[
{ float tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { float ratio } declare >float float=>+ ]
[ { float float } declare float=>+ ]
[ { float complex } declare complex=>+ ]
[
{ float POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[ { complex fixnum } declare complex=>+ ]
[ { complex bignum } declare complex=>+ ]
[
{ complex tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { complex ratio } declare complex=>+ ]
[ { complex float } declare complex=>+ ]
[ { complex complex } declare complex=>+ ]
[
{ complex POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
} dispatch
] if
]
Notice that it is essentially the same as the old version, except the two-level jump table has been wrapped in a conditional. If
both-fixnums? outputs true, we go directly to the fixnum case; otherwise we do the two-level dispatch.Since
both-fixnums? needs to bitwise OR pointers together, I had no choice but to make it a VM primitive. The non-optimizing compiler inlines a fixed code sequence for calls of both-fixnums?, and the optimizing compiler expands it into a series of low-level IR operations:( scratchpad ) [ both-fixnums? [ "Fixnums!" ] [ "At least one is not a fixnum" ] if ] test-mr mr.
=== word: ( gensym ), label: ( gensym )
_label 0
##prologue
_label 1
##peek V int-regs 692262 D 0
##peek V int-regs 692263 D 1
##copy V int-regs 692264 V int-regs 692262
##or V int-regs 692264 V int-regs 692264 V int-regs 692263
##copy V int-regs 692265 V int-regs 692264
##and-imm V int-regs 692265 V int-regs 692265 7
_compare-imm-branch 3 V int-regs 692265 0 cc/=
_label 2
##inc-d 1
##load-indirect V int-regs 692269 "Fixnums!"
##replace V int-regs 692269 D 0
##epilogue
##return
_label 3
##inc-d 1
##load-indirect V int-regs 692270 "At least one is not a fixnum"
##replace V int-regs 692270 D 0
_label 4
##epilogue
##return
So the machine code generated for a generic arithmetic word such as
+ now begins as follows:mov (%r14),%rax
mov -0x8(%r14),%rcx
or %rcx,%rax
and $0x7,%rax
cmp $0x0,%rax
jne slow_dispatch
... handle fixnum+fixnum here ...
ret
slow_dispatch:
... do table dispatch here ...
We load two values from the data stack, or them together, mask off the low 3 bits, compare with zero, and jump. Indeed, those readers who know the ways of x86 assembly will recognize that there is a shorter code sequence which can accomplish this same task:
mov (%r14),%rax
mov -0x8(%r14),%rcx
or %rcx,%rax
test $0x7,%rax
jnz slow_dispatch
... handle fixnum+fixnum here ...
ret
slow_dispatch:
... do table dispatch here ...
Once I figure out a nice way to incorporate the
test instruction into my code generator and value numbering passes, I will switch things up and eliminate the unnecessary instruction there.This takes care of making dispatch faster. The next challenge was speeding up integer overflow handling. If the optimizing compiler's high-level optimizer can prove that a call to
+ has fixnum inputs, and in addition, the result will not overflow -- either because the ranges of the input values are constrained, or the output is passed to bitand -- it will replace it with a call to the fixnum+fast primitive. The low-level optimizer turns a call to fixnum+fast into a single low-level IR instruction, and this becomes a single machine instruction. Very efficient.However, if the high-level optimizer cannot prove that the result will not overflow, but it does know the inputs are fixnums, it replaces the call to
+ with a call to fixnum+. This still avoids dispatch overhead, however, until now, the low-level optimizer did not do anything special for fixnum+; it would compile it as a subroutine call of a VM function, which added two integers, checking for overflow, in C. This was slow, because C does not have a way to test the overflow flag of the CPU.I made the low-level optimizer aware of the
fixnum+ word; it converts it into a low-level instruction which can then participate in value numbering and register allocation. The CPU-specific backend is then responsible for generating an optimal machine code sequence for fixnum+. On x86 and PowerPC, this can use the overflow flag in the CPU, which is a more efficient than trying to simulate the same in C. If the overflow flag is set after the addition, fixnum+ performs a subroutine call into the VM, however there is no call and the arithmetic is done inline in the common case, where there is no overflow.Here is an example of machine code emitted for
fixnum+:sub $0x8,%r14
mov (%r14),%rax
mov 0x8(%r14),%rcx
add %rcx,%rax
jno no_overflow
... handle overflow ...
no_overflow:
mov %rax,(%r14)
retq
I omitted the lengthy setup for the subroutine call into the VM, and register assignments will vary depending on the placement of
fixnum+ in the code.The results were very interesting. Some benchmarks, such as spectral-norm and mandelbrot, showed no improvement, because the compiler eliminates all the dispatch and overflow checks from the inner loops anyway. Other benchmarks showed an impressive gain. This is good because it means that I didn't waste my time with implementing the above, but it is also bad because it shows that the high-level optimizer could probably do a better job of inferring types in some of these benchmarks!
Here are the results; I'm running each benchmark 5 times and measuring the runtime in milliseconds.
| Benchmark | Before | After | 
|---|---|---|
| fasta | 9492 8985 8975 9310 9219 | 7841 7818 8115 7807 7777 | 
| reverse-complement | 5815 5611 5628 5645 5645 | 5176 5152 5177 5173 5289 | 
| binary-trees | 2911 2867 2870 2890 2894 | 2163 2142 2186 2168 2136 | 
| fib2 | 113 112 114 112 112 | 82 83 84 85 85 | 
| fib3 | 301 302 301 300 301 | 260 261 259 260 260 | 
| project-euler.134 | 427 427 427 426 424 | 314 313 315 318 313 | 
 
 
3 comments:
RE: "jno no_overflow"
Shouldn't you take a jump on the uncommon case and fall through on the common case where there is no overflow?
Wolf550e: you'd think so, but in my testing reversing the conditional actually lead to slower performance.
Instead of:
and $0x7,%rax
cmp $0x0,%rax
jne slow_dispatch
You can just do:
and $0x7,%rax
jnz slow_dispatch
(note that jnz and jne are actually the same opcode). All arithmetic and bitwise ops on x86 family set the zero flag.
Best way to introduce this, I think is through a peephole optimizer; It's been a while since I've looked at Factor, I don't recall if it has one. But you can drop the 2nd instruction in a sequence matching:
and|or|xor \1,\2
cmp $0x0, \2
je|jne \3
Post a Comment