I implemented polymorphic inline caching and megamorphic caching to replace Factor's previous method dispatch implementation.
Call sites
A generic word's call site is in one of three states:
- Cold state - the call instruction points to a stub which looks at the class of the object being dispatched upon, generates a new PIC stub containing a single entry, updates the call site, and jumps to the right method.
- Inline cache state - the call instruction points to a generated PIC stub which has one or more class checks, followed by a jump to a method if the checks succeed. If none of the checks succeed, it jumps to the inline cache miss routine. The miss routine does one of two things:
- If the PIC stub already has the maximum number of entries, it sets the call site to point to the generic word's megamorphic stub, shared by all megamorphic call sites.
- If the PIC stub has less than the maximum number of entries, a new PIC stub is generated, with the same set of entries as the last one, together with a new entry for the class being dispatched on.
- Megamorphic state - the call instruction points to the generic word's megamorphic stub. Further dispatches do not modify the call instruction.
When code is loaded into the image, or a full garbage collection occurs, all call sites revert to the cold state. Loading code might define new methods or change class relationships, so caches have to be invalidated in that case to preserve correct language semantics. On a full GC, caches are reverted so that serially monomorphic code is optimized better; if a generic word is called in a long loop with one type, then in a long loop with another type, and so on, it will eventually become megamorphic. Resetting call sites once in a while (a full GC is a relatively rare event) ensures that inline caches better reflect what is going on with the code right now. After implementing PICs, I learned that the V8 JavaScript VM uses the same strategy to invalidate call sites once in a while, so it sounds like I did the right thing.
Direct class checks versus subclass checks
Factor supports various forms of subtyping. For example, tuple classes can inherit from other tuple classes, and also, a union class can be defined whose members are an arbitrary set of classes. When a method is defined on a class, it will apply to all subclasses as well, unless explicitly overriden. Checking if an object is an instance of a class X is a potentially expensive operation, because it is not enough to compare direct classes, superclasses and union members have to be inspected also. Polymorphic inline caching only deals with direct classes, because direct class checks are much quicker (for tuples, just compare the tuple layouts for pointer equality; for instances of built-in classes, compare tag numbers). So for example if a generic word has a single method for the sequence class, and a call site calls the generic with arrays and strings, then the inline cache generated there will have two checks, one for strings, and one for arrays, and both checks will point at the same method.
Inline cache stubs
Inline cache stubs are generated by the VM in the file inline_cache.cpp. This generates machine code, but I'll use C-like pseudocode for examples instead. An inline cache stub looks like this, where
index
is the parameter number being dispatched on (generic words defined with GENERIC: use 0, generic words defined with GENERIC# can dispatch on any parameter):
void *obj = datastack[index];
void *class = class_of(obj);
if(class == cached_class_1) cached_method_1();
else if(class == cached_class_2) cached_methd_2();
...
else cache_miss(call_site_address,index,class,generic);
Getting the direct class of an object (what I call
class_of()
in the pseudocode) is a relatively cheap operation. I briefly discussed pointer tagging, type headers and tuple layout headers in my C++ VM post. In the most general case, getting the direct class looks like this, again in vague C-like pseudocode:if(tag(obj) == hi_tag) return obj.header;
else if(tag(obj) == tuple_layout) return obj.layout;
else return tag(obj);
However, if the cached classes consist only of tag classes (the 7 built-in types whose instances can be differentiated based on tag alone) then the generated stub doesn't need to do the other checks; it just compares the pointer tag against the cached classes. Similarly, if tag and tuple classes are in the cache but no hi-tag classes, one branch can be avoided. In total, the PIC generation code uses one of four stubs for getting the object's class, and the right stub to use is determined by looking at the cached classes (see the function
determine_inline_cache_type()
in inline_cache.cpp
).Once the stub has been generated, it is installed at the call site and invoked. This way, a PIC miss is transparent to the caller.
Patching call sites
The code for getting and setting jump targets from jump and call instructions is of course CPU-specific; see the
get_call_target()
and set_call_target()
functions, and note that on PowerPC, the instruction cache needs to be flushed manually:Recall that there are two situations in which the PIC miss code can be invoked:
- For a cold call site, to generate the initial stub
- For a polymorphic call site, to add a new entry to the PIC
If the call site is a non-tail call, then the return address will be pushed on the stack (x86) or stored in a link register (PowerPC). If the call site is a tail call, then Factor's code generator now moves the current instruction pointer into a register before performing the tail call. This code is generated even when tail-calling words that are not generic; it is a no-op in this case. This is only a single integer load, so other than increasing code size it really has no effect on performance, since on a modern out-of-order CPU the load and jump will execute concurrently. For example, here is how the code snippet
beer get length
compiles on x86:0x000000010b8628c0: mov $0x10b8628c0,%r8
0x000000010b8628ca: pushq $0x20
0x000000010b8628cf: push %r8
0x000000010b8628d1: sub $0x8,%rsp
0x000000010b8628d5: add $0x10,%r14
0x000000010b8628d9: mov $0x10a6c8186,%rax
0x000000010b8628e3: mov %rax,-0x8(%r14)
0x000000010b8628e7: mov $0x100026c60,%rax
0x000000010b8628f1: mov (%rax),%rcx
0x000000010b8628f4: mov %rcx,(%r14)
0x000000010b8628f7: callq 0x10b1e0de0
0x000000010b8628fc: add $0x18,%rsp
0x000000010b862900: mov $0x10b86290f,%rbx
0x000000010b86290a: jmpq 0x10b70f4a0
The first few instructions set up a call stack frame, then the
beer
symbol is pushed on the stack, and the (inlined) definition of get
follows. Finally, the address of the jump instruction is loaded into the rbx
register and a jump is performed. Initially, the jump points at the PIC miss stub, so the first time this machine code executes, the jump instruction's target will be changed to a newly-generated PIC.However, the
inline_cache_miss()
function in the VM takes the call site address as an input parameter, and after all, it is written in C++. Where does this input parameter actually come from? The answer is that there are short assembly stubs that sit between the actual call site and the PIC miss code in the VM. The stubs get the return address in a CPU-specific manner, and then call the VM. This code is in the primitive_inline_cache_miss
procedure, which is again defined in several places:Note that the functions have two entry points. The first entry point is taken for non-tail call sites, the second one is taken for tail call sites.
An optimization
Since allocation of PICs is something that happens rather frequently, it is good to optimize this operation. I implemented segregated free lists for the code heap. Allocating lots of small code heap blocks requires less work now. Also, since every PIC is referenced from at most one call site, when a call site's PIC is regenerated with a new set of cached classes, the only PIC can be returned to the free list immediately. This reduces the frequency of full garbage collections (when the code heap fills up, a full GC must be performed; there is no generational tenuring for code blocks).
Megamorphic caching
If a call site is called with more than a few distinct classes of instances (the default maximum PIC size is 3 entries) then it is patched to call the megamorphic dispatch stub for that generic word. Megamorphic stubs are again generated in the VM, in the dispatch.cpp source file. The machine code that is generated, were it to be expressed in pseudo-C, looks like this:
void *obj = datastack[index];
void *class = class_of(obj);
if(cache[class] == class) goto cache[class + 1];
else megamorphic_cache_miss(class,generic,cache);
Every generic word has a fixed-size megamorphic cache of 32-entries (this can be changed by bootstrapping a new image, but I'm not sure there's any point). It works like a hashtable, except if there is a collision, the old entry is simply evicted; there is no bucket chaining or probing strategy here. While less efficient than a polymorphic inline cache hit (which only entails a direct jump), megamorphic cache hits are still rather quick (some arithmetic and an indirect jump). There are no calls into the VM, it is all inline assembly. A megamorphic inline cache miss calls the
mega_cache_miss
primitive which computes the correct method and updates the cache.Performance
I added some instrumentation code to the VM which counts the number of cache hits and misses. The majority of call sites are monomorphic or polymorphic, and megamorphic call sites are very rare. Megamorphic cache misses are rarer still.
The performance gain for benchmark runtimes was as I expected; a 5-10% gain. The old method dispatch system was already pretty efficient, and modern CPUs have branch prediction which helped a lot with it. For compile time, the gain was a lot more drastic, however, and definitely made the effort I put into implementing PICs pay off; bootstrap is now a full minute faster than before, clocking in at around 3 minutes, and the
load-all
word, which loads and compiles all the libraries in the distribution, used to take 25 minutes and now takes 11 minutes. All timings are from my MacBook Pro.The compile-time gain is not a direct result of the inline caching, but rather stems from the fact that the compiler has to compile less code. With the old method dispatch implementation, every generic word was an ordinary word under the hood, with a huge, auto-generated body containing conditionals and jump tables. When a method was added or removed, the dispatcher code had to be re-generated and re-compiled, which takes a long time. With polymorphic inline caching, what happens is now the VM essentially lazily generates just those parts of the dispatch code which are actually used, and since it uses the VM's JIT code instead of the optimizing compiler, it can just glue bits of machine code together instead of building an IR and optimizing it; which yielded no benefits for the type of code that appeared in a generic word body anyway.
Improving compile time is one of my major goals for Factor 1.0, since it improves user experience when doing interactive development, and I'm very happy with the progress so far.