Tuesday, May 26, 2009

Factor's implementation of polymorphic inline caching

Polymorphic inline caching is a technique to speed up method dispatch by generating custom code at method call sites that checks the receiver against only those classes that have come up at that call site. If the receiver's class does not appear in the polymorphic inline cache (this case is known as a "cache miss"), then a new call stub is generated, which checks the receiver's class in addition to the classes already in the stub. This proceeds until the stub reaches a maximum size, after which point further cache misses rewrite the caller to call a general "megamorphic" stub which performs a full method dispatch. A full description can be found in the original paper on the topic.

I implemented polymorphic inline caching and megamorphic caching to replace Factor's previous method dispatch implementation.

Call sites

A generic word's call site is in one of three states:
  • Cold state - the call instruction points to a stub which looks at the class of the object being dispatched upon, generates a new PIC stub containing a single entry, updates the call site, and jumps to the right method.
  • Inline cache state - the call instruction points to a generated PIC stub which has one or more class checks, followed by a jump to a method if the checks succeed. If none of the checks succeed, it jumps to the inline cache miss routine. The miss routine does one of two things:
    • If the PIC stub already has the maximum number of entries, it sets the call site to point to the generic word's megamorphic stub, shared by all megamorphic call sites.
    • If the PIC stub has less than the maximum number of entries, a new PIC stub is generated, with the same set of entries as the last one, together with a new entry for the class being dispatched on.
  • Megamorphic state - the call instruction points to the generic word's megamorphic stub. Further dispatches do not modify the call instruction.

When code is loaded into the image, or a full garbage collection occurs, all call sites revert to the cold state. Loading code might define new methods or change class relationships, so caches have to be invalidated in that case to preserve correct language semantics. On a full GC, caches are reverted so that serially monomorphic code is optimized better; if a generic word is called in a long loop with one type, then in a long loop with another type, and so on, it will eventually become megamorphic. Resetting call sites once in a while (a full GC is a relatively rare event) ensures that inline caches better reflect what is going on with the code right now. After implementing PICs, I learned that the V8 JavaScript VM uses the same strategy to invalidate call sites once in a while, so it sounds like I did the right thing.

Direct class checks versus subclass checks

Factor supports various forms of subtyping. For example, tuple classes can inherit from other tuple classes, and also, a union class can be defined whose members are an arbitrary set of classes. When a method is defined on a class, it will apply to all subclasses as well, unless explicitly overriden. Checking if an object is an instance of a class X is a potentially expensive operation, because it is not enough to compare direct classes, superclasses and union members have to be inspected also. Polymorphic inline caching only deals with direct classes, because direct class checks are much quicker (for tuples, just compare the tuple layouts for pointer equality; for instances of built-in classes, compare tag numbers). So for example if a generic word has a single method for the sequence class, and a call site calls the generic with arrays and strings, then the inline cache generated there will have two checks, one for strings, and one for arrays, and both checks will point at the same method.

Inline cache stubs

Inline cache stubs are generated by the VM in the file inline_cache.cpp. This generates machine code, but I'll use C-like pseudocode for examples instead. An inline cache stub looks like this, where index is the parameter number being dispatched on (generic words defined with GENERIC: use 0, generic words defined with GENERIC# can dispatch on any parameter):

void *obj = datastack[index];
void *class = class_of(obj);
if(class == cached_class_1) cached_method_1();
else if(class == cached_class_2) cached_methd_2();
else cache_miss(call_site_address,index,class,generic);

Getting the direct class of an object (what I call class_of() in the pseudocode) is a relatively cheap operation. I briefly discussed pointer tagging, type headers and tuple layout headers in my C++ VM post. In the most general case, getting the direct class looks like this, again in vague C-like pseudocode:
if(tag(obj) == hi_tag) return obj.header;
else if(tag(obj) == tuple_layout) return obj.layout;
else return tag(obj);

However, if the cached classes consist only of tag classes (the 7 built-in types whose instances can be differentiated based on tag alone) then the generated stub doesn't need to do the other checks; it just compares the pointer tag against the cached classes. Similarly, if tag and tuple classes are in the cache but no hi-tag classes, one branch can be avoided. In total, the PIC generation code uses one of four stubs for getting the object's class, and the right stub to use is determined by looking at the cached classes (see the function determine_inline_cache_type() in inline_cache.cpp).

Once the stub has been generated, it is installed at the call site and invoked. This way, a PIC miss is transparent to the caller.

Patching call sites

The code for getting and setting jump targets from jump and call instructions is of course CPU-specific; see the get_call_target() and set_call_target() functions, and note that on PowerPC, the instruction cache needs to be flushed manually:

Recall that there are two situations in which the PIC miss code can be invoked:
  • For a cold call site, to generate the initial stub
  • For a polymorphic call site, to add a new entry to the PIC

If the call site is a non-tail call, then the return address will be pushed on the stack (x86) or stored in a link register (PowerPC). If the call site is a tail call, then Factor's code generator now moves the current instruction pointer into a register before performing the tail call. This code is generated even when tail-calling words that are not generic; it is a no-op in this case. This is only a single integer load, so other than increasing code size it really has no effect on performance, since on a modern out-of-order CPU the load and jump will execute concurrently. For example, here is how the code snippet beer get length compiles on x86:
0x000000010b8628c0: mov    $0x10b8628c0,%r8
0x000000010b8628ca: pushq $0x20
0x000000010b8628cf: push %r8
0x000000010b8628d1: sub $0x8,%rsp
0x000000010b8628d5: add $0x10,%r14
0x000000010b8628d9: mov $0x10a6c8186,%rax
0x000000010b8628e3: mov %rax,-0x8(%r14)
0x000000010b8628e7: mov $0x100026c60,%rax
0x000000010b8628f1: mov (%rax),%rcx
0x000000010b8628f4: mov %rcx,(%r14)
0x000000010b8628f7: callq 0x10b1e0de0
0x000000010b8628fc: add $0x18,%rsp
0x000000010b862900: mov $0x10b86290f,%rbx
0x000000010b86290a: jmpq 0x10b70f4a0

The first few instructions set up a call stack frame, then the beer symbol is pushed on the stack, and the (inlined) definition of get follows. Finally, the address of the jump instruction is loaded into the rbx register and a jump is performed. Initially, the jump points at the PIC miss stub, so the first time this machine code executes, the jump instruction's target will be changed to a newly-generated PIC.

However, the inline_cache_miss() function in the VM takes the call site address as an input parameter, and after all, it is written in C++. Where does this input parameter actually come from? The answer is that there are short assembly stubs that sit between the actual call site and the PIC miss code in the VM. The stubs get the return address in a CPU-specific manner, and then call the VM. This code is in the primitive_inline_cache_miss procedure, which is again defined in several places:

Note that the functions have two entry points. The first entry point is taken for non-tail call sites, the second one is taken for tail call sites.

An optimization

Since allocation of PICs is something that happens rather frequently, it is good to optimize this operation. I implemented segregated free lists for the code heap. Allocating lots of small code heap blocks requires less work now. Also, since every PIC is referenced from at most one call site, when a call site's PIC is regenerated with a new set of cached classes, the only PIC can be returned to the free list immediately. This reduces the frequency of full garbage collections (when the code heap fills up, a full GC must be performed; there is no generational tenuring for code blocks).

Megamorphic caching

If a call site is called with more than a few distinct classes of instances (the default maximum PIC size is 3 entries) then it is patched to call the megamorphic dispatch stub for that generic word. Megamorphic stubs are again generated in the VM, in the dispatch.cpp source file. The machine code that is generated, were it to be expressed in pseudo-C, looks like this:
void *obj = datastack[index];
void *class = class_of(obj);
if(cache[class] == class) goto cache[class + 1];
else megamorphic_cache_miss(class,generic,cache);

Every generic word has a fixed-size megamorphic cache of 32-entries (this can be changed by bootstrapping a new image, but I'm not sure there's any point). It works like a hashtable, except if there is a collision, the old entry is simply evicted; there is no bucket chaining or probing strategy here. While less efficient than a polymorphic inline cache hit (which only entails a direct jump), megamorphic cache hits are still rather quick (some arithmetic and an indirect jump). There are no calls into the VM, it is all inline assembly. A megamorphic inline cache miss calls the mega_cache_miss primitive which computes the correct method and updates the cache.


I added some instrumentation code to the VM which counts the number of cache hits and misses. The majority of call sites are monomorphic or polymorphic, and megamorphic call sites are very rare. Megamorphic cache misses are rarer still.

The performance gain for benchmark runtimes was as I expected; a 5-10% gain. The old method dispatch system was already pretty efficient, and modern CPUs have branch prediction which helped a lot with it. For compile time, the gain was a lot more drastic, however, and definitely made the effort I put into implementing PICs pay off; bootstrap is now a full minute faster than before, clocking in at around 3 minutes, and the load-all word, which loads and compiles all the libraries in the distribution, used to take 25 minutes and now takes 11 minutes. All timings are from my MacBook Pro.

The compile-time gain is not a direct result of the inline caching, but rather stems from the fact that the compiler has to compile less code. With the old method dispatch implementation, every generic word was an ordinary word under the hood, with a huge, auto-generated body containing conditionals and jump tables. When a method was added or removed, the dispatcher code had to be re-generated and re-compiled, which takes a long time. With polymorphic inline caching, what happens is now the VM essentially lazily generates just those parts of the dispatch code which are actually used, and since it uses the VM's JIT code instead of the optimizing compiler, it can just glue bits of machine code together instead of building an IR and optimizing it; which yielded no benefits for the type of code that appeared in a generic word body anyway.

Improving compile time is one of my major goals for Factor 1.0, since it improves user experience when doing interactive development, and I'm very happy with the progress so far.


Sohail Somani said...

Hey, just a question out of curiosity. How are you able to keep up development, and at such practical levels? Are you using this in your work? Can you talk about it?

Keep it up, I geek through you.

Sohail Somani said...

PS: been following you since you implemented scripting for your game in Java. I think I still have a screenshot you posted somewhere :-)

Wolf550e said...

When you go multithreaded, how will you update the generated code atomically?

Sohail: seems that Factor is Slava's day job, so he is funded somehow, perhaps by being independently wealthy or by a benefactor.

Anonymous said...

Hi Slava, there's a more comprehensive archive of Self papers in PDF at the selflanguage.org site that you might find useful. --Russell

Anonymous said...

LTU has a thread on your current Factor progress, and I was curious about your work schedule too.

Eliot said...

Hi Slava,

did you look at introducing another state between the cold and the polymorphic inline cache state? I find that in Smalltalk the vast majority of linked sends have only 1 receiver type. Essentially rough numbers are in methods that get jitted the ration of untaken sends (that remain as stubs) to taken sends (that get linked) is about 35:65. Of the 65% of taken sends the rough numbers are 90 monomorphic to 9 polymorphic (of degree ~< 6) to 1 megamorphic. With these numbers it makes sense to provide a monomorphic entry-point in each method and first link the stub directly to the target method, only allocating a PIC if the monomorphic entry check fails. You should reduce your PIC allocation rate by about a factor of 10 if Factor's polymorphism is at or below that of typical Smalltalk.

A further tweak I added in my Cog VM recently is to keep note of megamorphic selectors and convert a monomorphic send to a megamorphic send when the monomorphic send of a polymorphic selector fails. I do this differently from you. My megamorphic sends are handled by what I call Open PICs (Closed PICs being the conventional fixed maximum size dispatch table). An Open PIC is a hash lookup of a method cache, aborting to a full lookup on failure. Since the number of Open PICs is small (< 10% of all PICs) I keep them in a linked list and search the list for an Open PCI with a matching selector whenever a monomorphic send fails. This both cuts down on duplication of Open PICs and saves time evolving a Closed PIC that is very likely to become megamorphic.

Hope this makes sense.