Thursday, July 09, 2009

Improvements to Factor's register allocator

For the last couple of months I've been working on some improvements to Factor's low-level optimizer and code generator. The major new change is that the control flow graph IR is now allowed to define registers that are used in more than one basic block. Previously, all values would be loaded from the data stack at the start of a basic block and stored to the stack at the end. Register allocation in this case is called local register allocation. Now the only time when values have to be saved to the stack is for subroutine calls. The part of the compiler most affected by this is of course the register allocator, because now it needs to do global register allocation.

I will describe the new optimizations in a future blog post and talk about global register allocation here. To help people who are learning Factor or are simply curious about what real Factor code looks like, I've inserted lots of links from this blog post to our online code repository browser.

Instead of rewriting the register allocator from scratch, I took a more incremental approach, gradually refactoring it to operate on the entire control flow graph instead of a basic block at a time. Generalizing the linear scan algorithm to do this is relatively straightforward, but there are some subtleties. Once again, I used Christian Wimmer's masters thesis as a guide, however this time around I implemented almost the entire algorithm described there, instead of the simplification to the local case.

First, recall some terminology: virtual registers are an abstraction of an ideal CPUs register file: there can be arbitrarily many of them, and each one is only ever assigned to once. The register allocator's job is to rewrite the program to use physical registers instead, possibly mapping a number of virtual registers to the same physical register, and inserting spills and reloads if there are insufficient physical registers to store all temporary values used by the program.

Linearization


A local register allocator operates on a basic block, which is a linear list of instructions. For the global case, linear scan still requires a linear list of instructions as input, so the first step is to traverse the CFG in reverse post order and number the instructions.

Building live intervals


Once a linear ordering among all instructions has been stablished, the second step in linear scan register allocation is building live intervals. In the local case, a virtual register is defined in the same basic block as all of its usages. Furthermore, if we assume single assignment, then there will be only a single definition, and it will precede all uses. So the set of instructions where the register is live is a single interval, which begins at the definition and ends at the last use.

In the global case, the situation is more complicated. First of all, because the linearization chosen is essentially arbitrary, and does not reflect the actual control flow in reality, instructions are not necessarily executed in order, so if a register is defined at position A and used at position B, it does not mean that the register is live at every position between A and B. Indeed, in the global case, the live "interval" of a register may consist of a number of disjoint ranges of instructions.

As an example, consider the following C program:
1: int x = ...;
2: if(...) {
3: ... a bunch of code that does not use x
4: } else {
5: int y = x + 1;
6: ... code that doesn't use x
7: }
8: ... more code that doesn't use x

The value x is live at position 1, and also at position 5. The last usage of x is at line 5, so it is not live at any position after 5, but also, it is not live at position 3 either, since if the true branch is taken, then x is never used at all. We say that x has a lifetime hole at position 3.

Another case where lifetime holes come up is virtual registers with multiple definitions. The very last step in the low level optimizer, right before register allocation, is phi node elimination. Essentially, phi node elimination turns this,
if(informal) {
x1 = "hi";
} else {
x2 = "hello";
}
x = phi(x1,x2);

into the following:
if(informal) {
x1 = "hi";
x = x1;
} else {
x2 = "hello";
x = x2;
}

Now, the value x is not live at the line x2 = "hello";, because it hasn't been defined yet, and the previous definition is not available either. So the lifetime interval for x has a hole at this location.

To represent such complex live intervals, the live-interval data type now contains a list of live ranges. Since basic blocks cannot contain control flow or multiple definitions, there can be at most one live range per basic block per live interval. (The only time multiple definitions come up is as a result of Phi node elimination, and since the input is in SSA form this cannot introduce multiple assignments in the same block).

Information from liveness analysis is used to construct live ranges. A basic block's live range contains the first definition of the register as well as the last use. If the register is in the "live in" set of the basic block, then the range begins at the block's first instruction, and if the register is in the "live out" set then the range ends at the block's last instruction. Note that even if a basic block has no usages of a register, it may appear in both the live in and live out sets; for example, the register might be defined in a predecessor block, and used in a successor block.

Allocating physical registers to live intervals


Once live intervals have been built, the next step is to assign physical registers to them. During this process, live intervals may be split, with register to memory, memory to register and register to register moves inserted in between; a single virtual register may be in different physical registers and memory locations during its lifetime.

In the local case, the allocation algorithm maintains two pieces of state; the "unhandled set", which is a heap of intervals sorted by start position, and an "active set", which is a list of intervals currently assigned to physical registers. The main algorithm removes the smallest element from the unhandled set, and then removes anything from the active set which ends before the new interval begins. The next step depends on whether or not any physical registers are free. If all physical registers are taken up in the active set, then a decision is made to spill either the new interval, or a member of the active set. Spilling will free up at least one physical register. At this point, a physical register is now free and can be assigned to the new interval, which is then added to the active set. Spilling may split intervals into smaller pieces, and add new elements to the unhandled set, but since the new split intervals are strictly smaller than the original interval, the process eventually terminates. Once the unhandled set is empty, allocation is complete.

The global case is more complicated, primarily due to the presence of lifetime holes in live intervals. A third piece of state is maintained by the allocation pass. This new set, the "inactive set", contains live intervals which have not ended yet, but are currently in a lifetime hole. To illustrate, suppose we have two physical registers, R1 and R2, and three virtual registers, V1, V2, and V3, with the following live intervals:

[Program start ....... Program end]
V1: <======> <=========>
V2: <==============>
V3: <================>
^
|
Current position

Immediately before V3's live interval begins, the active set includes V2, and the inactive set contains V1, since V1 is not finished yet, but the current position is in a lifetime hole. Suppose that V1 was assigned to physical register R1, and V2 was assigned to physical register R2. In the local case, since the active set only includes one element, V1, we could conclude that R2 was free, and that V3 could be assigned to R2. However, this is invalid, since at a later point, V1 becomes active again, and V1 uses R2. Two overlapping live intervals cannot use the same physical register. So global linear scan also has to examine the inactive set in order to see if the new live interval can fit in an inactive interval's lifetime hole. In this case, it cannot, so we have to split V3, and insert a copy instruction in the middle:
      [Program start ....... Program end]
V1: <======> <=========>
V2: <==============>
V3_1: <=========>
V3_2: <====>
^
|
Current position

With V3 split into V3_1 and V3_2, a valid register assignment is possible, with V3_1 stored in R1 and V3_2 stored in R2.

In this case, a register assignment without any spilling is possible. However, sometimes, virtual registers have to be stored to memory. This is done by the spilling code which uses a similar algorithm to the local case. The main complication is that in order to free up a physical register for the entire lifetime of the new interval, more than one interval may need to be split; zero or one active intervals, and zero or more inactive intervals.

Assigning physical registers to instructions


After live intervals are split, and physical registers are assigned to live intervals, the assignment pass iterates over all instructions, storing a mapping from virtual registers to physical registers in each instruction. Recall that this cannot be a global mapping, since a virtual register may reside in several different physical registers at different locations in the program.

The algorithm here is essentially unchanged from the local case. Two additions are that at the start and end of each basic block, the assignment pass records which registers are live. This information is used by GC checks. GC checks are performed on basic block boundaries of blocks which allocate memory. Since a register may be defined in one basic block, and used in another, with a GC check in between, GC checks need to store registers which contain pointers to a special location in the call stack and pass a pointer to this location to the GC. Some language implementations use a conservative GC to get around having to do this, but I think recording accurate pointer root information is a superior approach. This liveness information is also used by the resolve pass, coming up next.

The resolve pass


Since linear scan does not take control flow into account, interval splitting will give incorrect results if implemented naively. Consider the following example program:
x = ...;

if(...) {
...
y = x + 1;
... some code with high register pressure here
} else {
z = x + 2;
}

If the live interval of x was spilled immediately after the line y = x + 1; then the register allocator would rewrite it as follows -- it would be spilled after the last usage before the spill point, and reloaded before the first usage after the spill point:
x = ...;

if(...) {
...
y = x + 1;
spill(x);
... some code with high register pressure here
} else {
reload(x);
z = x + 2;
}

In this case however, the spill and reload locations are in different control flow paths. However, linear scan does not take control flow into account during the main algorithm. Instead, a subsequent resolve pass. The resolve pass looks at every edge in the control flow graph, and compares the physical register and spill slot assignments of all virtual registers which are both live across the edge. If they differ, moves, spills and reloads are inserted on the edge. Conceptually, this pass is simple, but it feels hackish -- it seems that a more elegant register allocator would incorporate this into the previous stages of allocation. However, all formulations of linear scan I've seen so far work this way. The only tricky part about the resolve pass is that the moves have to be performed "in parallel"; for example, suppose we have two virtual registers V1 and V2, and two physical registers R1 and R2. Block A and block B are joined by an edge. At the end of block A, V1 is in R1, and V2 is in R2. At the start of block B, V1 is in R2 and V2 is in R1. Doing the moves in any order would give the wrong result; for correct behavior, the cycle has to be broken with a third temporary location. The logic for doing this in the general case is a bit complex; Doug Coleman helped me implement the mapping code.

General observations


While implementing the register allocator improvements, I relied heavily on unit testing as well as an informal "design by contract"; basically, making extensive use of runtime assertions in the code itself. The compiler.cfg.linear-scan vocabulary weighs in at 3828 lines of code; 2484 lines of this are unit tests. This is certainly a much higher test:code ratio than most other Factor code I've written, and some of the tests are not very useful or redundant. However, they represent a log of features and bugs I've implemented, and at the very least, minimize regressions.

When testing the register allocator, I first got it working on x86-64. Here, there are enough registers that spilling almost never occurs, so I could focus on getting the primary logic correct. Then, I hacked the x86-64 backend to only use 4 physical registers, and debugged spilling.

The whole project took a bit longer than I had hoped, but I managed to start implementing a number of additional optimizations in the meantime. When they are more complete I will blog about them.

Tuesday, May 26, 2009

Factor's implementation of polymorphic inline caching

Polymorphic inline caching is a technique to speed up method dispatch by generating custom code at method call sites that checks the receiver against only those classes that have come up at that call site. If the receiver's class does not appear in the polymorphic inline cache (this case is known as a "cache miss"), then a new call stub is generated, which checks the receiver's class in addition to the classes already in the stub. This proceeds until the stub reaches a maximum size, after which point further cache misses rewrite the caller to call a general "megamorphic" stub which performs a full method dispatch. A full description can be found in the original paper on the topic.

I implemented polymorphic inline caching and megamorphic caching to replace Factor's previous method dispatch implementation.

Call sites


A generic word's call site is in one of three states:
  • Cold state - the call instruction points to a stub which looks at the class of the object being dispatched upon, generates a new PIC stub containing a single entry, updates the call site, and jumps to the right method.
  • Inline cache state - the call instruction points to a generated PIC stub which has one or more class checks, followed by a jump to a method if the checks succeed. If none of the checks succeed, it jumps to the inline cache miss routine. The miss routine does one of two things:
    • If the PIC stub already has the maximum number of entries, it sets the call site to point to the generic word's megamorphic stub, shared by all megamorphic call sites.
    • If the PIC stub has less than the maximum number of entries, a new PIC stub is generated, with the same set of entries as the last one, together with a new entry for the class being dispatched on.
  • Megamorphic state - the call instruction points to the generic word's megamorphic stub. Further dispatches do not modify the call instruction.

When code is loaded into the image, or a full garbage collection occurs, all call sites revert to the cold state. Loading code might define new methods or change class relationships, so caches have to be invalidated in that case to preserve correct language semantics. On a full GC, caches are reverted so that serially monomorphic code is optimized better; if a generic word is called in a long loop with one type, then in a long loop with another type, and so on, it will eventually become megamorphic. Resetting call sites once in a while (a full GC is a relatively rare event) ensures that inline caches better reflect what is going on with the code right now. After implementing PICs, I learned that the V8 JavaScript VM uses the same strategy to invalidate call sites once in a while, so it sounds like I did the right thing.

Direct class checks versus subclass checks


Factor supports various forms of subtyping. For example, tuple classes can inherit from other tuple classes, and also, a union class can be defined whose members are an arbitrary set of classes. When a method is defined on a class, it will apply to all subclasses as well, unless explicitly overriden. Checking if an object is an instance of a class X is a potentially expensive operation, because it is not enough to compare direct classes, superclasses and union members have to be inspected also. Polymorphic inline caching only deals with direct classes, because direct class checks are much quicker (for tuples, just compare the tuple layouts for pointer equality; for instances of built-in classes, compare tag numbers). So for example if a generic word has a single method for the sequence class, and a call site calls the generic with arrays and strings, then the inline cache generated there will have two checks, one for strings, and one for arrays, and both checks will point at the same method.

Inline cache stubs


Inline cache stubs are generated by the VM in the file inline_cache.cpp. This generates machine code, but I'll use C-like pseudocode for examples instead. An inline cache stub looks like this, where index is the parameter number being dispatched on (generic words defined with GENERIC: use 0, generic words defined with GENERIC# can dispatch on any parameter):

void *obj = datastack[index];
void *class = class_of(obj);
if(class == cached_class_1) cached_method_1();
else if(class == cached_class_2) cached_methd_2();
...
else cache_miss(call_site_address,index,class,generic);

Getting the direct class of an object (what I call class_of() in the pseudocode) is a relatively cheap operation. I briefly discussed pointer tagging, type headers and tuple layout headers in my C++ VM post. In the most general case, getting the direct class looks like this, again in vague C-like pseudocode:
if(tag(obj) == hi_tag) return obj.header;
else if(tag(obj) == tuple_layout) return obj.layout;
else return tag(obj);

However, if the cached classes consist only of tag classes (the 7 built-in types whose instances can be differentiated based on tag alone) then the generated stub doesn't need to do the other checks; it just compares the pointer tag against the cached classes. Similarly, if tag and tuple classes are in the cache but no hi-tag classes, one branch can be avoided. In total, the PIC generation code uses one of four stubs for getting the object's class, and the right stub to use is determined by looking at the cached classes (see the function determine_inline_cache_type() in inline_cache.cpp).

Once the stub has been generated, it is installed at the call site and invoked. This way, a PIC miss is transparent to the caller.

Patching call sites


The code for getting and setting jump targets from jump and call instructions is of course CPU-specific; see the get_call_target() and set_call_target() functions, and note that on PowerPC, the instruction cache needs to be flushed manually:


Recall that there are two situations in which the PIC miss code can be invoked:
  • For a cold call site, to generate the initial stub
  • For a polymorphic call site, to add a new entry to the PIC

If the call site is a non-tail call, then the return address will be pushed on the stack (x86) or stored in a link register (PowerPC). If the call site is a tail call, then Factor's code generator now moves the current instruction pointer into a register before performing the tail call. This code is generated even when tail-calling words that are not generic; it is a no-op in this case. This is only a single integer load, so other than increasing code size it really has no effect on performance, since on a modern out-of-order CPU the load and jump will execute concurrently. For example, here is how the code snippet beer get length compiles on x86:
0x000000010b8628c0: mov    $0x10b8628c0,%r8
0x000000010b8628ca: pushq $0x20
0x000000010b8628cf: push %r8
0x000000010b8628d1: sub $0x8,%rsp
0x000000010b8628d5: add $0x10,%r14
0x000000010b8628d9: mov $0x10a6c8186,%rax
0x000000010b8628e3: mov %rax,-0x8(%r14)
0x000000010b8628e7: mov $0x100026c60,%rax
0x000000010b8628f1: mov (%rax),%rcx
0x000000010b8628f4: mov %rcx,(%r14)
0x000000010b8628f7: callq 0x10b1e0de0
0x000000010b8628fc: add $0x18,%rsp
0x000000010b862900: mov $0x10b86290f,%rbx
0x000000010b86290a: jmpq 0x10b70f4a0

The first few instructions set up a call stack frame, then the beer symbol is pushed on the stack, and the (inlined) definition of get follows. Finally, the address of the jump instruction is loaded into the rbx register and a jump is performed. Initially, the jump points at the PIC miss stub, so the first time this machine code executes, the jump instruction's target will be changed to a newly-generated PIC.

However, the inline_cache_miss() function in the VM takes the call site address as an input parameter, and after all, it is written in C++. Where does this input parameter actually come from? The answer is that there are short assembly stubs that sit between the actual call site and the PIC miss code in the VM. The stubs get the return address in a CPU-specific manner, and then call the VM. This code is in the primitive_inline_cache_miss procedure, which is again defined in several places:

Note that the functions have two entry points. The first entry point is taken for non-tail call sites, the second one is taken for tail call sites.

An optimization


Since allocation of PICs is something that happens rather frequently, it is good to optimize this operation. I implemented segregated free lists for the code heap. Allocating lots of small code heap blocks requires less work now. Also, since every PIC is referenced from at most one call site, when a call site's PIC is regenerated with a new set of cached classes, the only PIC can be returned to the free list immediately. This reduces the frequency of full garbage collections (when the code heap fills up, a full GC must be performed; there is no generational tenuring for code blocks).

Megamorphic caching


If a call site is called with more than a few distinct classes of instances (the default maximum PIC size is 3 entries) then it is patched to call the megamorphic dispatch stub for that generic word. Megamorphic stubs are again generated in the VM, in the dispatch.cpp source file. The machine code that is generated, were it to be expressed in pseudo-C, looks like this:
void *obj = datastack[index];
void *class = class_of(obj);
if(cache[class] == class) goto cache[class + 1];
else megamorphic_cache_miss(class,generic,cache);

Every generic word has a fixed-size megamorphic cache of 32-entries (this can be changed by bootstrapping a new image, but I'm not sure there's any point). It works like a hashtable, except if there is a collision, the old entry is simply evicted; there is no bucket chaining or probing strategy here. While less efficient than a polymorphic inline cache hit (which only entails a direct jump), megamorphic cache hits are still rather quick (some arithmetic and an indirect jump). There are no calls into the VM, it is all inline assembly. A megamorphic inline cache miss calls the mega_cache_miss primitive which computes the correct method and updates the cache.

Performance


I added some instrumentation code to the VM which counts the number of cache hits and misses. The majority of call sites are monomorphic or polymorphic, and megamorphic call sites are very rare. Megamorphic cache misses are rarer still.

The performance gain for benchmark runtimes was as I expected; a 5-10% gain. The old method dispatch system was already pretty efficient, and modern CPUs have branch prediction which helped a lot with it. For compile time, the gain was a lot more drastic, however, and definitely made the effort I put into implementing PICs pay off; bootstrap is now a full minute faster than before, clocking in at around 3 minutes, and the load-all word, which loads and compiles all the libraries in the distribution, used to take 25 minutes and now takes 11 minutes. All timings are from my MacBook Pro.

The compile-time gain is not a direct result of the inline caching, but rather stems from the fact that the compiler has to compile less code. With the old method dispatch implementation, every generic word was an ordinary word under the hood, with a huge, auto-generated body containing conditionals and jump tables. When a method was added or removed, the dispatcher code had to be re-generated and re-compiled, which takes a long time. With polymorphic inline caching, what happens is now the VM essentially lazily generates just those parts of the dispatch code which are actually used, and since it uses the VM's JIT code instead of the optimizing compiler, it can just glue bits of machine code together instead of building an IR and optimizing it; which yielded no benefits for the type of code that appeared in a generic word body anyway.

Improving compile time is one of my major goals for Factor 1.0, since it improves user experience when doing interactive development, and I'm very happy with the progress so far.

Friday, May 22, 2009

Live status display for Factor's build farm, and other improvements

Lately I've made a few additional improvements to the build infrastructure. First, build reports sent to the Factor-builds mailing list are now formatted as HTML, which makes them a bit more readable. Second, there is a Twitter feed of binary uploads: @FactorBuilds. Finally, we now have a live status display for build machines. Previously if I wanted to know what a build machine was doing, I'd have to log in with ssh and poke around -- total waste of time. Now, just clicking on the OS/CPU combination on factorcode.org takes you to a status page. For example, here is the status page for Mac OS X/x86. At the time of this blog post, I see that this machine is running tests for a certain GIT revision. Each GIT revision is a link to the GitHub mirror of the Factor repository so you can see exactly what's going on. Finally, the latest build report can be viewed too; this has the build failures, if any, as well as benchmark timings.

Our continuous build system is called mason and adds up to a total of 1196 lines of code. Eduardo Cavazos wrote the first iteration, called builder. I forked it and added additional features.

It's been over a year since we started using continuous integration in the Factor project, and I can say its been an overwhelming success. The first iteration of the build system would load all libraries and run all unit tests, and send an email with the results. If everything succeeded, it would upload a binary package. Over the last year, more than 5000 binary packages were uploaded. Over time, we added more checks to the build farm. It now runs help lint checks which ensure that examples in documentation work and that there are no broken links, and checks for compiler errors.

I think the code quality has definitely gone up over the last year; having a dozen machines running tests all the time does a good job of triggering obscure bugs, and the automatic upload of binaries when tests pass saves us from the hassle of making manual releases.

I think pretty soon, we're going to start having releases again, in addition to continuous builds. I intend on making the release process semi-automatic; when I decide to do a release, I want to send some type of command to all the build machines to have them build a given GIT ID and upload the binaries to a special directory. The goal is to have regular releases until 1.0, starting from a few weeks from now. I'm not going to commit to a schedule for 1.0 yet, but having regular releases and published change logs is the first step of the process.

Thursday, May 21, 2009

Unboxed tuple arrays in Factor

One difference between high level languages and systems languages is that high level languages typically hide the notion of a pointer, and by extension, value versus reference semantics, from the programmer. Systems languages make this explicit, and the programmer can make a choice whether or not to pass an object by value or by reference.

In particular, this level of control can lead to significant space savings for large arrays of homogeneous data. If the elements of an array are themselves not shared by other structures, then storing values directly in the array will, at the very least, save on the space usage of a pointer, together with an object header.

I don't believe there is any reason for high-level languages to not offer more fine-grained control over heap storage. Factor is dynamically typed but it offers both arrays of unboxed numeric values, and arrays of tuples, which I will describe below. In some cases, using these special-purpose data types can offer significant performance and space gains over polymorphic arrays.

Consider a Factor tuple definition:
TUPLE: point x y ;

We can find out how much space points use in memory -- this is on a x86-64 machine:
( scratchpad ) point new size .
32

The x and y slots are as wide as a machine pointer, so together they use 16 bytes. This means that there is an additional 16-byte overhead for every instance. Indeed, Factor tuples store a header and a pointer to a layout object (shared by all instances of that class) in addition to the slot data itself. While this compares favorably to other dynamic languages (see this blog post for instance; Ruby in particular has really terrible object representation), it still adds up to a significant overhead when you have a large collection of points.

Suppose you have an array of N points. Each array element is a pointer (8 bytes) to a point object (32 bytes). Assuming the points are not referenced from any other collection, that's 40*N bytes for the entire collection. However, there is only 16*N bytes of real data, namely the x and y slots of every element, so we're wasting more than 50% of heap usage here, not to mention increasing GC overhead by allocating tons of small objects.

This overhead can be avoided, however, using Factor's tuple-arrays library. Tuple arrays implement the same sequence protocol as arrays, but the semantics differ. Instead of storing references to objects, they store object slots directly within the array. There are two main restrictions that result from this:
  • Tuple arrays are homogeneous; all elements are of the exact same type. This rules out setting some elements to f to represent missing data, or having elements which are instances of different subclasses of the same base class.
  • Tuple arrays implement value semantics. If you get an element out, and mutate it, you have to store it back into the array to update the original value in the array.

So how do tuple arrays work? The original implementation, by Daniel Ehrenberg defined a single tuple-array class where the constructor took a class word as a parameter, and reflection APIs were used to implement getting and setting elements. While this had all the space savings described above, constructing and deconstructing tuples via reflection is not the fastest thing in the world; even the most efficient possible implementation would require some indirection and type checking in order to work.

A better approach is to generate a new sequence type for each element type. This is what the current implementation does. Because all of Factor's sequence operations are written in terms of a sequence protocol, these generated classes are for the most part transparent to the programmer, and there is no added inconvenience over the previous approach. To use it, you invoke a parsing word with the element type class you want to use:
TUPLE-ARRAY: point

Assuming a class named point was previously defined, this generates a point-array class with associated methods.

Briefly, the implementation works as follows. The underlying storage of a tuple array is an ordinary array. If the tuple class has N slots, then every group of N elements in the underlying array is a single element in the tuple array. The nth and set-nth methods, which are generated once for every tuple array, read and write a group of N elements and package them into a tuple.

Now let's dive into the guts of the implementation. The parsing word is defined as follows:
SYNTAX: TUPLE-ARRAY: scan-word define-tuple-array ;

So it delegates all the hard work to a define-tuple-array word. This word is a functor, so tuple arrays work the same as specialized numeric arrays. Functors are similar to C++ templates in spirit, but really they are implemented entirely as a library, and it's all just syntax sugar for parsing word machinery (which is a lot more powerful than C++ templates). They're named "functors" to annoy category theory fanboys and language purists. A functor definition begins with FUNCTOR::
FUNCTOR: define-tuple-array ( CLASS -- )

A functor is essentially the same as a word defined with ::, but with additional syntax sugar. So CLASS here is a local variable available in the body of the functor. I use the convention of uppercase names for functor parameters. What follows FUNCTOR: is a list of clauses which create new words at parse time. The left hand side of each clause is a new local variable name that the generated word will be bound to, and the right hand side is an interpolate form describing what the new word should be named. The middle operator specifies whether or not the word should already exist (IS) or if it should be defined (DEFINED). So the definition of define-tuple-array proceeds as follows:
CLASS IS ${CLASS}

CLASS-array DEFINES-CLASS ${CLASS}-array
CLASS-array? IS ${CLASS-array}?

<CLASS-array> DEFINES <${CLASS}-array>
>CLASS-array DEFINES >${CLASS}-array

Next, a functor definition calls WHERE, which switches the parser to reading the functor body.

After that, what follows is a series of forms which look like word definitions, but really they parse as code which defines words at the time that the functor is executed. Both the word names and bodies may reference local variables defined in the first part of the functor:
WHERE

TUPLE: CLASS-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <CLASS-array> ( length -- tuple-array )
[ \ CLASS [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ CLASS-array boa ; inline

M: CLASS-array length length>> ;

M: CLASS-array nth-unsafe tuple-slice \ CLASS read-tuple ;

M: CLASS-array set-nth-unsafe tuple-slice \ CLASS write-tuple ;

M: CLASS-array new-sequence drop <CLASS-array> ;

: >CLASS-array ( seq -- tuple-array ) 0 <CLASS-array> clone-like ;

M: CLASS-array like drop dup CLASS-array? [ >CLASS-array ] unless ;

INSTANCE: CLASS-array sequence

So every time someone calls define-tuple-array (most likely with the TUPLE-ARRAY: parsing word) a new class is defined, together with a constructor word and some methods.

Finally, we end the functor with:
;FUNCTOR

To illustrate, this means that the following:
TUPLE-ARRAY: point

Is equivalent to the following:
TUPLE: point-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <point-array> ( length -- tuple-array )
[ \ point [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ point-array boa ; inline

M: point-array length length>> ;

M: point-array nth-unsafe tuple-slice \ point read-tuple ;

M: point-array set-nth-unsafe tuple-slice \ point write-tuple ;

M: point-array new-sequence drop <point-array> ;

: >point-array ( seq -- tuple-array ) 0 <point-array> clone-like ;

M: point-array like drop dup point-array? [ >point-array ] unless ;

INSTANCE: point-array sequence

Now imagine writing all of the above code out every time you want a specialized tuple array; that would amount to boilerplate. Of course, at this point, it still looks like runtime reflection is happening, because we're passing the point class word to the read-tuple and write-tuple words. However, what actually happens is that these words get inlined, then the fact that the class parameter is a constant triggers a series of macro expansions and constant folding optimizations that boil the code down to an efficient, specialized routine for packing and unpacking points from the array. If you're interested in the gory details, take a look at the tuple arrays source code.

So the space savings are clearly nice to have, but what about performance? Here is a program that uses polymorphic arrays:
TUPLE: point { x float } { y float } ;

: polymorphic-array-benchmark ( -- )
5000 [ point new ] replicate
1000 [
[
[ 1+ ] change-x
[ 1- ] change-y
] map
] times drop ;

Here is the same program using tuple arrays:
TUPLE: point { x float } { y float } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
5000 <point-array>
1000 [
[
[ 1+ ] change-x
[ 1- ] change-y
] map
] times drop ;

On my MacBook Pro, the first version using polymorphic arrays runs in 0.39 seconds, the second version using tuple arrays runs in 0.91 seconds. Clearly, there is some overhead from copying the objects in and out of the array.

Now, let's try a slightly different program. Instead of mutating the elements of the array, let's instead extract the slots, modify them, and make a new tuple. Here is a version using polymorphic arrays:
TUPLE: point { x float read-only } { y float read-only } ;

: polymorphic-array-benchmark ( -- )
5000 [ point new ] replicate
1000 [
[
[ x>> 1+ ] [ y>> 1- ] bi <point>
] map
] times drop ;

This version runs in 0.94 seconds; the additional object allocation induces a 3x overhead over mutating the points in place. Now, let's try the same thing with tuple arrays:
TUPLE: point { x float read-only } { y float read-only } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
5000 <point-array>
1000 [
[
[ x>> 1+ ] [ y>> 1- ] bi <point>
] map
] times drop ;

For this variant I'm getting runtimes of 0.59 seconds, which is faster than the same code with polymorphic arrays. The reason for the speed boost is that the Factor compiler's escape analysis pass kicks in, eliminating all tuple allocation from within the loop completely.

So to summarize, in order from fastest to slowest,
  • Polymorphic arrays with in-place mutation
  • Tuple arrays with read-only tuples
  • Polymorphic arrays with read-only tuples
  • Tuple arrays with in-place mutation

Finally, a note about benchmarking methodology. Because garbage collection can add an element of unpredictability to benchmarks, I force run the GC to run first:
( scratchpad ) gc [ tuple-array-benchmark ] time
== Running time ==

0.593989 seconds

The combination of a high-level language with efficient abstractions that the compiler knows how to optimize is very powerful. Factor occupies a middle ground between the dynamic languages and systems languages. I want Factor to be a language that you can write an entire application in, even the performance-critical parts. Often you hear that "performance doesn't matter", but outside of a few I/O-bound problem domains, this is false (and even there, you want to heavily optimize your I/O and text processing routines). Very often, people cite some variation of the "80/20 performance rule" (20% of your program runs 80% of the time; adjust percentages to suit your strawman). Of course, the idea is that you can write the 80% in a high level language, dropping down to C for the performance-critical 20%. I think this reasoning is flawed, for two reasons:
  • As my friend Cameron Zwarich likes to point out, very little justification is given for the 80/20 rule itself. For many programs, the performance profile is rather flat, and every indirection, every unnecessary runtime check, and every unnecessary runtime memory allocation adds up to a non-trivial amount of overhead. The "performance bottleneck" myth originated in a survey of Fortran programs which did numerical calculations. For numerics, it is indeed true that most of the runtime is spent in a handful of inner loops; whether or not this is true for ordinary programs is far less clear.
  • Even if runtime really is dominated by a small number of routines, chances are these routines are algorithmically complex, hard to write, and nice to experiment with interactively; precisely the domain where high-level languages with advanced abstractions shine.

I think it's about time dynamic language advocates stopped citing the same old myths to justify poor implementation techniques. Their energy would be better spent researching compiler optimizations and garbage collection techniques which they can apply in their favorite language implementations instead. Hats off to the V8, SquirrelFish, Unladen Swallow and SBCL projects for taking performance seriously.

Friday, May 08, 2009

Factor VM ported to C++

Most of Factor is implemented in Factor (the parser, data structures, optimizing compiler, input/output, the UI toolkit, etc) however a few bits and pieces are not completely self hosting. This includes the garbage collector, non-optimizing compiler (used for bootstrap), and various runtime services. This code comprises what I call the "Factor VM". Even though it is not a Virtual Machine in the classical sense, since there's no interpreter or bytecode format, it serves a similar role to the VM of scripting languages.

Ever since Factor lost its JVM dependency in mid-2004, the VM has been written in C with some GNU extensions. Over time, I've noticed that a few things in the VM source code seem to be hard to express and error-prone in straight C. I've been thinking about switching to C++ for a long time, and now I've finally decided to bite the bullet and do it. I'm pleased with the result, and other than a longer compile time for the VM, I don't have any complaints so far.

I started off by first getting the existing C code to compile with the GNU C++ compiler. Since C++ is almost, but not quite, a super set of C, I had to make a few changes to the source:
  • Renaming a few identifiers (I had local variables named class, template and new in some places)
  • Fixing some type safety issues (C will implicitly cast void* to any other pointer type, whereas C++ will not)
  • Moving global variable declarations to source files, and changing the declarations in the headers to be extern
  • Making a few functions extern "C" since they are called from code generated by the Factor compiler

Once this was done and everything was working, I started the process of refactoring my C pain points using C++ language features and idioms. This process is far from complete; at the end of the blog post I will outline what remains to be done.

Miscellaneous improvements


While the Factor VM is pretty small, around 13,000 lines of code right now, being able to use C++ namespaces is still nice since I've had name clashes with system C headers in the past. Now everything is wrapped in a factor namespace, so this should occur less in the future. Also, I changed a bunch of #define to inline functions and constants.

Tagged pointers


Factor is dynamically typed. While the optimizing compiler infers types and eliminates type checks sometimes, this generates machine code directly and mostly sidesteps the VM entirely. As far as the VM is concerned, it must be possible to determine the type of a value from its runtime representation. There are many approaches to doing this; the one I've been using from the start is "tagged pointers".

Here is how it works: in the Factor VM, objects are always allocated on 8-byte boundaries in the data heap. This means every address is a multiple of 8, or equivalently, the least significant 3 bits of a raw address are zero. I take advantage of this to store some type information in these bits. A related trick that the Factor VM does is to reserve a tag for small integers that fit in a pointer; so a 29-bit integer (on 32-bit platforms) or a 61-bit integer (on 64-bit platforms) does not need to be heap-allocated.

With 3 tag bits, we get a total of 8 unique tags. Factor has more than 8 data types, though, and indeed the user can define their own data types too. So there are two tags used to denote that the data type is actually stored in the object's header. One of these is used for built-in VM types that don't have a unique tag number, another one is used for user-defined tuple class instances.

Since C and C++ do not have native support for tagged pointers, the VM needs a mechanism to convert tagged pointers to untagged pointers, and vice versa. When converting a tagged pointer to an untagged pointer, a type check is performed.

In the C VM, I had code like the following:
#define TAG_MASK 7
#define UNTAG(tagged) (tagged & ~TAG_MASK)

typedef struct { ... } F_ARRAY;

#define ARRAY_TYPE 2

F_ARRAY *untag_array(CELL tagged)
{
type_check(ARRAY_TYPE,tagged);
return (F_ARRAY *)UNTAG(tagged);
}

Here, CELL is typedef'd to an unsigned integer as wide as void*, and UNTAG is a macro which untags a pointer by masking off the low 3 bits (7 decimal is 111 binary).

To convert untagged pointers to tagged form, I used code like the following:
#define RETAG(untagged,tag) (((CELL)(untagged) & ~TAG_MASK) | (tag))

CELL tag_array(F_ARRAY *untagged)
{
return RETAG(untagged,ARRAY_TYPE);
}

For hi-tag types, such as strings, I used a single tag_object() function:
CELL tag_object(void *untagged)
{
return RETAG(untagged,OBJECT_TYPE);
}

The OBJECT_TYPE constant was misleadingly named, since everything in Factor is an object. However, in the VM it means "an object with type info in its header".

The old C code for tagging and untagging pointers was mostly adequate, however it suffered from a few issues. First, there was a lot of boilerplate; each tag type has an untag/tag function pair, and each hi-tag type had an untag function. Furthermore, every time I changed a hi-tag type into a tag type, I had to find usages of tag_object() and change them appropriately. Over time I developed several preprocessor macros to clean up this type of stuff but I was never happy with it.

In C++, the whole thing came out much cleaner. The UNTAG and RETAG macros are still around (although I may change them to inline functions at some point). However, for tagging and untagging pointers, I use a single set of template functions. To tag an untagged pointer:
template <typename T> cell tag(T *value)
{
return RETAG(value,tag_for(T::type_number));
}

That's it; one function for all data types. It is used like so:
array *a = ...;
cell c = tag(a);

How does it work? The F_ARRAY struct is now the array class, and it has a static variable type_number:
struct array : public object
{
static const cell type_number = 2;
...
}

How do hi-tag pointers get handled? Well, their type numbers are all greater than or equal to 8. The tag_for() function checks for this, and returns the object tag number of it is the case. Since it is an inline function, the computation is constant-folded at compile time, and the generated code is as efficient as the old C version, only more type-safe and with less boilerplate.

For untagging tagged pointers, I use a more complicated scheme. I have a templated "smart pointer" class representing a tagged pointer with an associated data type. Here is a simplified version of the class; this omits the type checking logic and a bunch of other utility methods; if you want the gory details, the full source is in tagged.hpp.
template <typename T>
struct tagged
{
cell value_;
T *untagged() const { return (T *)(UNTAG(value_)); }
T *operator->() const { return untagged(); }
explicit tagged(T *untagged) : value_(factor::tag(untagged)) { }
}

The overload of operator-> means that tagged pointers can be used like ordinary pointers on some cases without needing to be untagged first. Otherwise, a pointer stored in a cell can be untagged like so:
array *a = tagged(c).untagged()

I made a utility function to encapsulate this type of thing, for when I just want to untag a value once and not pass it around as a smart pointer:
template  T *untag(cell value)
{
return tagged(value).untagged();
}

Now I can write
array *a = untag<array>(c)

This compiles down to the same machine code as the C version but it is more type-safe and less verbose.

Registration of local variables with the garbage collector


Any object allocation can trigger a GC, and the GC must be able to identify all live values at that point. For compiled Factor code, this is pretty easy. The GC needs to scan a few globals, the data and retain stacks, as well as the call stack; call frames for Factor code have a known format. The compiler does store objects in registers, but it moves them to the stack before calling into the GC.

However, the VM code itself uses the garbage collector to allocate its own internal state. This makes the VM code more robust since it doesn't need to worry about when to deallocate memory, but it presents a problem: if a C local variable points into the GC heap, how does the GC know that the value is live, and how does the GC know to update the pointer of the value got moved (Factor's GC is a generational semi-space collector so objects move around a lot). Between function calls, compiled C code saves pointers in the call frame. However, since I have no control over the machine code gcc generates, I cannot force call stack frames into a certain format that the GC can understand, like I do with compiled Factor code.

One solution is conservative garbage collection. The garbage collector can just pessimistically assume that any integer variable on the call stack might point into the GC heap. There are two problems with this approach. The first is if you have a bit pattern that happens to look like a pointer, but is really just a random number, you might keep a value live more than it is really needed. From what I've heard, this is mostly just a theoretical concern. A more pressing issue is that since you do not know what is a real pointer and what isn't, you cannot move objects that might be referenced from the call stack. This can lead to fragmentation and more expensive allocation routines.

The technique I've been using until now in the C VM code was to have a pair of macros, REGISTER_ROOT and UNREGISTER_ROOT. These push and pop the address of a local variable on a "GC root stack". So by traversing the GC root stack, the garbage collector knows which call stack locations refer to objects in the GC heap, and which are just integers. The disadvantage of this approach is that pairing root registration was verbose and error-prone, and forgetting to regsiter a root before calling a function which may potentially allocate memory would lead to hard-to-discover bugs that would be triggered rarely and result in memory corruption.

Using C++ language features, I instead implemented a "smart pointer" gc_root class. It subclasses the tagged smart pointer I described above, with the additional functionality in the constructor and destructor to register and unregister a pointer to the wrapped pointer as a GC root. Together with the overloaded operator-> this makes for much clearer and safer code. By updating existing code to use this new smart pointer I found a few places in the old C code where I had forgotten to register roots. No doubt one or two users saw Factor crash because of these. You can look at the source code for the GC root abstraction.

Object-oriented abstractions


Most functions in the VM are still top-level C-style functions, however in a few places I've found good reason to use methods and even inheritance. I've avoided virtual methods so far. One place where inheritance has really helped clean up some crufty old C code is in what I call the JIT. This is a simple native code generator that the VM uses to fill in the gaps for the optimizing compiler that's written in Factor. The JIT is used to generate machine code in the following situations:

The old design was built out of C macros and code duplication. Now there is a jit class with various methods to emit machine code. Subclasses of jit, such as quotation_jit and inline_cache_jit, add additional functionality. The constructor and destructor take care of registering and unregistering the objects used by the JIT with the GC.

Functionality moved out of the VM


Since I prefer programming in Factor over low-level languages such as C and C++, I try to keep the VM small. A couple of things were in the VM for historical reasons only. The complex number and rational number data types, for instance, were defined in the VM, even though operations on them were in Factor code. This is because the number tower predated user-defined tuple types by a year or so. Nowadays there is no reason not to use tuples to define these types, so I made the change, simplifying the VM. Another piece of functionality that no longer had a good reason to be in the VM was some vestigial code for converting Factor's Unicode string type to and from Latin-1 and UCS-2 encodings. Ever since Daniel Ehrenberg implemented the io.encodings library in Factor, most string encoding and decoding was done in library code, and a lot more encodings were implemented; UTF8, Shift-JIS, MacRoman, EBCDIC, etc. Only a couple of things still relied on the VM's broken built-in string encoding. They're now using the library just like everything else.

Code heap compaction


Factor allocates compiled machine code in their own heap, separate from the data heap. A mark/sweep/compact garbage collector manages the code heap. When performing a compaction, the code heap garbage collector needs to maintain an association from a code block's old address to its new address. Previously this would be stored in the code block's header, wasting 4 bytes (or 8 bytes on 64-bit platforms) per code block. Now I use std::tr1::unordered_map to do this. Of course I could've written a hashtable in C, or used an existing implementation, but using STL is more convenient.

Future improvements


I don't spend much time working on the Factor VM, since most of the action is in the compiler, library, and UI, which are all written in Factor and not C++. However, at some point I'm going to do more work on the garbage collector, and I will do some more cleanups at that time:
  • Eliminate usages of GNU C extensions. I want to be able to compile the Factor VM with other compilers, such as Microsoft's Visual Studio, and LLVM's Clang.
  • Eliminating global variables
  • Using new/delete instead of malloc() and free() for the small set of data that the VM allocates in unmanaged memory
  • Various other cleanups, such as making better use of method invocation and templates
  • There's some code duplication in the garbage collector I hope to address with C++ templates


Conclusion


A lot of these cleanups and improvements could have been achieved in C, with clever design and various preprocessor macro tricks. However, I think C++ offers a better set of abstractions for the problem domain that the Factor VM lives in, and since I only use the language features that have no runtime penalty (I'm not using C++ exceptions, virtual methods, or RTTI) there's no performance decrease either. I'm very happy with how the code cleanup turned out. Making the VM easier to understand and maintain will help with implementing more advanced garbage collection algorithms, among other things.

Saturday, April 18, 2009

Recent UI improvements

Over the last few months I've done a lot of work on Factor's UI toolkit, and the developer tools built on top. Here is an overview of what's changed.

Code cleanup


I've been working on the UI since 2005, and it has accumulated a lot of cruft since then. Language features have come and gone and not enough refactoring was done over time. Last year I rewrote the HTTP server and compiler using the latest language features and the result was very clean code. The UI is now written in the same style.

Improved font support


I've blogged about this:

All UI gadgets that render text now use the appropriate platform-specific mechanism for rendering text. The ui.text vocabulary provides the cross-platform abstraction layer. The fonts vocabulary defines data types for fonts and font metrics. This replaces ad-hoc loosely-typed "font specifiers" that the UI used formerly, where a font was just a triple; { "sans-serif" plain 12 } for instance. Font metrics are now supported in a cross-platform manner too. All relevant gadgets now do baseline alignment.

Image support


The UI now supports an easy API for displaying images. The ui.images vocabulary defines some words which can be called from your gadget's draw-gadget* method. The ui.gadgets.icons vocabulary defines a simple gadget that renders an image and nothing else. This is all built on top of the images vocabulary, which implements BMP and TIFF loaders in pure Factor. Hopefully we'll get PNG and JPEG at some point too.

Editor gadget improvements


The ui.gadgets.editors vocabulary supports some new features:
  • Undo and redo (Control-z, Control-Shift-z)
  • Join lines (Control-j)
  • Character and word navigation now uses unicode.breaks

Table gadget replaces list gadget


The ui.gadgets.tables vocabulary implements a new gadget which is a replacement for the old list gadget. It's functionality is a superset of what lists offered. Tables are used extensively by the new developer tools to display completion popups and such. Unlike lists, which would create a gadget for every row of data, tables do not have any children and render rows directly. This reduces memory consumption. Tables take the list of rows from an underlying model, and place the currently selected row in another model. The latter model can then be wrapped in an arrow model to compute a new list of rows, which can then be the model for another table. This way, multiple tables can be chained together for a 'drill down navigation'-style of interaction with very little code.

Scroller gadget improvements


Scroller gadgets, defined in the ui.gadgets.scrollers vocabulary, now provide a protocol that their children can implement. This protocol has two generic words:
  • viewport-pref-dim - this allows the child to specify the preferred dimensions of the scroller surrounding the gadget. Tables and editors implement this generic word, allowing you to set various slots, such as min-rows and max-rows, which control the size of the scroller surrounding the gadget in character units.
  • viewport-column-header - constructs a gadget which is displayed at the top of the scroller. This allows the table gadget to display column titles which are always visible, regardless of the vertical scroll position.


Frame layout improvements


The ui.gadgets.frames vocabulary is more flexible now. Instead of limiting frame layouts to a 3x3 grid with the center item stretched to fill available space, the grid can be of any size, and an arbitrary cell can be designated to take up available space.

Models API changes


The models vocabulary has undergone some cleanups. To avoid confusion with sequence functionality, filter models are now called arrow models. Compose models are now called product models, since really they represent a cartesian product of functions, not a composition of anything at all. The new models.arrow.smart vocabulary wrap an arrow around a product whose arity is the arity of the arrow's quotation.

Integration between event loop and I/O


Until Factor supports real native threads, any FFI call blocks all Factor threads until the call completes. I/O is done with non-blocking APIs on Unix (epoll, kqueue) and Windows (IO completion ports), so socket operations can proceed concurrently. However, until recently, the UI would have to poll for events because of this, because blocking calls to wait for events would prevent I/O and other Factor threads from running. This meant that even if you weren't doing anything in the Factor UI, it would use a little bit of CPU - 5% to 25%. This problem has been fixed. On Mac OS X, I use the CFFileDescriptor API to wait on a kqueue and GUI events at the same time; you can provide a callback which is invoked when file descriptors are ready for I/O, and call [NSApplication run] as usual. On X11, I use XConnectionNumber() to get a file descriptor out of a Display and call wait-for-fd to wait for events to arrive while also allowing other Factor threads to run. This solves the CPU usage problem and allows Factor threads to run concurrently with the event loop waiting for events.

Improvements to tooling


The main focus of my recent UI work, however, has been on improving UI developer tools. I plan on making a screencast to highlight the new improvements soon.

Wednesday, April 08, 2009

OpenGL textures and the power-of-two size restriction

Prior to version 2.0, the OpenGL specification required that texture dimensions be powers of two. This simplifies the implementation of texture mapping, because converting floating point texture co-ordinates (which are in the range 0..1) to texel coordinates is trivial; multiplying a floating point number by a power of two is essentially just adding a number to the exponent. Eventually, more capable drivers and graphics cards came along, and introduced the ability to use non-power-of-two texture dimensions. To signal this capability to GL applications, they report supporting the GL_ARGB_texture_non_power_of_two extension. OpenGL 2.0 implementations are required to support this extension.

In practice, the only major OpenGL implementations which don't provide this extension are older X11 drivers, and the Microsoft Windows software renderer, which is a very bare-bones OpenGL 1.1 implementation.

There is a trick for padding textures up to a power of two for implementations which don't support this extension, however it doesn't seem to work everywhere either. Instead of manipulating the bitmap in software before passing it onto a call to glTexImage2D(), it is permissible as of OpenGL 1.1 to pass a bitmap pointer of NULL. This creates a texture with uninitialized content. The glTexSubImage2D() function is used to fill it portions of the new texture. In particular, glTexSubImage2D() places no restriction on the width and height, even if GL_ARGB_texture_non_power_of_two is not supported.

The above trick works with the Windows software renderer. On the other hand, previous-generation MacBooks with Intel graphics suffer from a driver bug which results in artifacts appearing when this feature is used to render scaled textures. However, all OpenGL implementations in recent Mac OS X releases support non-power-of-2 textures, so on this platform, the workaround can be avoided entirely anyway.

In Factor, the opengl.capabilities vocabulary provides some utility words to check for extensions. For example, a common operation is checking for either a specific OpenGL version, or an extension (new versions of the GL spec frequently absorb existing extensions):
"2.0" { "GL_ARB_texture_non_power_of_two" } has-gl-version-or-extensions?

The gl-extensions word outputs a sequence of all supported extensions. Here is the output from the Mesa software renderer on Linux.

I replaced my old code in opengl.textures which padded bitmap image objects out to powers of two using sequence manipulation words, with the new way using texture sub-images instead. If the extension is present, no padding is performed, ensuring correct behavior on Mac OS X. This means that any code using opengl.textures, such as the UI's text rendering and image support, should now spend less CPU time running Factor code.

Factor's OpenGL binding has been in development for 4 years and has seen contribution from 4 developers. For a demo of what it can do, try "spheres" run in the UI listener. You will need a video driver that supports OpenGL 2.0 or GL_ARB_shader_objects.