Thursday, May 21, 2009

Unboxed tuple arrays in Factor

One difference between high level languages and systems languages is that high level languages typically hide the notion of a pointer, and by extension, value versus reference semantics, from the programmer. Systems languages make this explicit, and the programmer can make a choice whether or not to pass an object by value or by reference.

In particular, this level of control can lead to significant space savings for large arrays of homogeneous data. If the elements of an array are themselves not shared by other structures, then storing values directly in the array will, at the very least, save on the space usage of a pointer, together with an object header.

I don't believe there is any reason for high-level languages to not offer more fine-grained control over heap storage. Factor is dynamically typed but it offers both arrays of unboxed numeric values, and arrays of tuples, which I will describe below. In some cases, using these special-purpose data types can offer significant performance and space gains over polymorphic arrays.

Consider a Factor tuple definition:
TUPLE: point x y ;

We can find out how much space points use in memory -- this is on a x86-64 machine:
( scratchpad ) point new size .
32

The x and y slots are as wide as a machine pointer, so together they use 16 bytes. This means that there is an additional 16-byte overhead for every instance. Indeed, Factor tuples store a header and a pointer to a layout object (shared by all instances of that class) in addition to the slot data itself. While this compares favorably to other dynamic languages (see this blog post for instance; Ruby in particular has really terrible object representation), it still adds up to a significant overhead when you have a large collection of points.

Suppose you have an array of N points. Each array element is a pointer (8 bytes) to a point object (32 bytes). Assuming the points are not referenced from any other collection, that's 40*N bytes for the entire collection. However, there is only 16*N bytes of real data, namely the x and y slots of every element, so we're wasting more than 50% of heap usage here, not to mention increasing GC overhead by allocating tons of small objects.

This overhead can be avoided, however, using Factor's tuple-arrays library. Tuple arrays implement the same sequence protocol as arrays, but the semantics differ. Instead of storing references to objects, they store object slots directly within the array. There are two main restrictions that result from this:
  • Tuple arrays are homogeneous; all elements are of the exact same type. This rules out setting some elements to f to represent missing data, or having elements which are instances of different subclasses of the same base class.
  • Tuple arrays implement value semantics. If you get an element out, and mutate it, you have to store it back into the array to update the original value in the array.

So how do tuple arrays work? The original implementation, by Daniel Ehrenberg defined a single tuple-array class where the constructor took a class word as a parameter, and reflection APIs were used to implement getting and setting elements. While this had all the space savings described above, constructing and deconstructing tuples via reflection is not the fastest thing in the world; even the most efficient possible implementation would require some indirection and type checking in order to work.

A better approach is to generate a new sequence type for each element type. This is what the current implementation does. Because all of Factor's sequence operations are written in terms of a sequence protocol, these generated classes are for the most part transparent to the programmer, and there is no added inconvenience over the previous approach. To use it, you invoke a parsing word with the element type class you want to use:
TUPLE-ARRAY: point

Assuming a class named point was previously defined, this generates a point-array class with associated methods.

Briefly, the implementation works as follows. The underlying storage of a tuple array is an ordinary array. If the tuple class has N slots, then every group of N elements in the underlying array is a single element in the tuple array. The nth and set-nth methods, which are generated once for every tuple array, read and write a group of N elements and package them into a tuple.

Now let's dive into the guts of the implementation. The parsing word is defined as follows:
SYNTAX: TUPLE-ARRAY: scan-word define-tuple-array ;

So it delegates all the hard work to a define-tuple-array word. This word is a functor, so tuple arrays work the same as specialized numeric arrays. Functors are similar to C++ templates in spirit, but really they are implemented entirely as a library, and it's all just syntax sugar for parsing word machinery (which is a lot more powerful than C++ templates). They're named "functors" to annoy category theory fanboys and language purists. A functor definition begins with FUNCTOR::
FUNCTOR: define-tuple-array ( CLASS -- )

A functor is essentially the same as a word defined with ::, but with additional syntax sugar. So CLASS here is a local variable available in the body of the functor. I use the convention of uppercase names for functor parameters. What follows FUNCTOR: is a list of clauses which create new words at parse time. The left hand side of each clause is a new local variable name that the generated word will be bound to, and the right hand side is an interpolate form describing what the new word should be named. The middle operator specifies whether or not the word should already exist (IS) or if it should be defined (DEFINED). So the definition of define-tuple-array proceeds as follows:
CLASS IS ${CLASS}

CLASS-array DEFINES-CLASS ${CLASS}-array
CLASS-array? IS ${CLASS-array}?

<CLASS-array> DEFINES <${CLASS}-array>
>CLASS-array DEFINES >${CLASS}-array

Next, a functor definition calls WHERE, which switches the parser to reading the functor body.

After that, what follows is a series of forms which look like word definitions, but really they parse as code which defines words at the time that the functor is executed. Both the word names and bodies may reference local variables defined in the first part of the functor:
WHERE

TUPLE: CLASS-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <CLASS-array> ( length -- tuple-array )
[ \ CLASS [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ CLASS-array boa ; inline

M: CLASS-array length length>> ;

M: CLASS-array nth-unsafe tuple-slice \ CLASS read-tuple ;

M: CLASS-array set-nth-unsafe tuple-slice \ CLASS write-tuple ;

M: CLASS-array new-sequence drop <CLASS-array> ;

: >CLASS-array ( seq -- tuple-array ) 0 <CLASS-array> clone-like ;

M: CLASS-array like drop dup CLASS-array? [ >CLASS-array ] unless ;

INSTANCE: CLASS-array sequence

So every time someone calls define-tuple-array (most likely with the TUPLE-ARRAY: parsing word) a new class is defined, together with a constructor word and some methods.

Finally, we end the functor with:
;FUNCTOR

To illustrate, this means that the following:
TUPLE-ARRAY: point

Is equivalent to the following:
TUPLE: point-array
{ seq array read-only }
{ n array-capacity read-only }
{ length array-capacity read-only } ;

: <point-array> ( length -- tuple-array )
[ \ point [ tuple-prototype <repetition> concat ] [ tuple-arity ] bi ] keep
\ point-array boa ; inline

M: point-array length length>> ;

M: point-array nth-unsafe tuple-slice \ point read-tuple ;

M: point-array set-nth-unsafe tuple-slice \ point write-tuple ;

M: point-array new-sequence drop <point-array> ;

: >point-array ( seq -- tuple-array ) 0 <point-array> clone-like ;

M: point-array like drop dup point-array? [ >point-array ] unless ;

INSTANCE: point-array sequence

Now imagine writing all of the above code out every time you want a specialized tuple array; that would amount to boilerplate. Of course, at this point, it still looks like runtime reflection is happening, because we're passing the point class word to the read-tuple and write-tuple words. However, what actually happens is that these words get inlined, then the fact that the class parameter is a constant triggers a series of macro expansions and constant folding optimizations that boil the code down to an efficient, specialized routine for packing and unpacking points from the array. If you're interested in the gory details, take a look at the tuple arrays source code.

So the space savings are clearly nice to have, but what about performance? Here is a program that uses polymorphic arrays:
TUPLE: point { x float } { y float } ;

: polymorphic-array-benchmark ( -- )
5000 [ point new ] replicate
1000 [
[
[ 1+ ] change-x
[ 1- ] change-y
] map
] times drop ;

Here is the same program using tuple arrays:
TUPLE: point { x float } { y float } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
5000 <point-array>
1000 [
[
[ 1+ ] change-x
[ 1- ] change-y
] map
] times drop ;

On my MacBook Pro, the first version using polymorphic arrays runs in 0.39 seconds, the second version using tuple arrays runs in 0.91 seconds. Clearly, there is some overhead from copying the objects in and out of the array.

Now, let's try a slightly different program. Instead of mutating the elements of the array, let's instead extract the slots, modify them, and make a new tuple. Here is a version using polymorphic arrays:
TUPLE: point { x float read-only } { y float read-only } ;

: polymorphic-array-benchmark ( -- )
5000 [ point new ] replicate
1000 [
[
[ x>> 1+ ] [ y>> 1- ] bi <point>
] map
] times drop ;

This version runs in 0.94 seconds; the additional object allocation induces a 3x overhead over mutating the points in place. Now, let's try the same thing with tuple arrays:
TUPLE: point { x float read-only } { y float read-only } ;

TUPLE-ARRAY: point

: tuple-array-benchmark ( -- )
5000 <point-array>
1000 [
[
[ x>> 1+ ] [ y>> 1- ] bi <point>
] map
] times drop ;

For this variant I'm getting runtimes of 0.59 seconds, which is faster than the same code with polymorphic arrays. The reason for the speed boost is that the Factor compiler's escape analysis pass kicks in, eliminating all tuple allocation from within the loop completely.

So to summarize, in order from fastest to slowest,
  • Polymorphic arrays with in-place mutation
  • Tuple arrays with read-only tuples
  • Polymorphic arrays with read-only tuples
  • Tuple arrays with in-place mutation

Finally, a note about benchmarking methodology. Because garbage collection can add an element of unpredictability to benchmarks, I force run the GC to run first:
( scratchpad ) gc [ tuple-array-benchmark ] time
== Running time ==

0.593989 seconds

The combination of a high-level language with efficient abstractions that the compiler knows how to optimize is very powerful. Factor occupies a middle ground between the dynamic languages and systems languages. I want Factor to be a language that you can write an entire application in, even the performance-critical parts. Often you hear that "performance doesn't matter", but outside of a few I/O-bound problem domains, this is false (and even there, you want to heavily optimize your I/O and text processing routines). Very often, people cite some variation of the "80/20 performance rule" (20% of your program runs 80% of the time; adjust percentages to suit your strawman). Of course, the idea is that you can write the 80% in a high level language, dropping down to C for the performance-critical 20%. I think this reasoning is flawed, for two reasons:
  • As my friend Cameron Zwarich likes to point out, very little justification is given for the 80/20 rule itself. For many programs, the performance profile is rather flat, and every indirection, every unnecessary runtime check, and every unnecessary runtime memory allocation adds up to a non-trivial amount of overhead. The "performance bottleneck" myth originated in a survey of Fortran programs which did numerical calculations. For numerics, it is indeed true that most of the runtime is spent in a handful of inner loops; whether or not this is true for ordinary programs is far less clear.
  • Even if runtime really is dominated by a small number of routines, chances are these routines are algorithmically complex, hard to write, and nice to experiment with interactively; precisely the domain where high-level languages with advanced abstractions shine.

I think it's about time dynamic language advocates stopped citing the same old myths to justify poor implementation techniques. Their energy would be better spent researching compiler optimizations and garbage collection techniques which they can apply in their favorite language implementations instead. Hats off to the V8, SquirrelFish, Unladen Swallow and SBCL projects for taking performance seriously.

8 comments:

Wolf550e said...

I don't know Factor (or Forth) to really understand the implementation, but why isn't the trivial modify-in-place benchmark faster with an array of structs than with an array of pointers to structs?

There is one more indirection in the array of pointers case, right?

To modify a slot you modify a float (4 aligned bytes) at a compile-time-known offset from the object's address. The address is either the pointer from the array in one case or a known offset in the array in the second case. Right?

Mainly, why isn't this as fast as C if you specify types (float) and the objects (tuples) are defined and sealed (unlike ruby's braindead thing).

Slava Pestov said...

Wolf: in Factor's tuple arrays, when you get an element out, a copy is made; you don't get a pointer into the middle of the data array. This explains why tuple arrays have more runtime overhead than polymorphic arrays. In the benchmark using read-only tuples, the compiler's escape analysis eliminates the unnecessary copy, which is why that case is faster.

Wolf550e said...

Thanks for the explanation!

C# has structs (as opposed from classes) that have value semantics. I think an array of those is like an array of structs in C, i.e. you can modify a slot in place instead of allocating a temporary, initializing it with a copy constructor, modifying a slot and copying the whole thing back into the array.

Is the temporary object at least stack allocated?

Why can't (or won't) you make a version that does allow you to get a pointer to the slot inside the array? At first I thought of atomicity but that's not it I think.

Slava Pestov said...

Wolf: because it would complicate the garbage collector. I'm going to keep improving the optimizer so that the copies can be eliminated in more cases, though.

riffraff said...

you said ruby has a terrible representation of objects, but isn't ruby1.9's representation pretty similar to factors? (header+pointer to per-class layout object)

Great article as usual :)

zimbatm said...

Nice article Slava.

I spotted a typo in the second benchmark example,
there should be a < point-array> after the 5000

Dunan said...

I think, one day factor will be fast enough to became fully self-hosted.

Slava Pestov said...

riffraff: I base my claim on how Ruby 1.8 implements objects; they're just hashtables.

zimbatm: thanks, I forgot to escape a <, fixed