Saturday, August 29, 2009

Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster

My struct arrays benchmark generated a fair amount of discussion on reddit, with some people disputing the benchmark's validity. Certainly I'm not claiming that Factor is faster than Java HotSpot in general (on most tasks it is slower, sometimes much more so), however I think the benchmark legitimately demonstrates the performance advantage of value semantics over reference semantics.

A few people pointed out that the Java version of the benchmark was spending a lot of time in trigonometric functions, and that Java computes sin and cosine "by hand" rather than using x87 FPU instructions. However, the same is also true of the Factor implementation, its just that none of the Java advocates bothered to check. I don't use x87 instructions either, and call into libc for sin and cos, just like Java. Indeed, Factor's sin and cos are even more heavyweight than Java's, because they also support complex numbers; Factor's compiler converts them to their real-valued equivalents if it can prove the inputs are real.

Despite this, I modified the benchmark to not call trig functions, and instead just initialize each point as (n+1,n+2,(n+3)*(n+3)/2). Factor is still faster by roughly the same margin as before:
Java829ms (best run out of 8)

A few people also mentioned that by only running the test 8 times I wasn't giving HotSpot enough time to "warm up". However, the benchmark does not make any polymorphic calls in the inner loops, so there's really no "warm up" needed at all; and indeed I got the best time on the third iteration.

I improved Factor's code generation for trigonometric function calls. Trig function calls are treated like instructions by the low-level optimizer now, which means they participate in value numbering and do not split basic blocks. Instead, the register allocator spills all live registers at the call site at the very end of code generation. This strategy does not work in general for all FFI calls, because the case where the FFI calls invokes a Factor callback needs additional runtime support. However, neither sin nor cos do this. After implementing this, I noticed that my struct-arrays benchmark was calling sin() twice on the same input, and value numbering was folding the second call away. Neat optimization to get "for free".

This code generation change improves performance of both the struct-arrays and partial-sums benchmarks:

Here is the disassembly for benchmark.struct-arrays with this optimization performed, and this is the patch.

For what its worth, Java HotSpot runs the partial-sums benchmark in 1293ms. Hopefully HotSpot's trig function call performance will receive some attention at some point.

Friday, August 28, 2009

Performance comparison between Factor and Java on a contrived benchmark

Recently, Joe Groff has been working on struct classes, with the aim of completely replacing Factor's existing support for C structure types. The interesting thing about Joe's struct classes is that unlike the clunky old FFI struct support which operated on byte arrays, struct classes really are Factor classes; instances look and feel like Factor objects, there's literal syntax and prettyprinter support, and structure fields are accessed using accessor words that look exactly like Factor tuple slot accessors.

So whereas the old C structure support didn't have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.

What makes struct classes even more useful in performance-sensitive code is the related struct-arrays vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a Factor sequence. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, ...) to be used on scalar data with a very efficient in-memory representation.

It is interesting that Factor's core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.

The performance difference between, say, an array of pointers to boxed floats, and a raw array of floats, cannot be understated. Further levels of structure make the difference even more dramatic: an array of pointers to objects where each one has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter case, all data can be stored in one contiguous memory block, that appears as a simple byte array to the garbage collector.

Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation of being very fast as far as managed, GC'd languages go, suffers here. While Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object); however it does not offer arrays of objects where the objects are stored directly in the array. This negatively impacts performance, as you will see.

There is a buzzword that describes this approach of dealing with data: value semantics. If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.

I decided to whip up a simple benchmark. Here is what it does:
  • It works on points, which are triplets of single-precision floats, (x,y,z)
  • First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
  • Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
  • Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

Note that in-place operations, and re-using temporary objects, is allowed.

Here is the code:

The Factor version is both shorter, and has more blank lines.

Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the time word, as follows:
[ "benchmark.struct-arrays" run ] time

Run it a few of times so that the data heap can grow to a stable size.

The Java code is self-contained; run it with
java -Xms512m -server silly

The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to 'warm up'.

The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) runs the benchmark in 4.6 seconds, and the Factor version ran in 2.2 seconds using a Factor build from earlier this evening. I made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:
  • Added intrinsics for min and max so that when they are applied to values known to be floating point at compile-time, the SSE2 instructions MINSD and MAXSD are used.
  • Added some optimizations to unbox intermediate <displaced-alien> values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored 'in the middle' of a byte array.

The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. It takes the following optimizations to get good performance out of it:
  • All words declared inline are inlined, all higher-order functions are inlined.
  • Literal quotations (written in square brackets) are inlined at their call sites.
  • Macros, such as sum-outputs are expanded.
  • Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases, these operations, such as adding floats, map directly to machine instructions
  • The incrementing counter for initializing points is converted into a fixed-precision integer because interval and def-use analysis determine it is safe to do so
  • Escape analysis determines that various temporary objects can be stack-allocated:
    • closures created by various combinators, such as tri-curry, each, and so on
    • the struct array object created with <struct-array>
    • the struct class instances created inside the many calls to each
    • the struct created at the end to store the maximum value
  • Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all
  • Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers
  • Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating point registers, except for this that live across the sin/cos calls
  • While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 instruction
  • Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values
  • GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made, and no registers need to be saved and restored
  • The coalescing pass eliminates unnecessary register to register copy instructions that arise from SSA form, as well as ensuring that in most cases, the result of an arithmetic instruction is the same as the first operand; this helps x86 instruction selection
  • The register allocator assigns virtual registers to real CPU registers; in this case there is no spilling at all on x86-64

So what does this say about language expressiveness? Any dynamic language that a) offers the same set of metaprogramming tools as Factor b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. I invite you implement this benchmark in your favorite language and share your timings.

Monday, August 24, 2009

New tool for locating external resource leaks

While having garbage collection solves the problem of manually deallocating a block of memory when its lifetime ends, it doesn't help with external resources, such as sockets and file handles, at all. And even in a GC'd language, memory has to be managed manually sometimes, for example when passing data to and from a C library using FFI. So Factor has had a generic disposal protocol, as well as destructors, for quite some time. What was missing was tooling support.

I wanted to build a tool that would help me debug code that was leaking external resources by forgetting to dispose them. Thankfully, this doesn't come up often; as C++ programmers using RAII like to note, scoped destructors solve resource management in 90% of all cases, and Factor's resource management combinators are even more flexible. However, sometimes an external resource can have a complex lifetime, because of caching, pooling, and other advanced idioms. In these cases, having tools to help track leaks down can really help.

I took inspiration from Doug Coleman's managed malloc and free implementation. Whereas some languages use finalizers to ensure that an external resource gets cleaned up if you forget to dispose of it explicitly, I take the opposite approach; all active disposable objects are stored in a central registry, explicitly preventing the GC from cleaning them up.

The new tools.destructors vocabulary introduces two words, disposables. and leaks. The first word prints a tally of all active resources:

Clicking 'list instances' next to a disposable class opens an inspector window with all active instances of this resource; for example, here we can list all file descriptors that Factor has open right now:

You can even dispose of the resource right there, as if you had called the dispose word on it:

The leaks combinator compares active resources before and after a quotation runs, and lists any that were not disposed of. For example, in the following, I construct a file-reader, read a line, and never dispose of it:

Notice how a file descriptor, together with a few associated resources, was leaked as a result of this.

If I wanted to, I could click on 'show instances' and dispose of the input-port in the inspector; this would dispose of the other two associated objects as well. Also, opening the inspector on a disposable resource that was allocated inside a leaks call will display a stack trace, showing where in the code the resource was allocated. This can help pinpoint the offending code.

Of course, the correct way to read a line from a file in Factor is to use a combinator which cleans up the stream for you. In the following example, the resource leak has been fixed:

To define a new disposable resource, simply create a tuple class that subclasses disposable, make sure to construct it with new-disposable, and override the dispose* method.

Here is an example that manages a limited pool of "X"s, of which there are 100 total:

100 >vector xs set-global

: get-x ( -- x ) xs get pop ;
: return-x ( x -- ) xs get push ;

TUPLE: my-disposable-resource < disposable x ;

: <my-disposable-resource> ( -- disposable )
my-disposable-resource new-disposable get-x >>x ;

M: my-disposable-resource dispose* x>> return-x ;

The disposable resource machinery is built with very little code, and it is used throughout Factor already.

Status of the PowerPC port, and Snow Leopard's imminent release

Factor has had a PowerPC port since some time in 2004, when I got it running on my old G3 iMac, and I continued working on it after I switched to a Power Mac G5 as my primary workstation. However, since then, Apple decided to go with Intel chips, and I sold my G5, so I no longer own any PowerPC gear. As a result, the Factor PowerPC port has been neglected for a while now. Right now, the code in the master branch of the git repository bootstraps on PowerPC and basic things seem to work, but there are stability issues (it doesn't manage to finish a full continuous integration run). While I'll probably fix this particular issue and ensure that up to date binaries get uploaded, I can't help but wonder, now that OS X 10.6 is around the corner (without PowerPC support), is it worth my time to keep the PowerPC port going?

In addition to stability issues, the PowerPC port lags behind a bit as far as code quality goes. It doesn't do instruction scheduling, nor does it make use of some unique PowerPC features that would help improve performance (multiple condition registers, etc). AltiVec isn't used at all.

I have a couple of questions for the community:
  • Does anyone still care about PowerPC at all? I might be out of the loop here; is it about as relevant nowadays as DEC Alpha, or PA-RISC? Other than the game consoles and IBM's big iron gear, who uses it?
  • Is there any interest at all in a Linux/PowerPC port of Factor?
    I guess over time, people who still own older PowerPC-based Macs are going to switch to Linux, as Apple gradually ends issuing security updates and so on for PowerPC systems
  • Is anyone interested in taking over development of Factor's PowerPC support? It's only 1600 lines of code in basis/cpu/ppc, the assembler is simple to use and implemented declaratively, and the compiler's backend interface is well abstracted and consists of a few dozen small assembly templates. A lot of contributors have more than enough Factor experience to take this one on, I think. If you know some assembly language already, compiler backends are really not that complex in Factor, it just takes time to test them thoroughly.
  • Should I forget about PowerPC altogether, and spend the time that I would spend on it on reviving the Factor/ARM port instead? Last time I messed around with ARM (a few years ago) contemporary devices were pretty short on RAM; nowadays 256Mb RAM is starting to become common, and Factor would run much better.

Any comments on this matter are appreciated.

Monday, August 10, 2009

Global float unboxing and some other optimizations

In compiler implementation, "global" optimizations are those that apply to an entire procedure at a time, in contrast to "local" optimizations, which operate on individual basic blocks only. After some work, float unboxing is the latest optimization which is now done globally in Factor.

Factor's float unboxing optimization used to be part of local value numbering. Originally I planned on implementing global value numbering and trying to do float unboxing there, but since then I discovered a better way. I also improved the SSA coalescing pass I talked about last time, and implemented some optimizations which help compile the code in the math.vectors vocabulary more efficiently.

Global float unboxing

Instead of modeling float unboxing as an algebraic simplification problem (where unbox(box(x)) can be rewritten as x) I now model it as a representation problem, ie, the optimizer makes a global decision, for every virtual register, whether to store it as a tagged pointer, or an unboxed float.

When the low-level IR is built, no boxing or unboxing operations are inserted, and the compiler assumes all floating point values will end up unboxed. The new compiler.cfg.representations pass is responsible for inserting just enough boxing and unboxing operations to make the resulting code valid. This pass consists of several step.

Representations and register classes

A representation is like a low-level data type associated to a virtual register. At this stage, possible repesentations are tagged pointers, single-precision floats, and double-precision floats. Note that fixnums (which are stored directly inside a pointer, rather than being heap allocated) are considered as tagged pointers, and that single-precision floats are only used by the FFI. In the future, I will also have a representation type for untagged integers (so that Factor code can perform arithmetic on 32 and 64-bit numbers without boxing them) and various SSE vector types.

A register class is a set of machine registers that instructions can operate on. Every representation has a register class, and values in this representation must be stored in registers of that class. Multiple representations can share a register class; for example, single and double precision floats all go into floating point registers. The register allocator deals with register classes.

When the low-level IR is built, virtual registers do not have representations assigned to them. Instead, this mapping is established later, by the compiler.cfg.representations pass.

First step: compute possible representations for each virtual register

The first step to deciding what representation to use for a virtual register is to take stock of the possibilities. The compiler.cfg.representations.preferred vocabulary defines words which output, for each instruction, the preferred representation of the input values, and the preferred representation of the output value, if any.

For example, the ##slot instruction, which accesses an object's slot, prefers that its inputs and outputs be tagged pointers. On the other hand, ##add-float prefers that its inputs and outputs are unboxed floats (for otherwise, boxing is required), and ##integer>float takes an integer but outputs an unboxed float.

This information is collected into a single mapping, from virtual registers to sequences of representations.

Second step: computing costs for different representations

The next step is to compute the cost of keeping each virtual register in every possible representation for that register. The compiler uses a simple cost heuristic. The cost of keeping a register in a given representation R is obtained by iterating over all instructions that use that register, but expect to receive the value in a different representation from R. For each such instruction, its loop nesting depth is added up. Finally, if the representation R differs from the output representation of the instruction that computed the register, the loop nesting is added to the tally. This gives an approximation of the overhead of boxing and unboxing the value.

To calculate the loop nesting depth of an instruction, I use the standard natural loop detection algorithm. This is implemented in compiler.cfg.loop-detection. The details can be found in any moderately advanced compiler textbook; for each basic block, it identifies a set of loops this basic block belongs to (this set may be empty). A loop consists of a header, a set of more or more end blocks, and a set of blocks making up this loop. Multiple loops can share a header, but each loop can only have a single header because all control flow graphs that the low-level optimizer constructs are reducible (at least I think so).

Third step: representation selection

Once costs have been computed, it is easy to assign to each virtual register the representation that has the least cost. There is one subtlety here though; if an instruction outputs a tagged pointer, no other representation may be used for that virtual register. For example, in this code, it would not be sound to pick an unboxed float representation for the input parameter, since in one branch, it is used as a fixnum, and trying to perform a float unboxing on a fixnum will just crash:
: silly-word ( x -- y )
{ [ dup fixnum? ] [ 1 + ] }
{ [ dup float? ] [ 3 * ] }
} cond ;

The problem here is that in low-level IR, the ##peek instruction which loads the value from the data stack outputs a tagged value, and a priori the compiler cannot assume its going to be a boxed float that can be safely unboxed -- only the output of a float-producing instruction can be unboxed. A more complicated unboxing algorithm based on control-dependence was suggested to me by Daniel Ehrenberg. However, because the low-level IR is single-assignment, the potential overhead from the current unboxing scheme is not so great, and so I'm not worried about it right now.

Fourth step: inserting boxing and unboxing instructions

Once the compiler has determined the representation to use for each virtual register, it must now iterate over all instructions again, this time, inserting conversions whenever a virtual register is used with a representation other than the one that was chosen for it. This pass introduces new temporaries, and renames virtual registers used and defined in instructions, in order to preserve single assignment. Other than that, it is relatively straightforward.

Live-range coalescing revisited

Last time I mentioned that I had implemented a rather complex algorithm for converting out of SSA form and doing copy coalescing at the same time. After having some problems getting the algorithm working in certain corner cases, I decided to scrap it and implement something simpler instead. My new SSA coalescing algorithm uses the classical interference graph approach, except it doesn't actually have to build the interference graph, since interference testing on SSA form can be done in O(1) time.

The register allocator used to do its own coalescing, to eliminate copies inserted by the two-operand conversion pass. The two-operand pass converts
x = y + z

x = y; x = x + z

(and similarly for all other arithmetic operations). This makes arithmetic operations fit the x86 instruction encoding scheme, where the destination register is also the first register. Implementing copy coalescing twice, in two different ways, seemed inelegant to me. A big breakthrough was when I was able to get my SSA coalescing algorithm working on a relaxed SSA form, where values can be reassigned more than once, but only within the same basic block where they are defined. This means that the two-operand conversion pass can run before SSA coalescing, and the register allocator does not need to perform coalescing at all.

The new coalescing algorithm was not noticably slower than the old one (it still benefits from not having to construct an interference graph), the code is much simpler, and the register allocator's copy coalescing code is gone. The compiler feels cleaner as a result.

Improved tuple unboxing

I implemented escape analysis for tuple unboxing last year and recently came up with a way to improve it. Whereas previously, only freshly-allocated tuples could be unboxed, I realized that if an input parameter to a word is known to be an instance of an immutable tuple class, and the input parameter does not escape (ie, not passed to a subroutine call, or returned) then it can be unboxed, too. Whereas unboxing a freshly-allocated tuple involves deleting the allocation call, and replacing slot access with references to the inputs to the allocation call, unboxing an input parameter is a bit trickier. Actual code has to be inserted at the beginning of the word, which unpacks the tuple. This optimization is only a win if the same slot is accessed more than once within a word; in particular, its a win if the slot access occurs inside a loop. This situation came up with specialized arrays; a specialized array is a tuple with two slots, a byte array, and a length. To access the nth element of a specialized array involves an indirection, because the byte array has to be retrieved. By unboxing the specialized array before a loop, a slot access is eliminated from the loop. This gives the high-level optimizer a form of loop-invariant code motion (real loop-invariant code motion might appear in the low-level optimizer at some point, too).

Vector operations on specialized arrays

I used to use hints to make the math.vectors vocabulary faster. While vector words could be used on any sequence type, hints would automatically compile a specialized version of each vector word for the double-array type. This gave a nice speedup on benchmarks which performed vector operations on arrays of double-precision floats, but it didn't help at all for vector operations on other specialized arrays, and there was still at least one runtime dispatch per vector operation, to determine if the generic or hinted version should be used.

While hints have their place, I don't use them for vector words anymore. Instead, I cooked up some special compiler extensions for them (easier than it sounds; the compiler is pretty extensible, at least the pass that I needed to extend for this). Now, every specialized array type compiles a specialized set of vector words, and if the input types of a vector word are known at compile time, the right specialized version is used instead. Of course there's no actual code duplication here, the code is generated at runtime.

Benchmark results

The above improvements, as well as the changes I outlined in my previous post, give a nice speedup since the last time I did some benchmarks.
BenchmarkMay 31stJuly 17Aug 10
benchmark.backtrack 1.767561 1.330641 1.111011
benchmark.base64 1.997951 1.738677 1.536447
benchmark.beust1 2.765257 2.461088 2.360084
benchmark.beust2 3.584958 1.694427 1.691499
benchmark.binary-search 1.55002 1.574595 1.585475
benchmark.binary-trees 1.845798 1.733355 1.771849
benchmark.bootstrap1 10.860492 11.447687 9.962971
benchmark.dawes 0.229999 0.161726 0.104693
benchmark.dispatch1 2.015653 2.119268 2.252401
benchmark.dispatch2 1.817941 1.216618 1.629185
benchmark.dispatch3 2.568987 1.899128 2.692071
benchmark.dispatch4 2.319587 2.032182 2.073563
benchmark.dispatch5 2.346744 1.614045 1.369721
benchmark.empty-loop-0 0.146716 0.12589 0.12608
benchmark.empty-loop-1 0.430314 0.342426 0.351241
benchmark.empty-loop-2 0.429012 0.342097 0.350182
benchmark.euler150 16.901451 15.288867 13.046828
benchmark.euler186 8.8054349999999997.920478 7.997483
benchmark.fannkuch 3.202698 2.964037 2.859117
benchmark.fasta 5.52608 4.934112 5.135706
benchmark.gc0 2.15066 1.993158 2.03638
benchmark.gc1 4.984841 4.961272 5.075487
benchmark.gc2 3.327706 3.265462 3.376618
benchmark.iteration 3.736756 3.30438 3.22603
benchmark.javascript 9.79904 9.164517 9.000165000000001
benchmark.knucleotide 0.282296 0.251879 0.231547
benchmark.mandel 0.125304 0.123945 0.123321
benchmark.md5 0.946516 0.85062 0.84516
benchmark.nbody 3.982774 3.349595 2.204085
benchmark.nested-empty-loop-10.116351 0.135936 0.053609
benchmark.nested-empty-loop-20.692668 0.438932 0.390032
benchmark.nsieve 0.714772 0.698262 0.694443
benchmark.nsieve-bits 1.451828 0.907247 0.727394
benchmark.nsieve-bytes 0.312481 0.300053 0.225218
benchmark.partial-sums 1.205072 1.221245 1.152942
benchmark.random 2.574773 2.706893 2.403702
benchmark.raytracer 3.481714 2.914116 2.428384
benchmark.recursive 5.964279 3.215277 3.178855
benchmark.regex-dna 0.132406 0.093095 0.08619400000000001
benchmark.reverse-complement 3.811822 3.257535 3.003156
benchmark.ring 1.756481 1.79823 1.696271
benchmark.sha1 2.267648 1.473887 1.43003
benchmark.sockets 8.7942800000000018.783398 8.643011
benchmark.sort 0.421628 0.363383 0.367602
benchmark.spectral-norm 3.830249 4.036353 3.277558
benchmark.stack 2.086594 1.014408 0.91959
benchmark.sum-file 0.528061 0.422194 0.383183
benchmark.tuple-arrays 0.127335 0.103421 0.08908199999999999
benchmark.typecheck1 0.876559 0.6723440000000001 0.744205
benchmark.typecheck2 0.878561 0.671624 0.770056
benchmark.typecheck3 0.86596 0.670099 0.699071
benchmark.ui-panes 0.426701 0.372301 0.369152
benchmark.xml 2.351934 2.187999 1.630145