Sunday, November 30, 2008

New features in SMTP and SSL library allow sending e-mail through Gmail

Although I added SSL support to Factor's I/O library recently, it was missing a feature needed by SMTP-TLS: upgrading insecure connections to secure ones. Unlike HTTPS, where the client connects to a special port number and initiates a secure handshake as soon as the connection is established, SMTP uses a STARTTLS command, which is sent in plain text; handshaking only begins after this command is sent.

To accommodate this use case, I added two new words to the io.sockets.secure vocabulary: send-secure-handshake and accept-secure-handshake. They are meant to be used by a client and a server, respectively, that wish to upgrade the current connection to a secure one.

I also updated the smtp vocabulary to support the STARTTLS and AUTH PLAIN features of the SMTP protocol. The end result is that the smtp vocabulary can now send e-mail through Gmail's SMTP servers. An example is shown below; the key thing to notice is that we set the smtp-tls? and smtp-auth variables (along with smtp-server) before sending the e-mail. Beyond that, the SMTP library is easy to use and the code for sending an e-mail is succinct: we create an e-mail object, fill in some slots, and send it off:

"smtp.gmail.com" 587 <inet> smtp-server set
smtp-tls? on
"YourUserName@gmail.com" "YourPassword" <plain-auth> smtp-auth set

<email>
"yourname@example.com" >>from
{ "theirname@example.com" } >>to
"We should probably beer" >>subject
"For teh win" >>body
send-email

The SMTP library is still missing some functionality, such as support for character encodings, attachments, and custom headers; I'll be adding those over time.

Saturday, November 29, 2008

Faster generic arithmetic and integer overflow checking

While the Factor compiler does a pretty good job of inferring types and eliminating dispatch where possible, runtime polymorphism still ends up being used a lot. Integer arithmetic presents an additional challenge. Factor supports arbitrary-precision integers, and it tries to make this support transparent. However, the flip side of this is that even if the compiler can figure out that all of the values in an expression are integers, it may not be able to eliminate the overflow check, depending on whether or not the ranges of the values are known. Because of this, making runtime polymorphism efficient is still important. In the last few days, I worked on making integer arithmetic more efficient.
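As an aside, the transparency mentioned above is easy to see at the listener: adding 1 to the largest fixnum silently yields a bignum (an illustrative session; most-positive-fixnum lives in the math vocabulary):
( scratchpad ) most-positive-fixnum class .
fixnum
( scratchpad ) most-positive-fixnum 1 + class .
bignum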

Consider a word like +. If compile-time type information is not available, this word has to look at the types of the top two stack items, and take appropriate action depending on whether they are fixnums, bignums, ratios, floats, or complex numbers. Until now, this dispatch was implemented by two indirect jumps. The type of the first object is used to key a dispatch table, where each entry then looks at the type of the second object and keys into a second-level dispatch table; the entries in the latter are the actual methods implementing the arithmetic operation.

In Factor, the generic word system generates Factor code for performing method dispatch. If you look at the definition of a word such as + in an older build, you'll see this two-level jump table pattern, implemented with Factor's dispatch combinator:
( scratchpad ) \ + def>> .
[
over tag {
[
dup tag {
[ fixnum=>+ ]
[ [ >bignum ] dip bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >bignum bignum=>+ ]
[ bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ ratio=>+ ]
[ ratio=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >float float=>+ ]
[ >float float=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ >float float=>+ ]
[ float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ complex=>+ ]
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
} dispatch
]

The dispatch combinator compiles down to an indirect jump, and indirect jumps are expensive on modern CPUs. Also, note that this dispatch scheme did not favor any single numeric type over another. However, in practice, operations on bignums and ratios will always be slow, and dispatch overhead is negligible there; whereas code that works with floating point and complex numbers can only be fast if the compiler can infer types and unbox values statically, in which case there is no runtime dispatch at all. So the performance of runtime arithmetic dispatch on types other than fixnums is largely irrelevant in Factor. Indeed, the overwhelming majority of calls to generic arithmetic operations involve fixnums, so some kind of trick to avoid two indirect jumps in this common case is needed.

There are many solutions, but the one I settled on is pretty much the simplest one. In Factor, all values are tagged pointers (except for unboxed floats and such, which the compiler uses, but they cannot be "observed" in their unboxed state from the outside). A tagged pointer is a value where the low 3 bits encode a type, and the remaining bits encode a value. If the type tag is 0, the value is interpreted as an integer; otherwise, it is a pointer. All values in the heap are aligned on 8 byte boundaries, so if we mask off the 3 tag bits, we get a valid pointer.

So if we have two values, stored in registers x and y, then both have a tag of 0 if and only if their bitwise OR has a tag of 0. That is, we can check that both are fixnums with a single conditional.
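In Factor itself, the same check can be expressed with the tag primitive (a sketch for illustration only; the real both-fixnums? is a VM primitive that ORs the tagged pointers directly, without untagging them first):
USING: kernel kernel.private math ;

! Illustrative reimplementation: tag returns the low 3 bits of a
! value as a fixnum, and the OR of the tags is 0 iff both are 0.
: my-both-fixnums? ( x y -- ? ) [ tag ] bi@ bitor 0 = ;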

Here is the new Factor code generated for the dispatch performed by +:
( scratchpad ) \ + def>> .
[
both-fixnums?
[ { fixnum fixnum } declare fixnum=>+ ] [
over tag {
[
dup tag {
[ { fixnum fixnum } declare fixnum=>+ ]
[
{ fixnum bignum } declare
[ >bignum ] dip bignum=>+
]
[
{ fixnum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { fixnum ratio } declare ratio=>+ ]
[
{ fixnum float } declare [ >float ] dip
float=>+
]
[ { fixnum complex } declare complex=>+ ]
[
{ fixnum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ bignum fixnum } declare
>bignum bignum=>+
]
[ { bignum bignum } declare bignum=>+ ]
[
{ bignum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { bignum ratio } declare ratio=>+ ]
[
{ bignum float } declare [ >float ] dip
float=>+
]
[ { bignum complex } declare complex=>+ ]
[
{ bignum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ { ratio fixnum } declare ratio=>+ ]
[ { ratio bignum } declare ratio=>+ ]
[
{ ratio tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { ratio ratio } declare ratio=>+ ]
[
{ ratio float } declare [ >float ] dip
float=>+
]
[ { ratio complex } declare complex=>+ ]
[
{ ratio POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ float fixnum } declare
>float float=>+
]
[
{ float bignum } declare
>float float=>+
]
[
{ float tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { float ratio } declare >float float=>+ ]
[ { float float } declare float=>+ ]
[ { float complex } declare complex=>+ ]
[
{ float POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[ { complex fixnum } declare complex=>+ ]
[ { complex bignum } declare complex=>+ ]
[
{ complex tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { complex ratio } declare complex=>+ ]
[ { complex float } declare complex=>+ ]
[ { complex complex } declare complex=>+ ]
[
{ complex POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
} dispatch
] if
]

Notice that it is essentially the same as the old version, except the two-level jump table has been wrapped in a conditional. If both-fixnums? outputs true, we go directly to the fixnum case; otherwise we do the two-level dispatch.

Since both-fixnums? needs to bitwise OR pointers together, I had no choice but to make it a VM primitive. The non-optimizing compiler inlines a fixed code sequence for calls of both-fixnums?, and the optimizing compiler expands it into a series of low-level IR operations:
( scratchpad ) [ both-fixnums? [ "Fixnums!" ] [ "At least one is not a fixnum" ] if ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
##prologue
_label 1
##peek V int-regs 692262 D 0
##peek V int-regs 692263 D 1
##copy V int-regs 692264 V int-regs 692262
##or V int-regs 692264 V int-regs 692264 V int-regs 692263
##copy V int-regs 692265 V int-regs 692264
##and-imm V int-regs 692265 V int-regs 692265 7
_compare-imm-branch 3 V int-regs 692265 0 cc/=
_label 2
##inc-d 1
##load-indirect V int-regs 692269 "Fixnums!"
##replace V int-regs 692269 D 0
##epilogue
##return
_label 3
##inc-d 1
##load-indirect V int-regs 692270 "At least one is not a fixnum"
##replace V int-regs 692270 D 0
_label 4
##epilogue
##return

So the machine code generated for a generic arithmetic word such as + now begins as follows:
mov    (%r14),%rax
mov    -0x8(%r14),%rcx
or     %rcx,%rax
and    $0x7,%rax
cmp    $0x0,%rax
jne    slow_dispatch
... handle fixnum+fixnum here ...
ret
slow_dispatch:
... do table dispatch here ...

We load two values from the data stack, OR them together, mask off the low 3 bits, compare with zero, and branch. Indeed, those readers who know the ways of x86 assembly will recognize that there is a shorter code sequence which can accomplish the same task:
mov    (%r14),%rax
mov    -0x8(%r14),%rcx
or     %rcx,%rax
test   $0x7,%rax
jnz    slow_dispatch
... handle fixnum+fixnum here ...
ret
slow_dispatch:
... do table dispatch here ...

Once I figure out a nice way to incorporate the test instruction into my code generator and value numbering passes, I will switch things up and eliminate the unnecessary instruction there.

This takes care of making dispatch faster. The next challenge was speeding up integer overflow handling. If the optimizing compiler's high-level optimizer can prove that a call to + has fixnum inputs, and in addition, the result will not overflow -- either because the ranges of the input values are constrained, or the output is passed to bitand -- it will replace it with a call to the fixnum+fast primitive. The low-level optimizer turns a call to fixnum+fast into a single low-level IR instruction, and this becomes a single machine instruction. Very efficient.
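For example, masking an input with bitand constrains its range enough for the overflow check to disappear; the high-level optimizer produces something along these lines (an illustrative session from memory; the exact output depends on the build):
( scratchpad ) [ { fixnum } declare 127 bitand 1 + ] optimized.
[ 127 fixnum-bitand 1 fixnum+fast ]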

However, if the high-level optimizer cannot prove that the result will not overflow, but it does know the inputs are fixnums, it replaces the call to + with a call to fixnum+. This still avoids dispatch overhead; however, until now, the low-level optimizer did not do anything special for fixnum+: it compiled it as a subroutine call to a VM function which added the two integers and checked for overflow in C. This was slow, because C has no way to test the CPU's overflow flag.

I made the low-level optimizer aware of the fixnum+ word; it converts it into a low-level instruction which can then participate in value numbering and register allocation. The CPU-specific backend is then responsible for generating an optimal machine code sequence for fixnum+. On x86 and PowerPC, this uses the CPU's overflow flag, which is more efficient than trying to simulate the same thing in C. If the overflow flag is set after the addition, fixnum+ performs a subroutine call into the VM; in the common case, where there is no overflow, there is no call and the arithmetic is done entirely inline.

Here is an example of machine code emitted for fixnum+:
sub    $0x8,%r14
mov    (%r14),%rax
mov    0x8(%r14),%rcx
add    %rcx,%rax
jno    no_overflow
... handle overflow ...
no_overflow:
mov    %rax,(%r14)
retq

I omitted the lengthy setup for the subroutine call into the VM, and register assignments will vary depending on the placement of fixnum+ in the code.

The results were very interesting. Some benchmarks, such as spectral-norm and mandelbrot, showed no improvement, because the compiler eliminates all the dispatch and overflow checks from the inner loops anyway. Other benchmarks showed an impressive gain. This is good, because it means I didn't waste my time implementing the above, but also a little bad, because it shows that the high-level optimizer could probably do a better job of inferring types in some of these benchmarks!

Here are the results; I ran each benchmark 5 times, and all runtimes are in milliseconds.
Benchmark           Before                      After
fasta               9492 8985 8975 9310 9219    7841 7818 8115 7807 7777
reverse-complement  5815 5611 5628 5645 5645    5176 5152 5177 5173 5289
binary-trees        2911 2867 2870 2890 2894    2163 2142 2186 2168 2136
fib2                113 112 114 112 112         82 83 84 85 85
fib3                301 302 301 300 301         260 261 259 260 260
project-euler.134   427 427 427 426 424         314 313 315 318 313

Wednesday, November 26, 2008

Some recent UI rendering fixes

Over the last week or so I've fixed several OpenGL rendering problems in the Factor UI.

The first problem is that on Linux, with certain ATI video cards, calling glPolygonMode() to disable polygon fill does not work, and as a result the UI would render with no text visible; borders around buttons would be drawn as filled rectangles and this would overwrite all of the text. I fixed this and modernized the UI rendering code to use vertex arrays at the same time.

Then, I decided to address a long-standing issue: rectangles and lines were not rendered pixel-perfect on all systems. You see, with OpenGL, simply rendering a rectangle with integer co-ordinates doesn't give you the results you would expect. On my old Windows laptop, for example, rectangles would render with the bottom-right pixel missing. Getting this to work in a cross-platform manner is surprisingly hard. It took me a lot of back-and-forth to get consistent rendering across the machines I tested on (an Intel iMac from a year ago or so, my MacBook Pro with NVidia graphics, Linux and Windows in VirtualBox, and an old Dell laptop with Windows XP and Intel integrated graphics). For the record, the "correct" way to draw an outlined rectangle is to offset the top-left corner by 0.5,0.5, and the bottom-right corner by -0.3,-0.3. Similar heroics were required for line rendering. In the end, I made my job semi-automated by writing a simple program, now in the ui.render.test vocabulary, that renders some 2D shapes with OpenGL.
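Here is a sketch of the idea in immediate mode (illustrative only: draw-outlined-rect is a hypothetical word, and the real UI code uses vertex arrays):
USING: locals math opengl.gl ;

:: draw-outlined-rect ( x y w h -- )
    ! Nudge the top-left corner by +0.5 and the bottom-right by
    ! -0.3, per the offsets above, for consistent rasterization.
    GL_LINE_LOOP glBegin
    x 0.5 + y 0.5 + glVertex2d
    x w + 0.3 - y 0.5 + glVertex2d
    x w + 0.3 - y h + 0.3 - glVertex2d
    x 0.5 + y h + 0.3 - glVertex2d
    glEnd ;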

The vocabulary renders the shapes and then compares the results with a reference drawing. It reads the rendered pixels with glReadPixels() and loads the drawing with Doug's graphics.bitmap library.

Of course, it turned out that vertex arrays themselves have issues on at least one OpenGL implementation. I started to receive reports of random segfaults on older MacBooks with the Intel X3100 graphics chip. Chris Double then came across a mailing list post which explained the situation: with this driver, rendering a GL_LINE_LOOP vertex array causes a segfault. The workaround is to render a GL_LINE_STRIP where the last vertex is a copy of the first vertex.
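A minimal sketch of the workaround, assuming vertices is a sequence of { x y } pairs (safe-line-loop is a hypothetical name; the real fix lives in the UI rendering code):
USING: kernel opengl.gl sequences ;

: safe-line-loop ( vertices -- )
    ! Emulate GL_LINE_LOOP with GL_LINE_STRIP by repeating the
    ! first vertex at the end of the strip.
    dup first suffix
    GL_LINE_STRIP glBegin
    [ first2 glVertex2d ] each
    glEnd ;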

There are still a couple of OpenGL problems I haven't been able to resolve. A few users report random screen corruption on Windows machines with Intel graphics. There is also a problem where the Factor UI comes up black if used with Compiz on Linux, however this appears to affect any GL app -- even glxgears. The workaround is to enable indirect rendering with Compiz.

It has been three years since I first ported Factor's UI to OpenGL and I'm still finding new issues in drivers that need workarounds. It's a bit disappointing, but it does seem that in recent times, OpenGL driver quality is improving. I still think OpenGL is a good way to render graphics in a cross-platform manner; it has plenty of features and even runs on mobile platforms (OpenGL ES). Our UI only needs a tiny bit of platform-specific code for each platform: opening windows, creating GL contexts, receiving events, and working with the clipboard. Rendering is cross-platform.

A few people have suggested using Cairo instead, and while Factor has a Cairo binding now, I'm not keen on using it to render the entire UI. I worry that by switching from OpenGL to Cairo, we'll just have to replace one set of bugs and workarounds with another. Another problem is that Cairo would be another external dependency that deployed Factor applications would have to ship. It seems Cairo is quite bloated these days, so that would be unacceptable. One other alternative is to use native APIs: perhaps Cairo rendering on Linux, GDI+ on Windows, and CoreGraphics on Mac. However, this would require too much effort on my part, so it's not something I'm willing to dive into at this point.

Monday, November 24, 2008

Comparing the performance of Factor and V8

V8 is, of course, Google's new, much-hyped JavaScript VM. It combines a simple JIT with inline caching, and along with SquirrelFish, it promises to take JavaScript performance to the next level.

Factor does not do inline caching (although this is planned), however the language itself is more static and the compiler performs more advanced optimizations than the V8 JIT. So I thought it would be interesting to see how these two approaches to dynamic language implementation stack up.

I compiled the V8 trunk today on my Intel Core 2 Duo. Factor is of course the latest Factor from git.

I chose the following benchmarks from the language shootout:
  • spectral-norm
  • binary-trees
  • fasta
  • nsieve
  • nsieve-bits
  • partial-sums

Unfortunately, I couldn't get reverse-complement or a few others to work in V8, because they use the readline() function which does not appear to exist. It would have been interesting to compare more benchmarks.

I got the JavaScript versions from the shootout website; the Factor versions are in the Factor repository.

I ran the V8 benchmarks from the command line, like so:
./shell foo.js
And I ran the Factor benchmarks from within the Factor environment.

Without further ado, and without any attempt at post-mortem or analysis, here are the results. All running times are in seconds:
Benchmark      Factor  V8
binary-trees   3.1     3.8
fasta          8.9     12
nsieve         0.6     1.3
nsieve-bits    1.5     6.9
partial-sums   2.6     2.3
spectral-norm  29      85

Sunday, November 23, 2008

Factor presentation at OTUG

I will be giving a talk about Factor at OTUG, The Object Technology User's Group of Minneapolis/St Paul on December 16th. If you're in the Twin Cities area and are interested in Factor, come check it out! There will be free pizza afterwards.

Tuesday, November 11, 2008

Factor port to Win64 in progress

A year ago I tried to get a working gcc setup on 64-bit Windows with zero success. This time around, I had better luck; the MinGW-Win64 project has a functional gcc port out and I was able to use it to compile the Factor VM.

The main hurdle to overcome is that Win64 uses a different calling convention from Linux/x86-64 and Mac OS X/x86-64. I got the non-optimizing compiler generating valid code, and set things up so that we now have two x86.64 boot images, boot.winnt-x86.64.image and boot.unix-x86.64.image, containing different non-optimizing compiler backends (most of the backend is shared, but some parameters are configured differently). After some messing around, I got it to load various modules, including the optimizing compiler.

The main thing that tripped me up at this stage is that I did not have a working gdb, so certain problems were very hard to debug. For example, on my first reading of the Win64 calling convention, I missed the requirement that you must reserve 32 bytes in your stack frame for the callee to mess with; if you put non-volatile data there, then calling any function will clobber your stack frame. This manifested in a most bizarre fashion: Factor would work if the VM was compiled with -O3, and crash if compiled with -g. Usually we see the opposite, when we find gcc bugs; this time it was a bug in Factor, and the reason it would only crash with optimizations disabled is that the gcc optimizer would keep values in registers instead of scribbling over the volatile portion of the caller's stack frame. A day of lost work later, I finally figured out the problem, and it was much smoother from then on.

Next, I started updating the optimizing compiler backend. Again, the bulk of the work was in figuring out the calling convention. Win64 passes and returns structures differently from the other x86.64 platforms, so I had to add some OS-specific hooks to the x86.64 optimizing compiler backend too.

Overall, Microsoft's calling convention seems simpler than the one used by Linux and Mac OS X, especially as far as passing structures by value is concerned. Other than the initial hurdle of figuring out how the reserved scratch space in the stack frame works, I was able to get things working fairly quickly.

The tests all pass now, however testing some real C library bindings still reveals problems (meaning the FFI tests are not exhaustive enough, and need to be amended). I also need to do a lot of code cleanup before any of this is in a releasable state, so don't get your hopes up yet -- it will be a few more days before you can build Factor from git on Win64. Binary packages will follow soon after; this is mostly conditional on getting various external packages installed, such as PostgreSQL, SQLite, OpenSSL, and gdb. This is because the build farm tests our most important C library bindings, and doesn't release anything if the tests fail (which is quite nice; if a bug in a C library binding prevents something important like OpenSSL from working, you really don't want users to come along and download the result).

So while I work on finishing the port in the meantime, here is a screenshot from when I got the FFI just far enough along to render, in a very broken fashion (no text!), the UI workspace and a couple of demos.

Win64 seems rather neglected by the open source community; even gcc is only available in the experimental MinGW snapshots. However I think supporting it is important; moving to x86.64 can offer a performance boost from the additional registers and RIP-relative addressing mode, and the ability to address more than 4GB of RAM is going to be more and more important too. Very soon now Factor will be one of the few (or the only?) open source, natively-compiled, dynamic languages which can run on 64-bit Windows as well as 12 other platforms. Exciting times ahead.

Thursday, November 06, 2008

Further performance gains

One thing I've learned over the last few months is that minor improvements to performance, accumulated over time, are just as important as big architectural changes which result in big gains. With the new code generator in place, I spent some time working on a few tweaks I've been thinking about for a while.

Tuple dispatch overview


First up, I sped up method dispatch on tuples by eliminating some unnecessary conditional branches and memory accesses. In Factor, the object system is implemented in the language, and so all dispatch is done by generated Factor code. This code is generated at compile time. For example, we can define a tuple,
TUPLE: dog name breed ;

and look at the code generated for the breed accessor:
( scratchpad ) \ breed>> def>> .
[
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ dog eq?
[ drop { dog } declare dog=>breed>> ]
[ drop object=>breed>> ] if
] [ drop object=>breed>> ] if
]

As you can see above, the generated code uses various low-level primitives to accelerate dispatch. Usually you will never work with these primitives directly. The words named class=>generic are not real words in the dictionary; they are word objects corresponding to methods. So methods are essentially words, except they're not part of any vocabulary. Again, this is an implementation detail and you don't need to be aware of how methods are implemented behind the scenes. In the image where I defined this tuple, I had no other tuples with a slot named breed, so the generic accessor only has one method, and the code generated for a generic with one method is very simple. For generics with many methods, more complex code is generated; if you have enough methods, a hashtable dispatch is generated.
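For example, if we defined a second tuple type with a breed slot (a hypothetical continuation of the session above),
TUPLE: cat name breed ;

then breed>> would become a generic word with two real methods, and the generated dispatch code would grow accordingly.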

The general philosophy of method dispatch on tuples is as follows. To each tuple, we associate an integer which is the length of the superclass chain from the tuple up to the root class of tuples, tuple. I call this the tuple's "echelon". Every tuple instance in the heap holds a pointer to a tuple layout object in its first slot. The tuple layout object holds the echelon, the sequence of superclasses, and the tuple size.

The tuple class methods of a generic word are sorted into echelons, and dispatch proceeds by starting at the highest echelon and going down until a suitable method is found (or an error is thrown). At each echelon, the generic word looks at the corresponding entry in the superclass list of the tuple on the stack, and performs a hashtable lookup. So dispatch runs in O(n) time, where n is the maximum echelon number of the tuple classes participating in the generic word.

For example, suppose we have the following inheritance hierarchy -- it's contrived, because in practice you probably would not have square inherit from rectangle, but it demonstrates a point here:
tuple
  text
  shape
    polygon
      quad
        rectangle
          square
        parallelogram
      triangle
    circle

Now, suppose we have a generic word:
GENERIC: draw ( shape -- )

M: text draw ... ;
M: rectangle draw ... ;
M: square draw ... ;
M: parallelogram draw ... ;
M: circle draw ... ;
M: triangle draw ... ;

Note that if we pass a square instance to draw, we want it to invoke the M: rectangle method. Here are the participating classes, sorted into echelons:
level 1: text
level 2: circle
level 3: triangle
level 4: rectangle, parallelogram

Method dispatch proceeds as follows:
  • If the tuple on the stack has echelon >= 4, we get the 4th element in its superclass chain, and check if it's rectangle or parallelogram. If so, we dispatch to that method. Note that the 4th element of the superclass chain of both rectangles and squares is rectangle.
  • If the tuple on the stack has echelon >= 3, we get the 3rd element and check if it's a triangle. If so, we dispatch to that method.
  • If the tuple on the stack has echelon >= 2, we get the 2nd element and check if it's a circle. If so, we dispatch to that method.
  • If the tuple on the stack has echelon >= 1, we get the 1st element and check if it's text. If so, we dispatch to that method.

For this generic word, a maximum of four tests must be performed because the inheritance hierarchy is very tall. This situation is rare, since for the most part inheritance is very flat.

Method dispatch: removing conditionals and indirection


The first thing I realized is that every tuple has an echelon level of at least 1, since it must have itself and tuple in the superclass chain. So the conditional test on the final step is unnecessary. Furthermore, if all tuples have echelon 1, there is no need to even load the echelon of the tuple on the stack into a register.

Next, I decided to put the superclass list inside the tuple layout object itself, instead of having the tuple layout object reference an array. This removes one level of indirection, which reduces CPU cache pollution.

To illustrate, compare the code generated for the call word before these improvements:
[
dup tag dup 3 eq? [
drop dup 0 slot 8 fixnum-fast dup 6 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [
dup 2 eq? [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 5 slot dup 1 fixnum>= [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 4 slot
{ array } declare 3 slot { word } declare
dup \ curry eq?
[ drop { curry } declare curry=>call ] [
dup \ compose eq?
[ drop { compose } declare compose=>call ]
[ drop object=>call ] if
] if
] [ drop object=>call ] if
] [ drop object=>call ] if
] if
]

And after:
[
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ compose eq?
[ drop { compose } declare compose=>call ] [
dup \ curry eq?
[ drop { curry } declare curry=>call ]
[ drop object=>call ] if
] if
] [
dup 3 eq? [
drop dup { hi-tag } declare 0 slot dup 14 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [ drop object=>call ] if
] if
]

Method dispatch: faster hashcode lookup


If a generic word has more than 4 methods at the same echelon level, a hashtable dispatch is generated instead of a series of linear tests. Formerly, the generated code would indirect through the class word to look up the hashcode. Again, this is bad for locality, because it pulls the entire word object's cache line in for no good reason. So what I did was intersperse the hashcodes with the superclass chain in the tuple layout object. The tuple layout object will already be in the cache, since we just read one of the superclass chain entries, so reading the hashcode is essentially free.

Method dispatch: re-ordering tests


I noticed that the >fixnum generic word had its conditionals in an essentially random order. The cut-off between a series of conditionals and a jump table is 4 methods, and >fixnum, defining methods on every subtype of the reals, is the borderline case:
( scratchpad ) \ >fixnum def>> .
[
dup tag dup 5 eq?
[ drop { float } declare float=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if
]

For various reasons, most >fixnum calls are made with a fixnum on the stack. By sorting the methods by tag number before generating the generic, I was able to get it to look like this:
( scratchpad ) \ >fixnum def>> .
[
dup tag dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 5 eq?
[ drop { float } declare float=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if
]

This reduces the number of conditionals in the common case.

I/O system tweaks and string-nth intrinsic


I made some changes to io.ports and io.buffers, mostly adding inline declarations. I also added a compiler hint to the push word for the sbuf class; stream-readln pushes every character onto an sbuf, and adding this fast path eliminated some method dispatch. Finally, I made string-nth a compiler intrinsic rather than a call into the VM, which means that words where the compiler infers you're operating on strings will inline the intrinsic and perform fewer memory loads/stores and subroutine calls than before.
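The hint itself is a one-line declaration using the hints vocabulary; it looks something like this (illustrative; the exact specializer classes may differ from what's in the repository):
USING: hints sbufs sequences ;
HINTS: push { object sbuf } ;

This tells the compiler to generate a specialized version of push for the case where the sequence input is an sbuf, falling back to the generic version for everything else.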

Eliminating useless branches with value numbering


Take a look at what the high-level optimizer does to the following quotation:
( scratchpad ) [ [ { + - * } member? ] contains? ] optimized.
[
dup length swap 0 -rot \ ( gensym ) [
pick pick < [
swap
>r 2dup
>r
>r nth-unsafe dup \ * eq? [ t ] [ f ] if
[ drop t ] [
dup \ - eq? [ t ] [ f ] if
[ drop t ]
[ \ + eq? [ t ] [ f ] if [ t ] [ f ] if ] if
] if r>
r>
r>
swap
>r rot r>
swap
[ 2drop ]
[ >r >r 1 +-integer-fixnum r> r> ( gensym ) ] if
] [ 3drop f ] if
] label dup [ ] [ ] if [ t ] [ f ] if
]

A lot of dispatch is eliminated and methods are inlined, but there is some pretty crappy control flow redundancy there. I extended the low-level optimizer to eliminate it when building the low-level IR.

One of the worst offenders is the = word, defined as follows:
: = ( obj1 obj2 -- ? )
2dup eq? [ 2drop t ] [ equal? ] if ; inline

If obj2 is a word, then equal? expands into 2drop f, so we get
2dup eq? [ 2drop t ] [ 2drop f ] if

Then, dead code elimination converts this to
eq? [ t ] [ f ] if

Ideally, we'd get rid of [ t ] [ f ] if altogether. Instead of doing this in the high-level optimizer, I decided to detect this code pattern, along with its negation [ f ] [ t ] if (which is the inlined definition of not), and emit a branchless comparison instead:
( scratchpad ) [ \ + = ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
##prologue
_label 1
##load-indirect V int-regs 582552 +
##peek V int-regs 582553 D 0
##compare V int-regs 582555 V int-regs 582553 V int-regs 582552 cc=
##replace V int-regs 582555 D 0
##epilogue
##return

Now, suppose you have a code sequence like [ t ] [ f ] if [ 1 ] [ 2 ] if after inlining. Value numbering builds up an expression chain where one comparison depends on the result of another, and it is able to collapse them down to a single comparison.

Finally, the empty conditional, [ ] [ ] if, manifests in value numbering as a conditional branch instruction where both successors point to the same node. This is replaced with an unconditional branch.

Results


Here are the timings, in milliseconds, for 10 runs of the reverse-complement benchmark:
5309 5310 5342 5297 5305 5635 5344 5313 5338 5303

This represents a 35% improvement over last time.

Finally, bootstrap is faster too. On my laptop it now takes around 3 minutes 45 seconds, which is about 40 seconds faster than before.

Edit: some more minor optimizations improved reverse-complement runtime by another 500 milliseconds or so.

Monday, November 03, 2008

New low-level optimizer and code generator

I have pushed the new low-level optimizer and code generator. This is the second stage of a big compiler overhaul; the first stage was the new front-end and high-level optimizer. I first started working on the new code generator in July, but after spending a couple of weeks on it, put it aside because I first wanted to replace the front-end and high-level optimizer. I continued working on the code generator in early October.

The new low-level optimizer and code generator replaces the old one-pass code generator. While the old design was relatively simple and generated decent code, it was hard to extend and its limitations were really starting to show, especially on floating point code.

The new design is a complete departure from the old way of doing things. First of all, there is a low-level intermediate representation, consisting of a control flow graph of basic blocks. Each basic block is a list of instructions. Instructions operate on virtual registers, of which there is an unlimited supply, and each virtual register is only assigned once. This invariant, known as SSA, simplifies analysis and code generation.

Building low-level IR


The low-level optimizer receives high-level compiler IR as input. The first step is to convert it to low-level IR. This is done by the compiler.cfg.builder vocabulary. There are two primary differences between high-level IR and low-level IR:
  • High-level IR represents control structures as nested trees. This explains the compiler.tree vocabulary name. So a piece of code like [ [ 1 + ] [ 1 - ] if 2 * ] becomes a list of three nodes, an #if, a #push, and a #call, where the #if has two child nodes. Low-level IR models the control flow graph directly, so a conditional becomes a node with two successors, where both successors have a common successor. Loops create actual loops in the graph.
  • High-level IR is stack-based; stripping out value names gives a valid stack program. Low-level IR makes all stack operations explicit; for example, a call to a primitive such as eq? is a single node in high-level IR, whereas in low-level IR it expands into instructions to read the top two stack items, an instruction to compare the two values in registers, and finally, instructions to store the boolean back on the stack:
    ##peek V int-regs 7 D 1 
    ##peek V int-regs 8 D 0
    ##inc-d -2
    ##compare V int-regs 9 V int-regs 7 V int-regs 8 cc=
    ##inc-d 1
    ##replace V int-regs 9 D 0

By breaking down higher-level stack operations into lower-level register operations, redundancy is revealed, and subsequent optimization passes will eliminate it.

The high-level IR to low-level IR conversion is very naive. Other than tail-call optimization, it makes no attempt to eliminate any redundancy in the resulting IR. For example, floating point primitives expand into stack loads, followed by unboxing, followed by the operation itself, and finally a boxing operation. Redundant boxing and unboxing of intermediate values is hidden in high-level IR but becomes explicit in low-level IR.

Useless block elimination


Once the CFG has been built, optimization begins. There are a total of 8 optimization passes; I won't talk about all of them, only the interesting ones. The first optimization pass, compiler.cfg.useless-blocks, deletes basic blocks which consist solely of a jump to their successor, as well as conditional blocks where both branches jump to the same basic block. The former case occurs because the low-level IR builder does not perform any optimization. The latter case occurs after some code is optimized by the high-level optimizer. For example, suppose you're using contains? to stop iterating early, and don't actually care about the result:
( scratchpad ) [ [ ] contains? drop ] optimized.
[
dup length swap 0 -rot \ ( gensym ) [
pick pick < [
swap
>r 2dup
>r
>r nth-unsafe r>
r>
r>
swap
>r rot r>
swap
[ 2drop ]
[ >r >r 1 +-integer-fixnum r> r> ( gensym ) ] if
] [ 3drop f ] if
] label dup [ ] [ ] if [ ] [ ] if
]

Note that the two empty conditionals at the end came up because of inlining and dead code elimination. The high level optimizer's dead code elimination doesn't get rid of the empty conditionals, but the low level optimizer does.

The old code generator did not delete empty conditionals.

Height normalization


After all stack operations in a basic block are expanded into low level IR, there might be a lot of redundant changes to the stack height. The compiler.cfg.height pass takes care of these. For example, >r rot r> translates as follows,
##peek V int-regs 63 D 0 
##inc-d -1
##inc-r 1
##replace V int-regs 63 R 0
##peek V int-regs 64 D 2
##peek V int-regs 65 D 1
##peek V int-regs 66 D 0
##inc-d -3
##inc-d 3
##replace V int-regs 64 D 0
##replace V int-regs 66 D 1
##replace V int-regs 65 D 2
##peek V int-regs 67 R 0
##inc-r -1
##inc-d 1
##replace V int-regs 67 D 0

After height normalization,
##peek V int-regs 68 D 0 
##replace V int-regs 68 R -1
##peek V int-regs 69 D 3
##peek V int-regs 70 D 2
##peek V int-regs 71 D 1
##replace V int-regs 69 D 1
##replace V int-regs 71 D 2
##replace V int-regs 70 D 3
##peek V int-regs 72 R -1
##replace V int-regs 72 D 0

After the height normalization pass runs, the only height changes that remain are at the start of basic blocks, and all ##peek and ##replace instructions can be viewed as reads and writes of distinguished arrays.

The old code generator performed the same optimization, except stack height changes were finalized at the end of a basic block, not at the start. This makes no difference in practice.

Alias analysis


I won't attempt to give a general description of alias analysis here; I'll only explain how it applies to Factor. Alias analysis is implemented in compiler.cfg.alias-analysis. The low-level IR builder expands stack operations naively into memory loads/stores. Once stack height changes are coalesced at the start of a basic block, all stack reads and writes look just like array accesses with constant indexes. In addition to this, we of course also have reads and writes of real heap-allocated arrays and tuples. Alias analysis eliminates redundant memory operations by identifying the following patterns:
  • If the same location is read twice with no writes in between, the second read is replaced with a copy.
  • If a location is read after being written, the read is replaced with a copy of the value that was written there.
  • If a location is written to twice with no read in between, the first write is eliminated.

After alias analysis runs, each stack location is read or written at most once per basic block, and writes follow reads. For array slots, things are more complicated. Because two references may in general point to the same array, some optimizations are invalid (for example, if we read from one array, write to another, then read from the first array, the latter read cannot be eliminated, because the result depends on whether the two arrays are equal or not). However, a freshly allocated array or tuple is known to be different from any other object, so more aggressive optimizations can be performed in this case.

One situation where alias analysis really helps is with inline allocation of arrays. Consider the 2array word. After type inference and inlining, the high-level IR looks like so:
( scratchpad ) \ 2array optimized.
[ 2 f <array> tuck >r r> 3 set-slot tuck >r r> 2 set-slot ]

The semantics of the <array> word stipulate that the new array be filled with the initial element, in this case f. However, the array elements are immediately overwritten with stack items. Instead of making 2array and similar words into primitives, or adding a new unsafe constructor for an array filled with garbage, the low level IR builder expands <array> into several instructions: first a memory allocation, then an initial value fill. This makes the redundancy explicit. We get the following before alias analysis:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 97 16
##replace V int-regs 97 D -2
##load-immediate V int-regs 98 7
##replace V int-regs 98 D -3
##peek V int-regs 99 D -3
##allot V int-regs 100 16 array V int-regs 101
##load-immediate V int-regs 102 16
##set-slot-imm V int-regs 102 V int-regs 100 1 3
##set-slot-imm V int-regs 99 V int-regs 100 2 3
##set-slot-imm V int-regs 99 V int-regs 100 3 3
##replace V int-regs 100 D -2
##peek V int-regs 103 D -1
##peek V int-regs 104 D -2
##replace V int-regs 104 D -3
##replace V int-regs 103 D -2
##replace V int-regs 104 D -1
##peek V int-regs 105 D -3
##replace V int-regs 105 R -1
##peek V int-regs 106 R -1
##replace V int-regs 106 D -3
##load-immediate V int-regs 107 24
##replace V int-regs 107 D -4
##peek V int-regs 108 D -2
##peek V int-regs 109 D -3
##set-slot-imm V int-regs 108 V int-regs 109 3 3
##write-barrier V int-regs 109 V int-regs 110 V int-regs 111
##peek V int-regs 112 D 0
##peek V int-regs 113 D -1
##replace V int-regs 113 D -2
##replace V int-regs 112 D -1
##replace V int-regs 113 D 0
##peek V int-regs 114 D -2
##replace V int-regs 114 R -1
##peek V int-regs 115 R -1
##replace V int-regs 115 D -2
##load-immediate V int-regs 116 16
##replace V int-regs 116 D -3
##peek V int-regs 117 D -1
##peek V int-regs 118 D -2
##set-slot-imm V int-regs 117 V int-regs 118 2 3
##write-barrier V int-regs 118 V int-regs 119 V int-regs 120
##epilogue
##return

Note the large number of ##peek and ##replace instructions, most of which are redundant stack traffic, as well as the multiple calls to ##set-slot-imm, some of which are redundant. After alias analysis eliminates the unnecessary stack and array operations, we end up with:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 121 16
##load-immediate V int-regs 122 7
##copy V int-regs 123 V int-regs 122
##allot V int-regs 124 16 array V int-regs 125
##load-immediate V int-regs 126 16
##set-slot-imm V int-regs 126 V int-regs 124 1 3
##peek V int-regs 127 D -1
##copy V int-regs 128 V int-regs 124
##copy V int-regs 129 V int-regs 124
##copy V int-regs 130 V int-regs 124
##load-immediate V int-regs 131 24
##copy V int-regs 132 V int-regs 127
##copy V int-regs 133 V int-regs 124
##set-slot-imm V int-regs 132 V int-regs 133 3 3
##write-barrier V int-regs 133 V int-regs 134 V int-regs 135
##peek V int-regs 136 D 0
##copy V int-regs 137 V int-regs 124
##replace V int-regs 137 D 0
##copy V int-regs 138 V int-regs 124
##copy V int-regs 139 V int-regs 124
##load-immediate V int-regs 140 16
##copy V int-regs 141 V int-regs 136
##copy V int-regs 142 V int-regs 124
##set-slot-imm V int-regs 141 V int-regs 142 2 3
##write-barrier V int-regs 142 V int-regs 143 V int-regs 144
##epilogue
##return

Note that the redundant memory operations were eliminated, but new redundancies have appeared in the form of ##copy instructions. The next pass eliminates these.

Value numbering


Value numbering is implemented in compiler.cfg.value-numbering. The value numbering pass identifies sets of virtual registers which always contain the same value. In each congruence class, the first virtual register computing the value is known as the leader. References to virtual registers other than the leader are replaced with the leader in subsequent instructions.

To figure out which virtual registers always contain the same value, value numbering makes a pass over instructions and builds a value graph. This is a directed acyclic graph where the vertices are called expressions. Every instruction is considered in turn. The instruction's input virtual registers are replaced with the leaders in their value class. Next, an expression is built from the instruction. The expression is simplified using algebraic identities. If the expression does not have an associated value number, the instruction's output register becomes the leader of a new value class.

Copy propagation falls out as a special case of value numbering. Because the destination of a copy always has the same value as the source, subsequent references to the destination are replaced with the source register.

Floating point unboxing is also a special case. Consider the following low level IR, generated by the Factor code [ float+ 3.0 float* ]:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##peek V int-regs 160 D 0
##peek V int-regs 161 D -1
##unbox-float V double-float-regs 162 V int-regs 160
##unbox-float V double-float-regs 163 V int-regs 161
##add-float V double-float-regs 164 V double-float-regs 162 V double-float-regs 163
##box-float V int-regs 165 V double-float-regs 164 V int-regs 166
##replace V int-regs 165 D 0
##load-indirect V int-regs 167 3.0
##replace V int-regs 167 D -1
##peek V int-regs 168 D 0
##peek V int-regs 169 D -1
##unbox-float V double-float-regs 170 V int-regs 168
##unbox-float V double-float-regs 171 V int-regs 169
##mul-float V double-float-regs 172 V double-float-regs 170 V double-float-regs 171
##box-float V int-regs 173 V double-float-regs 172 V int-regs 174
##replace V int-regs 173 D 0
##epilogue
##return

The result of float+ is boxed, written to the stack, then loaded from the stack and unboxed. Clearly, this is inefficient.

After alias analysis, the redundant stack store/load is eliminated, but there is still a useless boxing and unboxing, and some useless copies:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##peek V int-regs 190 D 0
##peek V int-regs 191 D -1
##unbox-float V double-float-regs 192 V int-regs 190
##unbox-float V double-float-regs 193 V int-regs 191
##add-float V double-float-regs 194 V double-float-regs 192 V double-float-regs 193
##box-float V int-regs 195 V double-float-regs 194 V int-regs 196
##load-indirect V int-regs 197 3.0
##copy V int-regs 198 V int-regs 195
##copy V int-regs 199 V int-regs 197
##unbox-float V double-float-regs 200 V int-regs 198
##unbox-float V double-float-regs 201 V int-regs 199
##mul-float V double-float-regs 202 V double-float-regs 200 V double-float-regs 201
##box-float V int-regs 203 V double-float-regs 202 V int-regs 204
##replace V int-regs 203 D 0
##epilogue
##return

Note that register 194 is boxed into register 195, which is then copied into register 198; finally, register 198 is unboxed into register 200. The value graph contains this path. The expression simplifier then realizes that register 200 always contains the same value as register 194, and thus subsequent uses of register 200 are replaced with register 194.

In addition to eliminating unnecessary boxing and unboxing of floating point values, value numbering also optimizes conditionals. For example, [ 100 [ ] times ] generates the following high-level IR:
( scratchpad ) [ 100 [ ] times ] optimized.
[
100 0 swap \ ( gensym ) [
over over fixnum<
[ >r >r r> r> >r 1 fixnum+fast r> ( gensym ) ]
[ 2drop ] if
] label
]

Implemented naively, fixnum< would compare two values and push a boolean, then if would check if the boolean is true or false. The redundancy here stems from the fact that on a real CPU, you can compare two values and then branch if the result of the comparison is a less-than; there is no need to make a boolean (which itself involves a conditional). The expression simplification machinery used by value numbering converts calls of if where the boolean was produced by a known primitive into more efficient code.
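After this simplification, the fixnum< / if pair compiles down to a single compare-and-branch instruction in the machine representation, along these lines (illustrative; register numbers and label are made up):
_compare-branch 3 V int-regs 10 V int-regs 11 cc<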

Another neat simplification concerns type checks. The dispatch code generated for generic words often contains code like the following,
[ dup tag 2 eq? [ ... ] [ ... ] if ]

The tag primitive outputs the low 3 bits of a pointer, as a fixnum. Tag #2 is used for tuple objects. Here is the low level IR generated for a call to tag:
##peek V int-regs 208 D 0 
##and-imm V int-regs 209 V int-regs 208 7
##shl-imm V int-regs 210 V int-regs 209 3
##replace V int-regs 210 D 0

Note that the low 3 bits of the pointer are masked, then the result is shifted to the left to make it into a fixnum. Of course, if the result of the shift is being compared against a literal fixnum, the shift is redundant; instead, we can shift the literal to the right by 3. Indeed, after value numbering, tag 2 eq? generates this low-level IR:
##peek V int-regs 211 D 0 
##and-imm V int-regs 212 V int-regs 211 7
##shl-imm V int-regs 213 V int-regs 212 3
##load-immediate V int-regs 214 16
##copy V int-regs 215 V int-regs 213
##compare-imm V int-regs 216 V int-regs 213 16 cc=
##replace V int-regs 216 D 0

Note that in the above, the ##shl-imm instruction still remains even though its result is not used, and there is a ##load-immediate whose result is not used either. Just like alias analysis inserts redundant copy instructions, value numbering leaves behind instructions which compute values that are never used; if an instruction computes a value that is already available in another register, subsequent uses of this instruction's output register are replaced, but the instruction still remains. The next pass eliminates such useless instructions.

The old code generator performed some optimizations that value numbering can do, such as float unboxing and efficient generation of integer and float conditionals, but it was not as general and there were many redundancies it did not catch.

Dead code elimination


Dead code elimination is an essential part of any optimizer. It is implemented in compiler.cfg.dead-code. Note that the high-level optimizer performs dead code elimination as well, in compiler.tree.dead-code. The rationale is that many optimizations are easier to implement if they can leave behind code which computes values that are never used, instead of trying to eliminate all redundancy in one pass. In the low-level optimizer, both alias analysis and value numbering can leave behind many instructions whose results are no longer used. Dead code elimination builds a graph of def-use chains, then takes the transitive closure starting from the registers used by instructions having side effects observable outside of the basic block, such as ##set-slot and ##replace; any instructions whose destination registers are not part of this closure can be eliminated, since their values are never used. This pass also eliminates ##replace instructions which store a value to the same stack location from which it was read. This can arise if inlining produces an idempotent stack shuffle such as swap swap, or if value numbering reduces an operation to a no-op.
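For instance, an idempotent shuffle like swap swap boils down to stores that write each value back to the stack location it was read from (illustrative IR):
##peek V int-regs 10 D 0
##peek V int-regs 11 D 1
##replace V int-regs 10 D 0
##replace V int-regs 11 D 1

The two ##replace instructions are deleted, after which the ##peek instructions are dead and are eliminated as well.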

The old code generator took a different approach: since it generated code in one pass, it had to take care not to emit dead code in the first place. This complicated things; for example, constants were not loaded into registers until absolutely necessary, and complicated state had to be maintained to ensure that no stack location was read or written redundantly.

Write barrier elimination


The compiler.cfg.write-barrier pass attempts to speed up slot assignment. Because Factor uses a generational garbage collector, every store into an object slot must potentially mark the card holding the object. However, there are three cases when the write barrier can be omitted:
  • If the slot value is immediate -- that is, a fixnum or f
  • If the object was allocated in the same basic block; new objects go in the nursery, so the object cannot be older than the slot value
  • If the object has already been stored to in this basic block; there is no need to generate more than one write barrier since the GC can only run at the beginning of a basic block

The first case is handled by the low-level IR builder; it uses type information from the high-level IR to omit the write barrier in this case. The latter two cases are handled by a simple analysis that identifies fresh allocations and objects which are written to more than once, deleting ##write-barrier instructions which are seen to be redundant.

Here is the result of applying all the optimizations described so far to the 2array word:
_label 0 
##prologue
_label 1
_gc
##inc-d -1
##load-immediate V int-regs 566340 16
##allot V int-regs 566343 32 array V int-regs 566344
##set-slot-imm V int-regs 566340 V int-regs 566343 1 3
##peek V int-regs 566346 D -1
##set-slot-imm V int-regs 566346 V int-regs 566343 3 3
##peek V int-regs 566355 D 0
##replace V int-regs 566343 D 0
##set-slot-imm V int-regs 566355 V int-regs 566343 2 3
##epilogue
##return

This is as tight as it gets: there are no redundant memory operations, and at this stage, recoding 2array as a primitive in the VM would offer no benefit at all.

Write barrier elimination is the last pass of the CFG optimizer. The old code generator performed similar optimizations for storing immediate values into slots, and for storing into freshly allocated objects. It did not eliminate the write barrier in the third case, however.

After this point, the control flow graph is no longer needed, and the CFG is converted into another form called the machine representation. This form consists of a flat list of instructions, with branches and labels instead of control flow graph edges. Otherwise, the instructions are the same as in the CFG representation.

Machine representation


The machine representation builder, or linearizer, in compiler.cfg.linearization, traverses the CFG in reverse post order, generating labels and jumps as appropriate. The linearizer inserts GC checks in basic blocks which allocate memory; this is done after value numbering, since value numbering eliminates many allocations. The linearizer also performs a simple optimization concerning how conditionals are laid out in memory. Sometimes, reversing a conditional results in one less jump. For example, consider the following code: [ [ 1 + ] unless 3 * ]. Without any type information, the old code generator would emit the following assembly:
0x0000000108e577e0: mov    $0x108e577e0,%rbx
0x0000000108e577ea: pushq $0x20
0x0000000108e577ef: push %rbx
0x0000000108e577f0: sub $0x8,%rsp
0x0000000108e577f4: mov (%r14),%rax
0x0000000108e577f7: cmp $0x7,%rax
0x0000000108e577fb: je 0x108e5780a
0x0000000108e57801: sub $0x8,%r14
0x0000000108e57805: jmpq 0x108e5781c
0x0000000108e5780a: mov $0x8,%rbx
0x0000000108e57814: mov %rbx,(%r14)
0x0000000108e57817: callq 0x10894fc20
0x0000000108e5781c: mov $0x18,%rbx
0x0000000108e57826: mov %rbx,0x8(%r14)
0x0000000108e5782a: add $0x8,%r14
0x0000000108e5782e: add $0x18,%rsp
0x0000000108e57832: jmpq 0x1089663a0

Note that if the comparison fails, the je branch falls through, and then we immediately jump again. Instead, the comparison can be reversed, eliminating the double jump. The new code generator produces the following assembly for the above snippet:
0x000000010980a0e0: mov    $0x10980a0e0,%rax
0x000000010980a0ea: pushq $0x20
0x000000010980a0ef: push %rax
0x000000010980a0f0: sub $0x8,%rsp
0x000000010980a0f4: sub $0x8,%r14
0x000000010980a0f8: mov 0x8(%r14),%rax
0x000000010980a0fc: cmp $0x7,%rax
0x000000010980a100: jne 0x10980a11c
0x000000010980a106: add $0x8,%r14
0x000000010980a10a: mov $0x8,%rax
0x000000010980a114: mov %rax,(%r14)
0x000000010980a117: callq 0x1092e9560
0x000000010980a11c: add $0x8,%r14
0x000000010980a120: mov $0x18,%rax
0x000000010980a12a: mov %rax,(%r14)
0x000000010980a12d: add $0x18,%rsp
0x000000010980a131: jmpq 0x1092ed700

Note that the conditional was reversed into a jne, and now there is no control flow redundancy.

Two-operand conversion


Up until this point, the CFG and machine representations are in SSA (static single assignment) form; each virtual register is defined only once. Instructions are in "3-address" form; an addition is conceptually similar to this C code:
int x = y + z;

This is a problem for x86, where all arithmetic instructions store the result in the first input; x86 is a "2-address" architecture:
y = y + z;  /* i.e., y += z */

To be able to generate code for the x86, the compiler.cfg.two-operand pass rewrites
int x = y + z;

into
int x = y; x = x + z;

This pass inserts many ##copy instructions into the IR, and the resulting IR is no longer in SSA form. Register allocation eliminates the redundant copies: in the above example, the copy of y into x is redundant if y is never used again, in which case we can just destructively modify y. On PowerPC, this pass is not used, since it is a 3-address architecture.
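As a sketch, the rewrite amounts to something like the following; these are simplified stand-ins rather than the real compiler.cfg.two-operand code:

#include <stdio.h>

/* A 3-address SSA instruction: dst = src1 op src2. */
struct insn {
    int dst, src1, src2;
    char op;
};

/* Rewrite dst = src1 op src2 into the two-operand form
       dst = src1       (a ##copy instruction in the real IR)
       dst = dst op src2
   Because the input is SSA, dst is always a fresh register, so the
   copy can never clobber src2. Here we just print both steps. */
void two_operand(struct insn *i)
{
    if (i->dst != i->src1)
        printf("v%d = v%d\n", i->dst, i->src1);
    printf("v%d = v%d %c v%d\n", i->dst, i->dst, i->op, i->src2);
}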

The old code generator used a complicated system of "intrinsics" and "templates" to select instructions. There was no low-level IR, and hence no SSA; instead, various ad-hoc heuristics were used to ensure that registers did not get clobbered. This was error-prone, and in the more complex cases, the code generator would give up and spill all registers to the stack before reloading them again. The new approach is much better.

Register allocation


The compiler.cfg.linear-scan vocabulary implements a register allocator. This uses the second-chance binpacking variation of linear scan, amended with a simple form of coalescing to eliminate redundant copies inserted by the two-operand conversion pass. Linear scan register allocation is very clever and I will not attempt to describe it here. Two good references are:

- "Linear Scan Register Allocation" by Massimiliano Poletto and Vivek Sarkar
- "Quality and Speed in Linear-scan Register Allocation" by Omri Traub, Glenn Holloway, and Michael D. Smith

Note that the original formulation of linear scan register allocation required scratch registers to be reserved for reloading spilled values, and these scratch registers could not be used for anything else. This is unacceptable on x86, which has a serious shortage of registers. The binpacking variant of linear scan uses live interval splitting to ensure that there is always a register available for reloading spill slots.

Linear scan seems to work quite well in practice, with very little spilling, even on x86.
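For a flavor of the basic algorithm, here is a compact C sketch of the original Poletto-Sarkar formulation, without the live interval splitting, second-chance binpacking, or coalescing that the real pass adds:

#include <stdbool.h>

#define NUM_REGS 4      /* a tiny register file, for illustration only */

struct interval {
    int start, end;     /* live range, in instruction numbers */
    int reg;            /* assigned register, or -1 if spilled */
    int spill_slot;     /* assigned stack slot, or -1 */
};

static struct interval *active[NUM_REGS]; /* intervals currently in registers */
static int active_count;
static bool reg_free[NUM_REGS];
static int next_spill_slot;

/* Retire active intervals that ended before the current position,
   returning their registers to the free pool. */
static void expire_old_intervals(int position)
{
    int j = 0;
    for (int i = 0; i < active_count; i++) {
        if (active[i]->end < position)
            reg_free[active[i]->reg] = true;
        else
            active[j++] = active[i];
    }
    active_count = j;
}

/* All registers are taken: spill whichever conflicting interval ends
   furthest away (the classic heuristic). */
static void spill_at_interval(struct interval *cur)
{
    int idx = 0;
    for (int i = 1; i < active_count; i++)
        if (active[i]->end > active[idx]->end) idx = i;
    if (active[idx]->end > cur->end) {
        struct interval *victim = active[idx];
        cur->reg = victim->reg;            /* steal the victim's register */
        victim->reg = -1;
        victim->spill_slot = next_spill_slot++;
        active[idx] = cur;
    } else {
        cur->spill_slot = next_spill_slot++;
    }
}

/* Intervals must be sorted by increasing start position. */
void linear_scan(struct interval *intervals, int n)
{
    active_count = 0;
    next_spill_slot = 0;
    for (int r = 0; r < NUM_REGS; r++) reg_free[r] = true;
    for (int i = 0; i < n; i++) {
        struct interval *cur = &intervals[i];
        cur->reg = -1;
        cur->spill_slot = -1;
        expire_old_intervals(cur->start);
        if (active_count == NUM_REGS) {
            spill_at_interval(cur);
        } else {
            for (int r = 0; r < NUM_REGS; r++)
                if (reg_free[r]) { cur->reg = r; reg_free[r] = false; break; }
            active[active_count++] = cur;
        }
    }
}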

The old code generator used a very ad-hoc register assignment scheme (I don't even want to call it a "register allocator"). Basically, it would spill all values to the stack (and box them, if they were floats!) as soon as it ran out of registers. So for example, if on x86 it ran out of integer registers but not floating point registers, and you had some unboxed floats in the floating point registers, they would get boxed anyway, leading to terrible performance. To make matters worse, one scratch register had to be reserved because of the on-the-fly nature of the code generator. The new register allocator does not suffer from any of these problems. If you run out of registers, only a minimum number of values will be spilled, spilling floats never boxes them, and there is one additional register available for temporaries on x86.

It's pretty embarrassing that until now, Factor's compiler did not have a real register allocator. We got pretty far with the ad-hoc approach but it was time to do things correctly.

Code generation


After register allocation, machine code is generated using CPU-specific hooks. Since low-level IR instructions are much more fine-grained than Factor primitives, there is a corresponding reduction in the complexity of a CPU backend. Despite offering improved performance, the new code generator actually requires less CPU-specific code to be written. Other than that, there isn't much new here, except that jump tables used by generic dispatch, as well as loop entry blocks, are now aligned on the correct boundaries for x86.
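Conceptually, a backend is now little more than a table of small per-instruction emitters driven by a machine-independent loop, something like this C sketch (all names invented for illustration; they are not the real backend hooks):

#include <stdio.h>

enum insn_type { INSN_PEEK, INSN_REPLACE, INSN_ALLOT, INSN_COUNT };

struct insn { enum insn_type type; /* operands elided */ };

/* Each CPU backend supplies one small emitter per low-level IR
   instruction; the machine-independent driver just dispatches. */
typedef void (*emitter)(struct insn *);

static void x86_emit_peek(struct insn *i)    { printf("mov ...   ; ##peek\n"); }
static void x86_emit_replace(struct insn *i) { printf("mov ...   ; ##replace\n"); }
static void x86_emit_allot(struct insn *i)   { printf("add ...   ; ##allot\n"); }

static emitter x86_emitters[INSN_COUNT] = {
    [INSN_PEEK]    = x86_emit_peek,
    [INSN_REPLACE] = x86_emit_replace,
    [INSN_ALLOT]   = x86_emit_allot,
};

/* Walk the machine representation, dispatching on instruction type. */
void generate_code(struct insn *mr, int len, emitter *backend)
{
    for (int j = 0; j < len; j++)
        backend[mr[j].type](&mr[j]);
}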

Some performance numbers


Soon I'm going to do a comprehensive performance review comparing Factor's progress over the last year. In the meantime, here are some quick performance measurements of the latest code against the Mac OS X binary from the 21st of October. The computer in question is a MacBook Pro with a 2.4 GHz Intel Core 2 Duo. For each benchmark, I performed 10 trials, with a full GC before each trial. Times are in milliseconds.
dawes
  old codegen: 176 179 175 175 174 175 175 178 174 175
  new codegen: 145 149 147 148 147 147 148 148 150 148
mandel
  old codegen: 299 298 300 299 297 297 297 298 296 300
  new codegen: 157 156 155 156 156 155 154 155 157 154
nsieve
  old codegen: 996 1003 997 1009 1000 1007 1001 1008 1003 1001
  new codegen: 996 954 956 966 961 966 961 964 962 958
nsieve-bits
  old codegen: 2398 2403 2397 2412 2415 2410 2403 2397 2405 2424
  new codegen: 1840 1838 1835 1840 1833 1833 1839 1847 1829 1833
nsieve-bytes
  old codegen: 353 355 347 345 354 356 355 354 350 355
  new codegen: 319 327 323 321 318 323 324 322 319 324
partial-sums
  old codegen: 3217 3209 3232 3209 3227 3237 3277 3231 3210 3209
  new codegen: 2863 2908 2881 2880 2876 2871 2875 2861 2871 2868
raytracer
  old codegen: 6536 6559 6538 6550 6545 6554 6533 6556 6508 6581
  new codegen: 4918 4918 4936 4906 4907 4941 4935 4931 4929 4898
reverse-complement
  old codegen: 9376 9229 9252 9247 9483 9224 9280 9266 9559 9347
  new codegen: 9304 9024 9242 9163 9279 9182 9177 9024 9218 8990
recursive
  old codegen: 8416 8398 8532 8397 8599 8728 8644 8675 8554 8876
  new codegen: 9074 8880 8898 8922 8821 8770 8824 8875 8874 8801
spectral-norm
  old codegen: 71282 71251 71224 71172 71184 71179 71181 71418 71268 71183
  new codegen: 29585 30331 29927 29873 29644 29829 29905 29864 29954 29897

Amazing speedups on mandel and spectral-norm! The mandel benchmark is now more than 10 times faster than it was before I rewrote the high-level optimizer, and raytracer is almost 3 times faster than it was a month ago.

Some observations


When I first started writing the new code generator, it was a completely distinct body of code from the old one. I got as far as implementing alias analysis and value numbering, and then I decided this was too drastic. Instead, I took a different approach, incrementally refactoring the old code generator and merging in the new pieces. First I got the old code generator to produce low-level IR instead of machine code, while still performing all optimization in one pass. Then I got rid of the ad-hoc register allocation it performed and implemented linear scan. The next step was to refactor it to produce SSA code; after that, I merged in the optimization passes I had already implemented earlier. The result was that I was able to replace the entire code generator while testing it incrementally along the way. I managed to avoid a bootstrap time regression, despite the fact that the new code generator does a lot more work than the old one, by ensuring that it "pays for itself" by generating better code, and by optimizing some other things.

What's next


As you can see from the benchmarks, performance has improved, especially on floating point code. There is one regression, on the "recursive" benchmark; the PowerPC backend is not done; and there are a couple of bugs. Those should all be easy to fix. Longer-term, the next thing I will work on is global optimization. The new code generator, like the old one, only optimizes within basic blocks. This means, for example, that a floating point loop which keeps floats on the stack will box them on every iteration; in general, loops load and store on the stack. Solving this involves making alias analysis and value numbering smarter, and replacing linear scan register allocation with something more advanced, such as chordal graph coloring.

There are other things I want to do as far as performance goes, other than working on the compiler itself. I'm going to investigate inline caching to speed up generic word call sites which are monomorphic, but the compiler is unable to infer this statically. I will also look into various techniques for speeding up polymorphic dispatch. Finally, I will be optimizing the garbage collector at some point.

Saturday, November 01, 2008

West coast trip: 100% success

I got back to Minnesota yesterday. The trip was a lot of fun.

I gave a tech talk at Google. I had less than a week to prepare the talk (we decided to go on short notice) so it was a bit rough but people seemed interested anyway. There were some excellent questions at the end. We hung out in the Bay Area for a few days and met up with many cool people -- Andrew Hintz, Cameron Zwarich, Matthew Willis, Samuel Falvo and Eugene Ciurana.

The VPRI talk in Glendale was even better. It was similar to the Google talk but with some things added and removed. The audience here had more of a background in these types of programming languages so I went a bit faster. Alan Kay attended the talk, which was pretty cool. There was a lot of interesting discussion afterwards with Luke Gorrie, Alessandro Warth, and everyone else whose names I sadly forgot.

At Galois in Portland I only had one hour to present so I had to cut out some of the juicy meat, but it was pretty good anyway. Brian Rice made the trip down from Seattle, and we also got to hang out with Joe Groff, a Factor contributor, and of course Don Stewart and Eric Mertens (another contributor) from Galois. Joe Groff got his friend Jonathan Leto interested in Factor; after we demonstrated Factor's FFI and got some pizza going, he started working on a GSL binding. Cool stuff.

We met lots of interesting people and generated a bit of interest in Factor; lots of new nicknames in #concatenative in the last few days.

Now that I've done enough promotion for the time being, it's time for massive focused action. I'm working on a new low-level optimizer and code generator; this is why I haven't blogged much in the last 3 weeks or so. I first mentioned the new codegen in an older post. Since then, I put it aside to rewrite the high-level optimizer as well; with that out of the way I went back to work on the codegen.

I'll describe it in detail once it is done, but early results are encouraging, both in performance and in the quality and simplicity of the code.

Finally, a photo. Here the #concatenative crew poses in front of Cameron's employer, a somewhat obscure computer company based in Cupertino:

From left to right: Slava, Eduardo, Cameron (cpst), Matthew (yuuki), Doug (erg).