Thursday, August 30, 2007

Partially redrawing an OpenGL scene in response to exposure events and state changes

Most OpenGL applications redraw the entire scene every time a refresh has to be performed. There are exceptions, however. For example, to maintain responsiveness and conserve CPU time, an UI toolkit should only redraw the gadgets which need redrawing, either as a result of exposure events or gadget state changes.

It seems that partial redraws together with double buffering are not well-documented or understood by OpenGL programmers. Since partial redraws are very important to the Factor UI, I'll document what I've learned so far in this blog entry.

Factor's UI always sets the viewport dimensions to the entire window, however a scissor rectable is set using glScissor() to clip rendering to the region being redrawn. Gadgets are always rendered to the back buffer. Once the exposed region has been redrawn, the problem is now to bring it into the front buffer.

The GLX spec does not state whether glXSwapBuffers() takes the scissor rectangle into account.

On some OpenGL implementations, for example the NVidia drivers on Windows, Mac OS X and Linux, the current scissor rectangle is taken into account, and so the Factor UI can just request that buffers be swapped, and this only copies the requested region to the front buffer.

On Windows machines with Intel 9xx graphics adapters, this does not work, and SwapBuffers() copies the entire back buffer to the front buffer. As I documented in yesterday's entry, there exists a special WGL API for copying a partial region of the back buffer to the front buffer. This appears to fix the problem.

On my Linux/PPC machine with an ATI Radeon 9200 card and the open source "radeon" driver built in to X.org, glXSwapBuffers() also ignores the back buffer. Now, the MESA OpenGL implementation used has an extension GLX_MESA_copy_sub_buffer() which adds a glXCopySubBufferMESA() function for copying part of the back buffer to the front buffer, however, this function is buggy! It works some of time, but other times makes the entire screen go black, or kills the X server altogether.

There is a general fallback routine for copying part of the back buffer, but this is sometimes slower than redrawing the entire scene, at least on the Linux/PPC box described above. You can use glCopyPixels() to copy pixels from the back buffer to the front buffer, after setting the destination with glRasterPos():
: gl-raster-pos ( loc -- )
first2 [ >fixnum ] 2apply glRasterPos2i ;

: gl-copy-pixels ( loc dim buffer -- )
>r fix-coordinates r> glCopyPixels ;

: swap-buffers-slow ( -- )
GL_BACK glReadBuffer
GL_FRONT glDrawBuffer
GL_SCISSOR_TEST glDisable
GL_ONE GL_ZERO glBlendFunc
clip get rect-bounds { 0 1 } v* v+ gl-raster-pos
clip get flip-rect GL_COLOR gl-copy-pixels
GL_BACK glDrawBuffer
glFlush ;

Note the fidgety calculations to convert Factor UI co-ordinates ((0,0) is top left corner) to OpenGL window co-ordinates ((0,0) is lower left corner).

I'm still unsure if SwapBuffers() respects the scissor region on other platforms and OpenGL implementations, and if not, what is the correct way to copy part of the back buffer to the front buffer in the fastest possible way.

One other solution I need to look into is finding a way to be notified when the back buffer becomes invalid. If we redraw only part of the back buffer, then call SwapBuffers() to flip the entire back buffer with the front buffer, then in many cases, the previous content in the back buffer is still valid. It becomes corrupt when the video card needs to reclaim 3D memory for other applications, though. If there was a notification when volatile image buffers become out of date, I could force a full redraw on the next refresh, even if only a few gadgets need to be redrawn.

Wednesday, August 29, 2007

C/C++

Every time you say "C/C++" and lump two very different languages together as one, you're telling the world that you don't know either language. Idiomatic C code is very different from idiomatic C++. Good C++ code has very little that is actually C. Even the problem domains they are used in are different. Usually "C/C++" comes up in Java apologist rants ("Java is unmaintainable you say? Well I spent 10 years chasing memory leaks in "C/C++", it was far worse!") however sometimes you see informed people saying "C/C++". Please, don't.

OpenGL redraw bug on Windows

A while ago, "glguy" reported an issue with Factor 0.89: the UI was not being redrawn properly on Windows. I borrowed a Windows laptop and reproduced the exact same issue, and it turns out both me and glguy have the same graphics adapter: an Intel Express GM915.

I found a workaround: if I disable double buffering and don't call SwapBuffers after redrawing the window, then the issue disappears, and furthermore, it appears as if double buffering is still performed!

This is really strange, and I don't think this workaround will work elsewhere, unless WGL always double buffers, and uses hardware page flipping if you explicitly ask for it, or something.

If anybody has UI redraw issues and they're not using this adapter, or not running Windows at all, let me know!

Update: I found the right fix. On Windows, the glSwapHintRectWIN function must be called before SwapBuffers when doing a partial redraw. Turns out this function is not necessary with most drivers, but Intel drivers are picky when it comes to this. Good to resolve this one, finally. The fix is in the darcs repository.

Tuesday, August 28, 2007

Units and reversable computation

Since we don't want space probes written in Factor to crash because of calculations on mismatching units, Doug Coleman wrote a units library:
10 feet 30 inches d+ 56 s d/

It can add, multiply and divide dimensioned quantities while converting on the fly and making sure all the units make sense (10 miles 30 teaspoons d+ is a runtime error).

There was one problem with this library though, and it was that converting a dimensioned quantity back to a scalar required some boilerplate; for example, we have the following code:
: inches ( n -- dimensioned ) 2.54 * cm ;
: inches> ( dimensioned -- n )
dup 0 { m } { } =dimensions?*
dimensioned-value 100 * 2.54 / ;

The inches word converts a number of inches to centimeters; cm in turn converts it to meters, which is the canonical representation of distances, then creates a dimensioned quantity. However the inches> word is ugly, and it is all boilerplate since the defintion of inches has all the required information.

Let's take a look at cm:
: cm centi m ;

And its auxilliary words:
: centi 100 / ;
: m ( n -- dimensioned ) { m } { } <dimensioned> ;

As you can see, <dimensioned> is the canonical constructor.

Now, enter Daniel Ehrenberg's inverse library. While Dan originally intended to use it for pattern matching, this library is a perfect fit for auto-generating words for converting dimensioned quantities back to scalars.

Indeed, out of the box, it cannot invert <dimensioned>, because it calls natural-sort which is not an invertible operation. However, we can explicitly define an inverse for <dimensioned> (not an inverse in the mathematical sense, since the original function is not one-to-one; strictly speaking, this is a section, that is, a function which satisfies only one of the two conditions of an inverse).

And now, everything works; here, undo is Dan's inversion combinator:
  100 cm [ inches ] undo .
5000/127

So 100 cm constructs a dimensioned object, after converting the dimension to canonical form (meters in the case of distance):
  100 cm .
T{ dimensioned f 1 { m } { } }

And [ inches ] undo takes a dimensioned quantity, and works backwards in order to obtain a number, which when passed to inches, would give the original quantity:
  [ inches ] undo .
5000/127

Lets take a look at the code generated by [ inches ] undo:
  [ inches ] [undo] .    
[ >dimensioned< { } =/fail { m } =/fail 100 * 127/50 / ]

It takes the dimensioned quantity apart, makes sure it is a distance (and not a time, etc), then multiplies the quantity itself by a conversion factor.

So we can convert a scalar represented by unit to any other unit in this way; we can also do stuff like:
  10 miles 30 km d+ [ yards ] undo .
19205600/381

What is shorter, a quarter of a mile or 400 meters?
  1/4 miles 400 m d< .
f

We also support compound units, such as miles per second, miles per second squared, etc. How many teaspoons in a liter plus a tablespoon?
  1 L 1 tablespoons d+ [ teaspoons ] undo .
77937/379

For volume units, the canonical representation is meters cubed:
  15 teaspoons .
T{ dimensioned f 379/5120000 { m m m } { } }

Here is a unit with a denominator:
  20 knots .
T{ dimensioned f 463/45 { m } { s } }

The inverse library automatically guards against invalid conversions:
  30 km 10 s d/ [ teaspoons ] undo .
Unification failed

This is seriously cool stuff. The implementation of units is very simple and elegant, too; only half of the code has to be written, because we can use inverse!

The inverse library depends on two things; the simple mapping between syntax and semantics (the composition of two functions is their concatenation, therefore the inverse of a concatenation of two functions is a concatenation of their inverses!) and it also depends on quotations being sequences rather than opaque code blocks. Two qualities which are unique to Joy dialects such as Factor.

HTTP server on factorcode.org

The Factor HTTP server on factorcode.org has been running non-stop since May 7th; that's almost 4 months. This site doesn't get a whole lot of traffic, but it seems that the I/O code and garbage collector are in good shape these days.

Monday, August 27, 2007

Back from Austin

I'm back in Ottawa. Doug, Ed, and all of their friends are very cool and lots of fun to hang out with! Once again, my luggage has been delayed en route, and I'll only get it Tuesday at the earliest. I wonder if the baggage handling infrastructure runs on Java, implemented by the regulars from JavaLobby?

Factor 0.90 will be released before the end of the month. I only have two things left on my to do list! I need to test on all platforms, then I have to fix one last bug concerning UI redraw on certain Windows systems.

I think I will aim for shorter release cycles in the future; say, 2 or 3 weeks. I'll limit the amount of new functionality in each release, so that it doesn't take so long for things to settle down. In the long term, I might try to get rid of releases altogether, and instead have daily builds; a daily build will only be uploaded if all unit and integration tests pass, and if all contributed modules load, and maybe even if there are no major performance regressions in benchmarks. It would be cool to automate something like this.

Sunday, August 26, 2007

When to use locals

Some people have wondered why I implemented the locals library, and if in fact I'm turning my back on stack-based programming and "selling out". In fact this is not the case.

First of all, Factor has already had named variables in the core for quite some time; these are found in the namespaces vocabulary, which implements very simple dynamically-scoped namespaces. There are many cases where dynamic scope is useful, and languages which lack it often end up with a lot of code which passes parameters down long function call chains, with only the leaf functions actually using the parameters. Examples of dynamic scope usage in Factor include the current stream, and various idioms for sequence and association construction. Here dynamic scope is the right thing to use, and it makes total sense, and it cleans up code when certain values are not passed around on the stack.

In a handful of places, we also used dynamic variables as temporaries in calculations. I think once the right abstraction is found, stack code is almost always simpler and cleaner than the equivalent applicative code using named temporaries. But sometimes you have a math formula which operates on 10 values at once, or something similar, and it simply makes more sense to name these intermediate values. In these cases, we used dynamic variables. There are a handful of examples of this so far, such as the MD5 algorithm, and a couple of routines in the calendar library.

However when one uses dynamic variables for temporaries, the dynamic scope semantics no longer help and in fact get in the way; anybody who has used a Lisp with all-dynamic scope can testify. So while most values which would be lexically-scoped in a Lisp program are passed on the stack in Factor, sometimes you still want to name temporaries, and named locals with lexical scope are useful here, because they allow you to create closures and safely pass them around without running into the "funarg problem" which had the Lispers stumped through the 60's.

I never opposed having named variables in Factor, because if I did, they wouldn't be there from day one. Until recently, I didn't feel the need for lexical scope much, though, and when asked about it, my standard reply would be that the stack is Factor's equivalent of lexical scope, and that if somebody coded it up, I'd be happy to add it as an extra library.

Eduardo eventually implemented somewhat inefficient lexical scope in his extra/rewrite-closures library, which still exists because unlike extra/locals, it has first-class environments. I decided to code extra/locals because I wanted to see if I could make locals as efficient as stack values, and the experiment was a success.

So to recap, Factor now has three main methods of passing values between words, and storing temporary values within words:
  • The stack. Factor is and always will be a stack-based language; the whole point of the project is to explore stack-based programming. I encourage beginners to learn to think in terms of the stack before jumping into other abstractions, and for most code, the stack works really well. It is really interesting that in a stack language, function composition is expressed as concatenation, and compared to Lisp, Factor code has a lot less nesting and arbitrary introduction of temporaries. Especially now that Factor supports efficient partial function application, quotations can have internal state, and this brings much of the benefits of closures to the world of the stack.
  • Dynamically scoped variables. These are useful for a set of issues totally orthogonal to temporaries: they're used to give words context-specific behavior, and when an algorithm spanning multiple words needs to operate on a common set of values.
  • Lexically scoped variables. These are used to name temporary values inside calculations which otherwise would be hard to express with the stack. Hopefully this set will reduce over time, but for now, it includes any kind of hairy mathematical operation such as found in cryptography, numerical methods, and so on.

Having more abstractions to choose from when writing code is good, and I think a stack language can express the widest range of them in the most natural possible way.

FactorCon wrap-up

Today is my last day in Austin. On Friday, we went jet-skiing; as it turns out, I'm not so good at driving one, so after a brief attempt which climaxed in me tipping over the jet-ski (with all three people on it), I delegated driving responsibilities to Doug's friend, also named Doug (common name in Texas?).

Eduardo had to go back to work. Me and Doug did a did a bit more hacking, Ed-less -- we got non-blocking UDP working on Windows, and Doug started working on a directory entry change notification binding for Windows. This won't be ready until Factor 0.91, but I'm going to implement the Unix side; kqueue on BSD, inotify on Linux, and whatever crap Solaris has on Solaris. Having a cross-platform directory change notification API is something I've wanted for a while; many other language implementations don't bother with features like this, but I consider it essential.

I also optimized the MD5 library a little bit, and improved the locals library (you can have local word definitions, and locals are writable). Doug updated the libs/math library for the new module system. There's a lot of cool code there; quaternions, polynomials, combinatorics, numerical integration... indispensable tools for hard-core programming.

I'd like to elaborate on one point regrading writable locals. Consider the following Lisp function:
(defun counter (n)
(values
(lambda () (prog1 n (setf n (1- n))))
(lambda () (prog1 n (setf n (1+ n))))))

This returns a pair of closures which increment and decrement a shared counter. Each closure returns the current value of the counter before changing it. The initial value of the counter is the value passed to the function, and each pair of closures has a unique internal counter value.

Using the locals library, the Factor equivalent is:
:: counter | n! |
[ n dup 1- n! ]
[ n dup 1+ n! ] ;

The ! suffix in the list of local parameter names indicates the local should be writable; now, the n! local word can be used to write to it. The Factor code is almost identical to the Lisp code, except with fewer parentheses.

In both cases, you see that we differentiate between binding and assignment, and also we are able to write to a local closed over in an outer scope; the n! word does not create a new internal variable in the counter quotation, but modifies the value in the lexical environment of the counter word.

On the other hand, some languages, such as Python, claim to support lexical closures, however the essential property of closures -- that they close over the lexical scope -- is not preserved, and assigning to a local simply creates a new binding in the innermost scope! Guys, that is retarded! Look at the workarounds these people use. It is really no better than Java's anonymous inner classes, and if you want to call those "closures", you may as well say that C# is a "dynamic" language.

Wednesday, August 22, 2007

FactorCon summer 2007

I'm in Austin right now with my girlfriend - we're staying at Doug Coleman's house. We've been eating mexican food, swimming, and having lots of fun in general. Doug, Eduardo and I have also been hacking up a storm. I've met Doug before when he came to Ottawa, but this is my first time hanging out with Eduardo. He's a really chill guy, sleeps during the day and then stays up hacking until 4 in the morning -- if you ever meet him, ask him to make you some vegan breakfast tacos, they're good at any time of day or night. It is really good to meet hackers face-to-face and write code together -- when one of us gets stuck, the others can help out, and the round-trip time is much shorter than if we were on IRC. I also got a chance to meet Ryan Murphy, who contributed a couple of Factor modules (priority heaps and EditPadPro support); he also lives in Austin.

We've been working on many things:
  • Optimizing generic word dispatch (Slava)
  • Updating the Solaris x86 port (Slava)
  • Windows IPv6 and UDP (Doug)
  • Updating the math library for the new module system (Doug)
  • Optimizing the boids demo (Eduardo)
  • Macros (Slava and Doug)

The macro feature is pretty cool. Factor has had compiler transforms for a while, but macros make them easier to use. For example, if you want to write a word which drops n elements from the stack, a first approximation might be:
: ndrop ( n -- ) [ drop ] times ;

However, this is inefficient since the compiler is unable to infer the stack effect, and in any case, if n is constant we would like to unroll the loop and not have to do it at run time. So we can use a compiler transform:
: [ndrop] ( n -- quot ) \ drop <repetition> >quotation ;
: ndrop [ndrop] call ;
\ ndrop 1 [ [ndrop] ] define-transform

The [ndrop] word outputs a quotation which drops the top n stack elements; this quotation does not loop, it is just drop repeated a number of times. The definition [ndrop] call for the ndrop word is only used in the interpreter or when the value of n is not known at compile time. The compiler transform is invoked when the value is known at compile time, and it outputs a quotation to replace the word with. So in this example, 4 ndrop compiles to drop drop drop drop, which is optimized to a single instruction which decrements the stack pointer. No looping at all.

However, there is a bit of boilerplate in the above code which becomes apparent in more complex examples. With macros, you can now write:
USE: macros

MACRO: ndrop ( n -- ) \ drop <repetition> >quotation ;

So a macro receives its inputs and outputs a quotation -- the compiler invokes macros at compile time and replaces the macro call with the generated quotation, and the interpreter generates a quotation by calling the macro every time.

The cool thing is that the MACRO: word can be defined without modifying the compiler or anything; it is just a simple wrapper around define-transform. Check it out. The same file also defines some short-circuiting boolean operators as macros.

A more involved example of macrology can be found in extra/shuffle - this file defines some hardcore shuffle words (duplicating, dropping, swapping arbitrary numbers of stack elements) and the implementation in terms of macros looks nicer than the implementation using compiler transforms directly.

In fact these macros are exactly like Common Lisp macros -- but whereas people say Common Lisp macros are a key, fundamental language feature which differentiates Lisp from other languages, in Factor we're able to add macros as a library (and it only takes 5 lines). Another "fundamental" Lisp feature, lexically scoped closures, is also just a Factor library -- and it compiles to efficient code to boot, without the compiler having to know anything about lexical scope! I really think Factor is a better Lisp than Lisp -- it pulls off the extensible language tricks just as well or better, and the implementation is simpler and more transparent to boot.

Factor 0.90 is almost done -- I've finished most of the remaining loose ends. Updating the Solaris port was the last big item. This didn't take long because Doug had a Parallels virtual machine with Solaris installed and ready to go.

I'm still in Austin for a few more days. Going to go bike riding, jetskiing hopefully, and I also need to buy a cowboy hat! We're also going to meet up with the RScheme guy. Eduardo knows him from way back in the day; before joining the Factor community, Eduardo was a hard-core Schemer. He wrote a window manager in Scheme which was the precursor to Factory, and he also wrote a DNS server; I've been lobbying for him to port this DNS server over to Factor too. Now that we have UDP support on all platforms, it would be a great stress-test for the network backends!

Friday, August 17, 2007

The great Factor computer shootout

I ran the latest suite of benchmarks on my machines:
  • Power Mac G5 2.5Ghz
  • Athlon 64 2.4GHz
  • Pentium 4 2.8GHz
(I didn't include the ARM-based Gumstix this time.)
Benchmark PowerPC G5Athlon 64Pentium 4
bootstrap1 14805 14217 12617
bootstrap2 437594 326091 412635
continuations 441 312 435
dispatch1 8092 6041 5473
empty-loop 485 209 565
fib1 190 79 129
fib2 847 191 279
fib3 1396 492 580
fib4 4797 2410 3311
fib5 2483 1391 1857
iteration 18200 11353 10810
mandel 4556 2148 4418
nsieve 2603 5069 2372
nsieve-bits 56640 58543 45725
partial-sums 27351 25615 19230
raytracer 25862 23301 24538
recursive 38736 12120 13662
reverse-complement17536 20827 13091
ring 9827 7470 8979
sort 1809 1885 1153
spectral-norm 31879 9383 11731
sum-file 335 277 239
typecheck1 6959 2820 2429
typecheck2 7171 2243 1919
typecheck3 4596 1950 1883
typecheck4 4887 1914 1756

As you can see, the PowerPC backend lags somewhat. The Pentium 4 box is surprisingly fast -- and at 6 years old, it's the oldest machine of the bunch!

gcc is open sores software

In Factor 0.90, I added a new function to vm/os-linux.c which returns the path of the current executable. However it seems that neither gcc 4.0.3 nor gcc 4.2.1 can compile this file on x86. I get an error like the following:
vm/os-linux.c: In function 'vm_executable_path_and':
vm/os-linux.c:20: error: unable to find a register to spill in class 'DIREG'
vm/os-linux.c:20: error: this is the insn:
(insn:HI 16 83 17 2 (parallel [
(set (reg:SI 2 cx [66])
(unspec:SI [
(mem:BLK (reg/v/f:SI 63 [ suffix ]) [0 A8])
(reg:QI 0 ax [70])
(const_int 1 [0x1])
(reg:SI 2 cx [69])
] 20))
(use (reg:SI 19 dirflag))
(clobber (reg/f:SI 68 [ suffix ]))
(clobber (reg:CC 17 flags))
]) 530 {*strlenqi_1} (insn_list:REG_DEP_TRUE 6 (insn_list:REG_DEP_TRUE 12 (insn_list:REG_DEP_TRUE 14 (insn_list:REG_DEP_TRUE 15 (nil)))))
(expr_list:REG_DEAD (reg:SI 19 dirflag)
(expr_list:REG_DEAD (reg:SI 2 cx [69])
(expr_list:REG_DEAD (reg:QI 0 ax [70])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(expr_list:REG_UNUSED (reg/f:SI 68 [ suffix ])
(expr_list:REG_EQUAL (unspec:SI [
(mem:BLK (reg/v/f:SI 63 [ suffix ]) [0 A8])
(reg:QI 0 ax [70])
(const_int 1 [0x1])
(reg:SI 2 cx [69])
] 20)
(nil))))))))

In fact, here is a test case which demonstrates the problem; compile it on x86 with -O3,using gcc 4.2.1, 4.0.3 or 3.4.6 (I tested them all):
#include <stdlib.h>
#include <string.h>

register long foo asm("esi");
register long bar asm("edi");

char * crash_me_baby(char *str) {
char *path = malloc(1 + strlen(str));
return path;
}

I'm sick and tired of the gcc team's total unwillingness to support basic, documented, features. In this case, it is the global register variables which trigger the bug. However, the crash_me_baby() function does not use these variables at all, and in any case they are non-volatile registers which need to be saved/restored, so why the hell would it break gcc?

I submitted a bug to the gcc team. But I'm not holding my breath; I've already filed a report about the same issue a few years ago. This is a recurring problem which has been coming and going since the days of gcc 3.3.

The real way forward is to stop using register global variables. When the Factor VM no longer includes an interpreter, and all quotations are compiled, this will be possible without adversely affecting performance. It will also have the side benefit of making the Factor VM pure ANSI C, with some inline assembly. Which means that on Windows for example, we should soon be able to compile Factor with an alternative compiler, such as Microsoft's Visual Studio.

For now, I've found a workaround -- I had to write my own version of strlen().

Thursday, August 16, 2007

Parallel map and each

To ensure that Factor 0.90 is robust and stable, I'm setting up a "compile farm" for testing Factor on a variety of machines; this involves bootstrapping, loading all modules, running unit tests, and benchmarks.

Additionally, onn Mac OS X, the .app deployment tool is tested in an automated fashion. The deploy.app tool spawns a separate Factor process to ensure that deployment is done in a clean image, then blocks until this process completes. Since I'm on a quad CPU machine, I can save a lot of time by deploying all modules at the same time. I do this using Chris Double's extra/concurrency library. What I want to do is spawn a bunch of threads, then wait until all these threads complete. Each thread calls deploy.app to Factor process, then waits for this process to comoplete. As suggested by Chris, I can use concurrency futures for this.

The idea behind futures is simple. You call future with a quotation yielding a value; this spawns a process to compute the value. Then at a later time, you call ?future which waits until the value is ready, and returns this value.

We can use futures to write a parallel-map combinator:
: parallel-map ( seq quot -- newquot )
[ curry future ] curry map [ ?future ] map ;

This spawns a thread to compute each value, then waits until all values have been computed, collecting them into a new sequence.

We can test this out:
  [ { 1 2 3 4 } [ 1000 sleep 1 + ] parallel-map ] time .
1003 ms run / 0 ms GC time
{ 2 3 4 5 }

Four threads are spawned, and each one takes a second to complete. Compare with an ordinary map:
  [ { 1 2 3 4 } [ 1000 sleep 1 + ] map ] time
4001 ms run / 0 ms GC time
{ 2 3 4 5 }

(Incidentally, the reason the run time does not come out exactly at 1000 and 4000 ms, respectively, is not because a simple iteration over 4 elements takes 3 ms in Factor -- it doesn't -- but because the sleep word has poor resolution. This will be fixed at some point.)

In my specific use-case, I don't care about a return value, I just want to spawn a bunch of threads and wait until they're all done. So I can implement parallel-each in terms of parallel-map:
: parallel-each ( seq quot -- )
[ f ] compose parallel-map drop ;

We simply modify the quotation to yield a null value, then we parallel map this quotation across the sequence and drop the results (which should be a sequence consisting entirely of fs).

Now my automated deployment tester is easy to write:

{
"automata.ui"
"boids.ui"
"bunny"
"color-picker"
"gesture-logger"
"golden-section"
"hello-ui"
"lsys.ui"
"maze"
"nehe"
"tetris"
} [
"deploy-log/" over append <file-writer>
[ deploy.app ] with-stream
] parallel-each

The parallel-map and parallel-each combinators are now in extra/concurrency.

In this case, each thread simply spawns another OS process then waits for I/O to complete, so Factor's simple co-operative scheduler has no problem parallelizing the work. For compute-intensive tasks, this approach is not useful, since Factor cannot take advantage of pre-emptive threading or multiple CPUs.

Tuesday, August 14, 2007

Re-definitions, tuples, and constructors

Suppose you have a source file A.factor:
IN: A

GENERIC: some-generic ( obj -- )

M: some-class some-generic ... ;

And another source file B.factor:
IN: B

M: some-other-class some-generic ... ;

If we load A.factor, then B.factor, then load A.factor again, we would like it if the method defined in B.factor was still defined. So indeed, if the parser sees a re-definition of a generic word, it does not blow away the existing methods. This is an important feature, and it means I can change kernel.factor, for example, and reload it, without blowing away tons of methods on common generic words such as equal?. One expects that re-definitions are idempotent in the sense that re-entering an existing definition does not affect other definitions in the image.

However, tuples violated this principle. When a tuple class is defined with TUPLE:, the object system would give you a default constructor. You could then override this constructor with C:. (For details, see the Factor 0.89 tuples documentation.) Consider if source file A.factor defines a tuple class, and source file B.factor defines the constructor. Now, if we load A.factor followed by B.factor, then we make some changes to A.factor and reload it, but we do not reload B.factor, what happens is that when A.factor is reloaded, the custom constructor previously defined is blown away and replaced by the default one.

This can even be a problem with the tuple and the constructor are in the same source file; for example, in Factor 0.89, reloading sequences.factor in a running image would crash Factor, because in the middle of loading it, the slice class would be re-defined. While a custom constructor follows the definition immediately, the parser relies on slices internally, and before advancing on the next line and parsing the definition custom constructor, it would invoke the default constructor, which would result in incorrect behavior.

So, I've come to the conclusion that the traditional style of tuple constructor definitions is wrong. Now, TUPLE: doesn't give you a default constructor at all. If you want a default constructor which just fills in slot values from the stack, use the new form of C: which just takes a constructor word name and a tuple class name:
TUPLE: color red green blue ;

C: <color> color

Unlike before, the constructor word can have any name:
C: <rgba> color

Custom constructors are now defined as ordinary words. Indeed, the following two definitions are equivalent:
C: <rgba> color

: <rgba> color construct-boa

Here construct-boa is a low-level word taking a class; BOA denotes "by order of arguments", and "BOA constructor" is a pun on "Boa Constrictor", one of the more amusing pieces of terminology in use by the Lisp community.

The construct-boa word is more flexible than the old hard-coded constructor. For example, suppose you have a tuple:
TUPLE: action name counter ;

We wish for the default constructor to take a name from the stack, and store zero in the other slot. Previously, you'd write something like:
C: action ( name -- action )
[ set-action-name ] keep
0 over set-action-counter ;

While this isn't so bad, other non-default constructors would get complicated quickly, even if all they did was fill in slots from the stack and with default values. Now, you can just write:
: <action> ( name -- action )
0 action construct-boa ;

In addition to construct-boa, we have construct-empty:
TUPLE: person name address employer ;

: <empty-person> person construct-empty ;

<empty-person> .
T{ person f f f f }

Finally, there is construct, which takes a sequence of slot setter words and fills these in from the stack:
: <homeless-person> ( name -- person )
{ set-person-name } person construct ;

Previously, the only way to have multiple constructors for a tuple class was to define a general constructor with C:, which possibly took many inputs from the stack and required complex stack shuffling. Now, you can define multiple constructors, with arbitrary names, which do not necessarily share code.

Even more interestingly, you can write constructors parametrized on the class name; it doesn't have to be a literal value! For example, suppose you have a class of errors:
TUPLE: reactor-error reason ;

TUPLE: nuclear-meltdown severity ;

C: <nuclear-meltdown> nuclear-meltdown

We can write a word which throws a reactor error, given a reason:
: throw-reactor-error ( reason -- )
reactor-error construct-boa throw ;

However, we might find that this is poorly factored; if we have a bunch of reasons, such as terrorist-attack, nuclear-meltdown, software-error, etc, we'd have to define constructors for each (even just using C:), then construct an instance before passing it to reactor-error. Instead, we can settle on a standard construction strategy for reasons; for example, construct-boa, and perform construction in the throw-reactor-error word:
: throw-reactor-error ( ... reason-class -- )
construct-boa reactor-error construct-boa throw ;

Now, we can write:
"very severe" nuclear-meltdown throw-reactor-error

In this case, we don't reduce complexity by much; all we eliminated is the C: <nuclear-meltdown> nuclear-meltdown. However, in more complex cases involving delegation, being able to factor out construction logic in this way has been a real boon; I've been able to clean up and simplify a lot of constructors in the UI this way. Finally, there is one less concept to learn; the special C: syntax has been replaced by three ordinary words.

Sunday, August 12, 2007

Fixing a very old bug

Two and a half years ago, Mackenzie Straight reported a bug: stack effect inference would hang when given the following input:
[ [ dup call ] dup call ]

Of course this code hangs when evaluated anyway, but one does not expect a type checker to hang when given a non-terminating program!

For a long time, I had no idea how to fix this issue, and such patterns never came up in real code anyway, so I put it aside. Until today, this was probably the oldest bug which is still open.

Well, now it is fixed!

Some background information first...

Factor's stack effect inference is the basis of the compiler, and a key optimization performed at the inference level is lifting quotations up to their call site. Most of the time this works great, however in the example is rewritten as [ dup call ] [ dup call ] call, then [ dup call ] dup call, and it loops from there. The rewrite system used by the compiler was potentially non-terminating! Calls to curry are rewritten in a similar way, and similar non-termination can occur.

Using the results proven in The Theory of Concatenative Combinators, we can write a turing-complete basis for Factor:
drop
over
curry
>r
r>
call

Using these six words, together with all quotations (recursively) built up from them, we can express any computable function.

This presents a major problem. Because all of these operations -- stack shuffling, calling of quotations, and partial application -- are partially evaluated during stack effect inference, this means that determination of the stack effect of an arbitrary Factor expression becomes equivalent to the halting problem.

In the end, the fix was simple. Stack effect inference just gives up if a quotation calls itself directly or indirectly. Since the set of child quotations nested in a given input quotation is finite, the process either terminates, or eventually gives up with an error:
( scratchpad ) [ [ dup call ] dup call ] infer.
Word: [ dup call ]
The quotation [ dup call ] calls itself.
Stack effect inference is undecidable when quotation-level recursion is permitted.
Nesting: { [ dup call ] }

In real code, we don't do combinatory-logic-style quotation-level recursion; we just define named recursive words. In any case, somebody who wants to write odd stuff like M and Y combinators in Factor will simply have to put up with slower performance because such code will run in the interpreter instead of compiling.

Christopher Diggins's Cat language doesn't suffer from this problem, because his stack effect inferencer is more sophisticated than mine; it supports parametric polymorphism and row types, so quotations don't have to lifted up to their call site in order to determine the stack effect of a combinator. However, when Chris goes on to implement compiler optimizations, he'll find that quotation lifting can bring a huge performance benefit, just like lambda lifting in functional languages. In this case, the non-termination of quotation lifting still has to be considered.

Mixins

Consider typical message-passing OOP (with made up syntax):
class Rectangle {
state w, h;
method area() { ... }
method paint() { ... }
}

class Ellipse {
state major, minor;
method area() { ... }
method paint() { ... }
}

We can add new types of shapes easily, but we cannot define new operations on existing shapes without modifying existing code. In a language with algebraic data types, the situation is reversed (again, made up syntax):
data Shape = Rectangle w h | Ellipse major minor
area shape =
case shape of
Rectangle w h -> ...
Ellipse major minor -> ...

draw shape =
case shape of
Rectangle w h -> ...
Ellipse major minor -> ...

Here, we can define new operations on existing shapes but we cannot define new shapes and have the existing operations work with them.

Generic functions/words give us the best of both worlds. Here's the Factor syntax:
GENERIC: area ( shape -- )
GENERIC: draw ( shape -- )

TUPLE: rectangle w h ;
M: rectangle area ... ;
M: rectangle draw ... ;

TUPLE: ellipse major minor ;
M: ellipse area ... ;
M: ellipse draw ... ;

We can define new shapes implementing the area and draw generic words, and we can define new generic words which implement methods for rectangle and ellipse, without coupling anything together.

Now let's go back to our first message-passing OOP example, and suppose this language supports multiple inheritance of mixins. There are many definitions of what a mixin is, but for now, assume a mixin is a class with no state, just behavior.
class Shape {
abstract method triangulate();

method paint() {
... triangulate the shape with triangulate(),
then render it with a standard algorithm ...
}
}

class Rectangle < Shape {
state w, h;
method area() { ... }
method triangulate() { ... }
}

class Ellipse < Shape {
state major, minor;
method area() { ... }
method triangulate() { ... }
}

We can define new shapes which mixin the Shape mixin, but we cannot add additional mixins to the Rectangle and Ellipse data types.

In traditional Factor, the rough equivalent of the above would be to use a union class:
GENERIC: area ( shape -- )
GENERIC: triangulate ( shape -- triangle-seq )
GENERIC: draw ( shape -- )

TUPLE: rectangle w h ;
M: rectangle area ... ;
M: rectangle triangulate ... ;

TUPLE: ellipse major minor ;
M: ellipse area ... ;
M: ellipse triangulate ... ;

UNION: shape rectangle ellipse ;
M: shape draw ... triangulate, then draw the triangles ... ;

Now we can certainly define new union classes, and make either rectangles or ellipses instances of these union classes, but we cannot extend the shape union class with new shapes.

Until now, the accepted workaround ("design pattern") would be to not define a shape class at all, and instead define a draw method on the object class:
M: object draw ... triangulate, then draw the triangles ... ;

M: beizer-curve draw ... override default implementation ... ;

However, this was crufty. Defining methods on the object class is clutter. Plus, if you lose the shape union, you no longer have a way to distinguish shapes from other objects.

Enter Factor's new mixin classes feature.

Just like generic words generalize message passing OOP and case-based pattern-matching, Factor's mixin classes generalize multiple behavior inheritance and union classes.

We can define a new mixin as follows:
MIXIN: shape

We can specialize generic words on this mixin:
M: shape draw ... default implementation in terms of triangulate ...

But here's the kicker: unlike a union, where all members must be listed up-front, we can declare an existing class to be an instance of an existing mixin:
INSTANCE: rectangle shape

INSTANCE: ellipse shape

I successfully used mixins to remove some repeated code between the various implementations of sequences and associative mappings. Until now, there was no way to specialize a generic word on all sequences; now this is possible:
GENERIC: foo

M: sequence foo ... ;

A mixin is essentially a suite of methods; there is some similarity between Haskell typeclasses and Factor mixins. I'm still discovering the various applications of mixins; after Factor 0.90 is released, I intend to apply them to our I/O stream implementations and remove some boilerplate there.

So to recap, Factor's object system allows the following operations without forcing unnecessary coupling:
  • Defining new operations over existing types
  • Defining existing operations over new types
  • Importing existing mixin method suites into new types
  • Importing new method suites into existing types
  • Defining new operations in existing mixin method suites
  • Defining new mixin method suites which implement existing operations

This is a lot more general than mainstream message passing OOP and allows a wider range of problems to be expressed directly in terms of method dispatch.

Saturday, August 11, 2007

Going to Austin

I will be in Austin, Texas from the 18th of August to the 26th. I'll be hanging out and hacking code with Doug and Eduardo. If anyone else in the Austin area wants to meet up, send me an e-mail.

Friday, August 10, 2007

Smalltalk is dying

I just saw that ObjectArts has discontinued development of Dolphin Smalltalk.

It seems Smalltalk is going the way of the dodo bird. Basically, we're left with a rapidly dwindling number of commercial implementations which are way overpriced considering their relatively low level of polish and functionality; and the only open source option is Squeak, which is not production-ready and never will be. Then there are various dead and semi-dead pre-alpha-stage projects. One commercial Smalltalk which stands out from the rest is Ambrai, but it is in early development, and it is unlikely a proprietary Mac OS X-only product will attract a significant base of library contributions. Perhaps Pepsi will get somewhere, who knows.

A pity, really, considering how computing might look today if things went differently in the early days of Smalltalk.

Update: I wasn't talking about popularity at all. It doesn't matter if Smalltalk isn't popular; for some reason, popular languages tend to suck (Java, Ruby, COBOL, etc.) The problem with Smalltalk is that the implementations are in terrible shape. The closest thing to a usable open source implementation is Squeak, and it leaves a lot to be desired. Common Lisp is an example of a less popular language which doesn't suffer from this problem; open source Lisp implementations are fantastic.

New suite of benchmarks

I spent a bit of time cleaning up various benchmarks and putting them over a common framework. Now you can do "benchmark" run to run a suite of benchmarks. The results are collected and printed in a table:
Benchmark                    Run time (ms) GC time (ms)
benchmark.continuations 473 3
benchmark.empty-loop 480 0
benchmark.fib1 194 0
benchmark.fib2 855 0
benchmark.fib3 1418 0
benchmark.fib4 3334 60
benchmark.fib5 2126 11
benchmark.iteration 8829 23
benchmark.mandel 4421 77
benchmark.nsieve 2221 3
benchmark.nsieve-bits 65679 239
benchmark.partial-sums 26293 183
benchmark.raytracer 26031 242
benchmark.recursive 36740 233
benchmark.reverse-complement 15637 167
benchmark.ring 9124 79
benchmark.sort 1934 38
benchmark.spectral-norm 32142 991
benchmark.sum-file 287 0

I want to set up a continuous testing cluster with a number of machines. The machines would perform the following tasks:
  • Build Factor
  • Run unit tests
  • Run benchmarks
  • Create deployment images for various modules and upload them to a "Factor application central" web site

Tuesday, August 07, 2007

Named local variables and lexical closures in Factor

I previously wrote about adding named parameters to Factor as an extension library. Well, now I've beefed up this code with support for lexical closures, and threw it in extra/locals.

Hopefully, this module can now replace the less efficient rewrite-closures library by Eduardo Cavazos.

Here is an example word using locals:
:: add-test | x y z | x y + z + ;
1 2 3 add-test .
6

The syntax is simple; to define a word which uses locals, you use :: instead of :. Then you follow the word name with a list of locals enclosed in | characters, and use these locals anywhere in the body of the word.

Closure conversion is performed:
:: map-test | seq inc | seq [ inc + ] map ;
{ 10 20 } 5 map-test .
{ 15 25 }

Here, we're using the "inc" local inside the quotation passed to map. The :: word performs all the necessary rewriting to make this work. The key to making this efficient is compiled curry.

You can also define lambdas, which are like quotations except inputs are placed in local variables:
:: map-test | seq inc | seq [| elt | elt inc + ] map ;
{ 10 20 } 5 map-test .
{ 15 25 }

The lambda [| elt | elt inc + ] is equivalent to the quotation [ inc + ], and in both cases closure conversion is performed by the :: word.

Lambdas can be used in normal colon definitions, too:
: foo-bar 1 2 [| x y | x y * ] lambda call ;
foo-bar .
3

The lambda word converts a lambda into code for producing a quotation. If the input to lambda is literal, the code transformation is done at compile time.

So how does the closure conversion actually work? Consider the following code:
[| a b | a [| c | c b + ] map ]

The inner quotation refers to the free variable b. We can rewrite it as follows:
[| a b | a b [| c b | c b + ] curry map ]

That is, we add all free variables to the list of inputs, then fetch these free variables right before pushing the quotation; after the quotation is pushed, we use curry to partially apply the quotation to the free variables. All lambdas and quotations are rewritten this way, as we lift free variables up. In the above example, we have already arrived at a form with no free variables anywhere. Now, we convert to point-free form. This is done in a brute-force way; within a lambda, all locals are stored on the retain stack. Here is the final result of closure conversion applied to the above example:
[
>r
>r r>
r>
dup
>r
>r r>
dup
>r
[ >r >r r> r> dup >r >r r> dup >r + r> drop r> drop ] curry
map r>
drop r>
drop
]

Looks rather complicated, but most of the stack shuffling is optimized away by the compiler.

This vocabulary is not in the darcs repository yet, because it depends on compiled curry, which I'm still in the process of debugging. However, it should work with the latest code from darcs, except words using closures won't compile. I put the code up on LispPaste.

In only 121 lines, I was able to add efficient lexical scoping to Factor. Even though I encourage people to try to express everything in a clean way using the stack only, sometimes locals are handy (complex math formulas, bizarro native APIs taking 11 parameters each, etc). Having more options is always good. Plus, this library makes a great demo of how to extend the language in a radical way without altering the core implementation!

Monday, August 06, 2007

"er" words

Take a look at the new implementation of subset:
: pusher ( quot -- quot accum )
V{ } clone [ [ push-if ] 2curry ] keep ; inline

: subset ( seq quot -- subseq )
over >r pusher >r each r> r> like ; inline

The idea of having a pusher word was originally suggested by Eduardo Cavazos. This word takes a predicate quotation and yields two values, the first being a quotation taking an object as input, and the second being a vector. If you call the quotation with an object for which the predicate yields true, the object is pushed on the vector. The generated quotation "closes over" the vector. Now the only thing subset has to do is pass the pusher quotation to each, and hide the pusher vector on the retain stack until after the iteration is done. When iteration has completed, the pusher vector contains all elements which passed the predicate.

I think there is a very interesting idiom here, and it is worth exploring this idea more.

Also, note that while from the outside, subset is referentially transparent, it uses mutable state internally. While it may be possible to write something just as simple without mutation, I think in many cases, being able to mutate considerably simplifies code; certain algorithms are just more naturally expressed with mutation, at least in a stack language, and integrating abstractions such as monads or uniqueness types into a stack language is still an open research problem (but Christopher Diggins is working on it...) There is something to be said for allowing different programming styles in a language, instead of forcing a single style on the programmer (as in Haskell and other purely functional languages, or even Smalltalk's "everything-is-message-passing-OOP".)

Efficient partial function application (aka compiled curry)

Note: the code described here is not in darcs yet, but will make it there within the new few days after I fix a few residual bugs.

Ever since quotations became a first-class data type, Factor has had a curry word:
  3 [ + ] curry .
[ 3 + ]

This is essentially a partial function application. (I'm aware that currying and function application is not the same, but this word was named before I was aware of that fact, and anyway, curry reads better than papply or partial.)

Previously, this was an expensive word to use. Not only did it allocate a new quotation and copy the existing quotation's elements, but words which called quotations produced by curry did not compile, and had to run in the interpreter; this is due to the compiler's design, which "lifts" all quotations up to their call site.

Now, this has all changed. For interpreted code, curry now runs in constant time, not linear time proportional to the quotation's length, because it allocates a small object holding the object together with the quotation. These objects are instances of a curry class, but they print like quotations and are equal to quotations having the same elements, so they're essentially identical:
3 [ + ] curry [ 3 + ] = .
t

When a curried quotation is called by call in the interpreter, the VM simply pushes the curried object first, then calls the quotation.

In compiled code, we can do even better now. The compiler knows about curry and applies a rewrite rule to the dataflow graph. Essentially, it replaces all occurrences of curry's output value with a pair of values, being the object and quotation; it also "widens" all stack shuffling from the point where the curry is created to where it is called. If the curry is passed to an ordinary word, the compiler inserts a real call to curry which allocates memory; however, the key idea behind this rewrite rule is that curry instantiation is "lazy", and is not done at all if the curry is simply passed downward to a combinator.

For example, consider this example:
: foo ( x y -- x' y' ) 3 [ + ] curry 2apply ;

This word adds 3 to the top two stack elements. The compiler first inlines the 2apply word:
: foo ( x y -- x' y' ) 3 [ + ] curry tuck >r >r call r> r> call ;

Now it applies the rewrite rule:
: foo ( x y -- x' y' ) 3 [ + ] abc--bcabc >r >r >r call r> r> r> call ;

The compiler represents stack shuffles in symbolic form, internally; I'm writing abc--bcabc to mean a shuffle with that effect, for which no word exists in the library.

Now, it lifts the quotation up to the call site:
: foo ( x y -- x' y' ) 3 [ + ] abc--bcabc >r >r >r drop + r> r> r> drop + ;

Finally, it removes the literal value [ + ], since it is now dead; it travels around the stack without ever being used for anything, and is dropped:
: foo ( x y -- x' y' ) 3 ab--bab >r >r + r> r> + ;

So in this case, the compiler will produce the exact same code for the following two snippets:
3 [ + ] curry 2apply
3 tuck >r >r + r> r> +

While [ + ] curry 2apply is not a useful thing to write, since the direct expansion is simpler, it does demonstrate how the compiler is able to remove the object allocation from this code.

So, curry is efficient now, both in the interpreter and compiler. But what does this mean?

The role played by curry here is somewhat like a Lisp lambda which closes over one free variable. We can use curry to package up a bunch of values, pass them to the combinator, and fiddle with them from inside a quotation. Until now, you either had to pay a performance penalty for using curry, or you would use stack juggling tricks. For example, consider the each and 2each combinators. Here is the current implementation:
: (each) ( seq quot i -- seq quot i )
[ rot nth-unsafe swap call ] 3keep ; inline

: each ( seq quot -- )
over length [ (each) ] repeat 2drop ; inline

: (2each) ( quot seq seq i -- quot seq seq i )
[ 2nth-unsafe rot dup slip ] 3keep ; inline

: 2each ( seq1 seq2 quot -- )
-rot 2dup min-length [ (2each) ] repeat 3drop ; inline

Here is an implementation using curry:
: (each) ( seq quot -- n quot' )
>r [ length ] keep r>
[ >r nth-unsafe r> call ] 2curry ; inline

: each ( seq quot -- )
(each) each-integer ; inline

: (2each) ( seq1 seq2 quot -- n quot' )
>r [ min-length ] 2keep r>
[ >r 2nth-unsafe r> call ] 3curry ; inline

: 2each ( seq1 seq2 quot -- )
(2each) each-integer ; inline

Note that each-integer is the new version of repeat, with a slightly altered stack effect.

In the old code, the quotation passed to repeat has to manually save and restore various values that it needs; now that curry is efficient, we just package those values up with the quotation before passing it along to each-integer.

It is somewhat ironic that while Factor is as "concatenative" language, until now concatenating quotations together was discouraged, at least where performance was important! But now, we can easily write an efficient compose word which takes a pair of quotations:
: compose [ >r call r> call ] 2curry ; inline

For example,
[ 2 2 ] [ + ] compose .
[ [ 2 2 ] [ + ] >r call r> call ]

It is pretty clear that this quotation has the same effect when called as [ 2 2 + ]. The difference between compose, which only works on quotations and produces a funny result, and append, which works on all sequences, is that since compose is built from curry, it runs in O(1) time, and the compiler is able to eliminate its runtime cost altogether.

For example, consider the assoc-each combinator. It is implemented in terms of the assoc-find combinator, since assoc-find is the only primitive combinator each assoc instance must implement. On each iteration, we want to call the quotation, but always output f, forcing assoc-find to continue the iteration until the end. Formerly, assoc-each was implemented as follows. First, we had a utility word, assoc-find-with, which called assoc-find while retaining a parameter on the stack. It was in implemented using a assoc-with word which was also used for assoc-each-with, and so on:
: assoc-with 2swap [ >r -rot r> call ] 2keep ; inline

: assoc-find-with ( obj assoc quot -- key value ? )
swap [ assoc-with rot ] assoc-find
>r >r 2nip r> r> ; inline

: assoc-each ( assoc quot -- )
swap [ rot call f ] assoc-find-with 3drop ; inline

This is ugly as hell, and essentially boilerplate. Here is the new way:
: assoc-each ( assoc quot -- )
[ f ] compose assoc-find 3drop ; inline

That's it; no need to define -with words, and no unnecessary stack shuffling. We exploit the "concatenative" nature of Factor here -- composition of sequences is the same as composition of functions!

Which brings me to the next topic; all the "manually curried" combinators you know and love, such as each-with, map-with, and so forth, have been moved to a deprecated vocabulary. They're going away soon. Updating code is easy.
swap [ swap XYZ ] each-with  ===  [ XYZ ] curry each
[ XYZ ] each-with === [ XYZ ] curry* each

Same for map-with, subset-with, and so on. The curry* word is just partial application on the second stack element; it is defined in terms of curry.

All the sequence and assoc combinators have been greatly simplified. In fact, I was able to move the 2swap word out of the core and into extra/shuffle; stack shuffling has been simplified by curry to such an extent that in particular, 2swap is simply not needed anymore. Other more complex shuffle words such as roll and pick are used less frequently now, as well.

By lowering the cost of an abstraction, I was able to simplify code by a non-trivial amount. There's a general theme here: the goal of a compiler is to make idiomatic code run fast. It is not good enough to only accelerate hand-unrolled, tuned code chock-full of irrelevant detail and declarations. Nobody wants to write unnecessarily complex code just to please the compiler.