Wednesday, January 31, 2007

Formal semantics for the Cat programming language

Christopher Diggins has written a paper defining formal semantics for the Cat programming language. I've mentioned Cat here several times; it is currently at version 0.9.9.

ICFP '06 universal machine ported to Factor

Gavin Harrison has written a Factor implementation of the ICFP 2006 contest program. A bit late for the competition, but still fun :-) This is a bytecode interpreter for a simple instruction set. You can find the code in apps/icfp-2006/icfp-2006.factor. It runs quite slowly, but that's a good thing -- it gives me yet another computationally-heavy benchmark to measure compiler improvements against.

ARM port day 2

I worked on the ARM assembler today; it now supports all ARM condition codes (almost every instruction is conditional), as well as "addressing mode 1" instructions (arithmetic), multiplication, and "addressing mode 2" instructions (single value load/store). Multiple load/stores and a few other rarely-used instructions are still missing.

The last time I was porting Factor to a new platform, the compiler worked differently; it would write generated machine code directly to memory, which made the assembler impossible to unit test. Now, the assembler just appends machine code to the array being built by an enclosing call to make, making it very easy to unit test. I did test-driven development today: I would use the GNU assembler to assemble bits of code, add a unit test asserting that Factor generated the same machine code for the given input, and code away until the tests passed.

ARM assembly is quite interesting, with lots of operand and addressing modes. The instruction encoding also involves lots of bit fields, so I had to come up with a new abstraction to put together integers with shifts and ors.

The PowerPC instruction format is not as complicated as ARM, so the PowerPC assembler used to just have words which would shift and or values by hand:
: insn ( operand opcode -- ) 26 shift bitor , ;

: a-form ( d a b c xo rc -- n )
>r 1 shift >r 6 shift >r 11 shift >r 16 shift >r 21 shift
r> bitor r> bitor r> bitor r> bitor r> bitor ;

: b-form ( bo bi bd aa lk -- n )
>r 1 shift >r 2 shift >r 16 shift >r 21 shift
r> bitor r> bitor r> bitor r> bitor ;

: d-form ( d a simm -- n )
HEX: ffff bitand >r 16 shift >r 21 shift r> bitor r> bitor ;

: sd-form ( d a simm -- n ) swapd d-form ;

: i-form ( li aa lk -- n )
>r 1 shift bitor r> bitor ;

: x-form ( a s b xo rc -- n )
>r 1 shift >r 11 shift >r swap 16 shift >r 21 shift
r> bitor r> bitor r> bitor r> bitor ;

: xfx-form ( d spr xo -- n )
1 shift >r 11 shift >r 21 shift r> bitor r> bitor ;

: xo-form ( d a b oe rc xo -- n )
>r 1 shift >r 10 shift >r 11 shift >r 16 shift >r 21 shift
r> bitor r> bitor r> bitor r> bitor r> bitor ;

This was not too bad for PowerPC, but for ARM this strategy would have been unmanageable from the start. Here is what the same part of the PowerPC assembler looks like with the new abstraction:
: insn ( operand opcode -- ) { 26 0 } bitfield , ;
: a-form ( d a b c xo rc -- n ) { 0 1 6 11 16 21 } bitfield ;
: b-form ( bo bi bd aa lk -- n ) { 0 1 2 16 21 } bitfield ;
: s>u16 ( s -- u ) HEX: ffff bitand ;
: d-form ( d a simm -- n ) s>u16 { 0 16 21 } bitfield ;
: sd-form ( d a simm -- n ) s>u16 { 0 21 16 } bitfield ;
: i-form ( li aa lk -- n ) { 0 1 0 } bitfield ;
: x-form ( a s b xo rc -- n ) { 1 0 11 21 16 } bitfield ;
: xfx-form ( d spr xo -- n ) { 1 11 21 } bitfield ;
: xo-form ( d a b oe rc xo -- n ) { 1 0 10 11 16 21 } bitfield ;

This expresses the intent of the code much more clearly. The ARM assembler uses much more complicated bitfield specifiers, such as:
: (BX) ( Rm l -- )
{
{ 1 24 }
{ 1 21 }
{ BIN: 111 16 }
{ BIN: 1111 12 }
{ BIN: 1111 8 }
{ 1 4 }
{ register 0 }
} insn ;

In the above word, we are building a bit field where some values come from the stack, some are literal, and the last one is obtained by applying the register word to a stack value. Writing this out by hand would be a pain in any language. Fortunately Factor makes it very easy to build mini-DSLs like this.
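For illustration, the bitfield idea can be sketched in Python (the function and spec format here are my own invention, not Factor's actual DSL): each field spec pairs a value source with a shift amount, and all the fields are or'd together into one word.

```python
def bitfield(values, spec):
    """OR together fields, each shifted into position.

    `spec` is a list of (source, shift) pairs. A source of None takes
    the next value from `values`; a callable is applied to the next
    value first; anything else is a literal field.
    """
    word = 0
    vals = iter(values)
    for source, shift in spec:
        if source is None:
            field = next(vals)
        elif callable(source):
            field = source(next(vals))
        else:
            field = source
        word |= field << shift
    return word

# Analogous to the PowerPC d-form: ( d a simm -- n ) { 0 16 21 } bitfield
def d_form(d, a, simm):
    return bitfield([simm & 0xFFFF, a, d],
                    [(None, 0), (None, 16), (None, 21)])
```

Literal fields, stack values, and transformed values all fit the same spec shape, which is what makes specifiers like the one in (BX) above compact.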

Here are some ARM assembly instructions, with GNU and Factor syntax side by side:
sub ip, fp, #4               IP FP 4 SUB
addeqs r0, ip, r9, lsl #2    R0 IP R9 2 <LSL> S ?EQ ADD
ldr r1, [r5, #-4]            R1 R5 4 <-> LDR
str r1, [r5, #8]!            R1 R5 8 <!+> STR

As you can see, it looks a bit funny, but remember that the whole point of this exercise is to write an assembler library which is called dynamically by the compiler to emit code; this is not for users who want to write applications in assembly.

Another thing I did was sort out how to flush the instruction cache on ARM (thanks to Mackenzie Straight aka eiz in #concatenative), so I provided a proper implementation of the flush_icache function, which is called by the Factor VM to flush the instruction cache when a new compiled code block is added. On PowerPC, this is implemented in assembly, but on ARM only the kernel is permitted to flush the instruction cache so it has to go through a system call. Also, this system call is not actually exported by glibc. However it is easy enough to call it directly using macros from asm/unistd.h:
#define __NR_cacheflush __ARM_NR_cacheflush
INLINE _syscall3(void,cacheflush,void *,start,void *,end,unsigned long,flags);

INLINE void flush_icache(CELL start, CELL len)
{
	cacheflush((void *)start,(void *)(start + len),0);
}

Tomorrow, I'll do some more work on the assembler and see what other instructions are needed by the backend. I'll also start the backend proper and have Factor compiling simple (and perhaps complex) words. And of course I'll document what I did in a blog entry, just like I did today and yesterday.

Monday, January 29, 2007

ARM port day 1

I've started porting Factor to the ARM architecture, used in embedded devices, handhelds, and phones. Doug asked me to take notes during the porting process, to help anybody who decides to port Factor in the future. At some point I'll be documenting the VM and compiler backend in some detail, so don't expect these notes to be too detailed.

The first step is to get the VM to compile and run a boot image in interpreted mode. The VM mostly consists of portable C, but there are at least three architecture-specific details which need to be filled in: getting the native stack pointer, the native stack frame format, and flushing the instruction cache. I haven't done any research as to whether the latter is necessary, nor have I studied the ABI extensively yet, so I only filled in the first item for now. All in all, I added the following new files, together with a linux-arm Makefile target:

While testing, I ran into a major problem. A data stack underflow would send the runtime into an endless cycle of signal 11 errors, at seemingly random addresses! This is not what is supposed to happen. When the Unix signal handler receives a memory access violation, it pulls the native stack pointer at the time of the fault out of the signal context, then walks the native stack for compiled Factor frames. Since the ARM port doesn't compile yet this should be a no-op, but in fact the native stack pointer from the faulting context looked like a seemingly arbitrary address.

After some head-scratching, it really seemed like the ucontext_t structure had a different layout in the glibc headers than what the kernel was giving us, and Googling confirmed my suspicions. It turns out this is a known problem, and it is already fixed in glibc trunk, but the fix has not made it into Debian stable yet. Less conservative ARM distros might already have the latest headers. If yours doesn't, you can apply the patch in the e-mail linked above with the following sequence of commands:
cd /usr/include/sys
cat >ucontext.patch
copy and paste the patch here
patch -p8 < ucontext.patch

Alternatively, upgrade your entire libc, but that is not for the faint of heart.

With signal handling working, I ran into a second problem: floating point values saved in the boot image were corrupt. A quick call to double>bits .b revealed that the most significant and least significant cells of the float were swapped from what ARM expects. Fixing this was easy: I added a new bootstrap profile for ARM chips:

Now, all unit tests except those testing the compiler pass. I stubbed out the compiler backend but didn't write any substantial code for it yet:

For the last several days, I have been studying ARM assembly language, and tomorrow I will begin writing the assembler. Hopefully by the end of the week, I'll have everything except FFI calls compiling. Getting FFI and callbacks right is always the trickiest part, but with ARM I don't have to deal with floating point registers so perhaps it might even be easy.

Anyway, today's efforts were a success:
$ ./f
Factor 0.88 on linux/arm

Thursday, January 25, 2007

Memoization in Factor

I implemented a simple memoization library in libs/memoize. If you're not familiar with this term, read about Memoization on Wikipedia.

This library can create memoized forms of words taking between 0 and 4 inputs and one or more outputs. Usage is very simple -- just replace : with MEMO:, and ensure you have a stack effect declaration.

Here is the old inefficient fibonacci:
: fib ( m -- n )
dup 1 <= [ drop 1 ] [ dup 1 - fib swap 2 - fib + ] if ;

[ 34 fib ] time
1842 ms run / 0 ms GC time

Now let's define a memoized version:
MEMO: fib ( m -- n )
dup 1 <= [ drop 1 ] [ dup 1 - fib swap 2 - fib + ] if ;

[ 34 fib ] time
0 ms run / 0 ms GC time

Well, that's a nice improvement, 1842/0 = infinity times faster ;-) And only one change was necessary: the replacement of : with MEMO:.
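The mechanism behind MEMO: can be sketched in Python, where a decorator plays the role of the defining word (this shows the general technique only, not Factor's implementation):

```python
import functools

def memo(fn):
    """Cache fn's results, keyed on its arguments."""
    cache = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

# same naive definition as above, memoized by one annotation
@memo
def fib(m):
    return 1 if m <= 1 else fib(m - 1) + fib(m - 2)
```

As with MEMO:, the definition itself is untouched; the exponential blowup disappears because each subproblem is computed once and then looked up.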

I decided to implement this after seeing Dan's ONCE: word. Basically it's a restricted form of memoization, where you have a word which performs a lengthy computation on no inputs the first time it is called, and caches the result for subsequent calls:
ONCE: big-number 1000 factorial ;

This was used in the XML parser to build up fast lookup bit sets classifying Unicode characters from a more declarative description (see char-class.factor).

My MEMO: word generalizes ONCE:; if the word being memoized has no inputs, then it simply caches the result the first time it is called.

Take a look at libs/memoize/memoize.factor; I think it is pretty elegant.

Update: Dan contributed some documentation, and made this work with words producing more than one output. Thanks, Dan.

Sunday, January 21, 2007

Removing duplicates, boolean arrays, ARM

Comp Sci lesson learned: For some reason I was always under the impression that one cannot remove duplicates from an array while preserving order in O(n) time. This is why Factor had two words, prune and hash-prune.

The former preserved order and ran in quadratic time; it would loop over the input sequence, and see if each element was already present in an accumulator, if not, it would add it to the accumulator.

The latter did not preserve order and ran in linear time; it would add all elements as keys in a new hashtable, then output the sequence of keys.

Well, it turns out there is a way to remove duplicates while keeping order in O(n) time. As you inspect each element of the input sequence in turn, you check if it is present in a hashtable; if it is, you keep going, otherwise you push the element onto an accumulator and add it to the hashtable. I guess this is obvious to most people, but for some reason I was not aware of this algorithm and did not even think it was possible. The implementation is simple enough; I changed the prune word to use this algorithm, and removed hash-prune:
: (prune) ( hash vec elt -- )
rot 2dup hash-member?
[ 3drop ] [ dupd dupd set-hash swap push ] if ; inline

: prune ( seq -- newseq )
dup length <hashtable> over length <vector>
rot [ >r 2dup r> (prune) ] each nip ;
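The same algorithm reads naturally in Python, with a set standing in for the hashtable (a sketch for comparison, not a translation of the Factor code):

```python
def prune(seq):
    """Remove duplicates in O(n), preserving first-occurrence order."""
    seen = set()   # the "hashtable": membership tests in O(1)
    out = []       # the accumulator
    for elt in seq:
        if elt not in seen:
            seen.add(elt)
            out.append(elt)
    return out
```

The single pass does one hash lookup and at most one insertion per element, which is where the linear bound comes from.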

A new data structure: The Unix native I/O code has had some words for managing a space-efficient bit array for quite some time. These words were not really useful outside of Unix I/O, and were not safe either. I cleaned them up and made them more useful; now Factor has a boolean-array class which supports the sequence protocol and behaves just like an array, except it can only store t or f, and it uses much less space. Dan asked for this to make the XML parser more efficient, and it was about time Factor got a boolean array type, anyway.

ARM: Doug Coleman got me a Netstix ARM computer as an early birthday present. Thanks Doug! I was going to buy one anyway when I felt like porting Factor to ARM, but I guess he really wanted an ARM port now :-) So I've decided to make this port a priority for 0.88. I'm still finishing some compiler changes and setting up the software on the Netstix, but once that is done I will start working on the port. Running on ARM will be nice, because it will open up the world of cellphones and handhelds to Factor. Especially because Factor is both high level and fast.

Friday, January 19, 2007

Converting an Excel file to CSV - Factor makes XML fun!

Today I was sent a rather important Microsoft Excel document. Unlike older binary formats, this was an XML file, and the version of OpenOffice I had installed on my Debian stable Linux workstation was too old to cope with it. Rather than upgrading OpenOffice on the Debian box, or installing OpenOffice on my Mac, I decided to write a Factor program to convert the data to CSV (comma-separated values), which OpenOffice can open for sure. Just like my previous foray into parsing XML with Factor, writing this code only took a few minutes, including testing, in no small part due to Daniel's excellent documentation.
REQUIRES: libs/xml ;
USING: io sequences xml xml-utils ;
IN: msxml>csv

: print-csv ( table -- ) [ "," join print ] each ;

: (msxml>csv) ( xml -- table )
"Worksheet" find-tag
"Table" find-tag
"Row" find-tags [
"Cell" find-tags [
"Data" find-tag children>string
] map
] map ;

: msxml>csv ( infile outfile -- )
<file-writer> [
<file-reader> read-xml (msxml>csv) print-csv
] with-stream ;

Of course this only handles a very tiny subset of the Excel file format, but it was good enough to convert the file to CSV without losing any information.
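For comparison, the same traversal can be sketched with Python's standard library (the function names and the namespace-stripping helper are mine, and real SpreadsheetML has more structure than this handles):

```python
import csv
import io
import xml.etree.ElementTree as ET

def local(tag):
    # strip a namespace prefix like "{urn:...}Row" down to "Row"
    return tag.rsplit("}", 1)[-1]

def find_tags(elem, name):
    # all descendants (and elem itself) whose local name matches
    return [e for e in elem.iter() if local(e.tag) == name]

def msxml_to_csv(xml_text):
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    for row in find_tags(root, "Row"):
        writer.writerow([find_tags(cell, "Data")[0].text or ""
                         for cell in find_tags(row, "Cell")])
    return out.getvalue()
```

Matching on local names is the quick-and-dirty equivalent of passing plain strings to find-tag: the namespaces are simply ignored.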

The XML file used XML namespaces, but the parser handled that without problems. Daniel was very clever by allowing the programmer to either specify simple strings, or fully-qualified name tuples when searching for tags; allowing quick-and-dirty code to proceed as if the namespaces weren't there, but supporting the full XML specification when one needs it.

The (msxml>csv) word looks pretty in its current state; however, for more complex XML processing, I can easily imagine searching for tags getting out of hand very quickly. An XPath-like DSL would be a welcome addition to libs/xml, but I need to spend more time working with XML in Factor before I can come up with a Factorish solution. None of the XPath-like solutions I've seen in other languages were really satisfactory.

It is also interesting that the program above contains no stack manipulation at all.

Faster performance on the reverse-complement benchmark

A while ago, I did some Factor I/O benchmarking. At the time (around version 0.83), Factor's performance on the benchmark was poor, with a total run time of 170 seconds, 120 of which were spent reading the input file line by line. The input file in question is 25 MB in size, consisting of 416671 lines of text.

Before I began optimizing today, current Factor ran the benchmark in 140 seconds; a slight improvement from all the compiler work between 0.83 and 0.88, but still unacceptable. The first thing I did was reimplement how line reading is done. Previously, the stream-readln word would call stream-read1 to read data a character at a time, and form lines. While input is buffered, this nevertheless incurred considerable overhead, because of generic dispatch, bounds checks on the I/O buffer, and other factors.

To rectify this, I added a new stream-read-until generic word which reads data from a stream until it reaches a character appearing in a given sequence of separators. The Unix native I/O code has an efficient implementation of this word which scans the input buffer for separators, without a large amount of indirection and redundant work on every iteration. I rewrote the line reading code to call this word with a separator sequence of "\r\n" (plus some extra logic to handle Windows-style \r\n line endings). This reduced the time to read the file line by line from 90 seconds to 3, and the total benchmark runtime went from 140 seconds to 80 seconds.

A nice improvement, but where was the remaining time being spent? It turns out the compiler was not inferring all the type information that it could. Right now, the Factor compiler only infers types inside a word definition, and assumes that non-inlined calls to other words always produce objects of the most general type (with a few exceptions for primitives and certain low-level words of which the optimizer has special knowledge). While I plan on adding global type inference in 0.89, for now marking the <sbuf> constructor word inline was sufficient to give the compiler enough type information in the inner loop of the benchmark to cut the run time down to 33 seconds.
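The buffer-scanning idea behind stream-read-until can be sketched in Python (the class and method names here are mine, not Factor's): instead of dispatching once per character, we search the buffer in bulk for the nearest separator.

```python
import io

class BufferedReader:
    def __init__(self, stream, bufsize=8192):
        self.stream = stream
        self.bufsize = bufsize
        self.buf = b""
        self.pos = 0

    def _fill(self):
        # refill the buffer; returns False at end of stream
        self.buf = self.stream.read(self.bufsize)
        self.pos = 0
        return bool(self.buf)

    def read_until(self, separators):
        """Read until any byte in `separators`; return (data, sep).

        sep is None when the end of the stream is reached.
        """
        chunks = []
        while True:
            if self.pos >= len(self.buf) and not self._fill():
                return b"".join(chunks), None
            # scan the whole buffer at once for the nearest separator
            hits = [i for i in (self.buf.find(s, self.pos)
                                for s in separators) if i != -1]
            if hits:
                i = min(hits)
                chunks.append(self.buf[self.pos:i])
                self.pos = i + 1
                return b"".join(chunks), self.buf[i:i + 1]
            chunks.append(self.buf[self.pos:])
            self.pos = len(self.buf)

r = BufferedReader(io.BytesIO(b"foo\nbar\r\nbaz"), bufsize=4)
```

A readln built on top of this would call read_until with separators [b"\r", b"\n"] and fold a \r\n pair into a single line ending.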

33 seconds is still far too slow for this problem -- as I wrote in the original weblog entry where I noticed the lousy I/O performance, on my machine Java runs the benchmark in 3 seconds. However, I was able to cut the runtime by more than a factor of four with relatively little effort (and no changes to the benchmark, other than a cleanup which only slightly improved performance). Also, I/O is no longer the major bottleneck. Instead, some 27 seconds are spent simply reversing sequences.

If you look at the performance of other languages on this benchmark in the computer language shootout, you'll notice that the languages that do well either delegate all the hard work to C, despite being rather slow themselves (Perl, Python) or do not suffer from the overhead of dynamic type dispatch (bigForth because Forth is inherently untyped, and the SBCL benchmark implementation which completely bypasses CL line reading functions and switches off safety checks with declarations).

In Factor's case, not only are I/O and sequence operations written in high-level Factor code, but the language is highly dynamic, and the optimizer is still not particularly sophisticated, despite continuous improvements. I'm confident, however, that within a few months Factor will match bigForth's performance in most cases, and at that point I might consider submitting Factor to the language shootout.

Another curious note: there are two bigForth programs listed in the reverse-complement benchmark. They are almost identical, except that one program does extremely well, and the other is the slowest of all language implementations by an order of magnitude. I wonder what's going on there.

Friday, January 12, 2007

A compiler optimization

Here are the times to run the demos/raytracer benchmark in recent Factor versions:
Version    Time (seconds)

I'm still not sure what's behind the regression from 0.85 to 0.86, maybe I can reclaim that time too.

The performance gain comes from improved type inference. The compiler is now able to remove overflow checks on loop index variables if it can prove that they are bounded above by a fixnum. The raytracer benefits from this because it calls vector arithmetic words with arrays as arguments, and the length of an array is always a fixnum.

Another less impressive gain is found for a similar reason when reading a 3mb file into a string:
Version    Time (seconds)

The code to read a file into a string:
"/Users/slava/repos/Factor/boot.image.ppc"  4 1024 * 1024 *  [ stream-copy ] keep >string drop

I/O performance still sucks, as you can see. I omitted the 0.85 and 0.86 results because they are the same as 0.87.

I will now explain where the performance gain comes from.

The running example used throughout the description below is this short snippet:
dup type 0 eq? [ dup 5 < [ 1+ ] when ] when

The compiler here is able to remove both the type dispatch and overflow check on the 1+, transforming it from a full-fledged call to a polymorphic word into a single instruction. This example is contrived but it is exactly the simplest form of the very realistic case of a loop counter being incremented in a recursive word, bounded by a fixnum such as an array length.

The type inferencer maintains two forms of state; the association between values and their inferred classes in the current control flow path, and a map of constraints, where a constraint is basically a logical statement of the form "the value X has class Y". In the constraint map, each key/value pair is like an implication; if the constraint in the key becomes true, then the inferencer assumes that the constraint in the value becomes true.
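As a deliberately simplified illustration (not the compiler's actual data structure), the constraint map behaves like a dictionary of implications, where assuming one fact transitively pulls in everything it implies:

```python
class ConstraintMap:
    def __init__(self):
        self.implications = {}  # fact -> list of facts it implies
        self.known = set()      # facts assumed true on this path

    def add_implication(self, antecedent, consequent):
        self.implications.setdefault(antecedent, []).append(consequent)

    def assume(self, fact):
        """Assume `fact`, plus everything it transitively implies."""
        if fact in self.known:
            return
        self.known.add(fact)
        for implied in self.implications.get(fact, []):
            self.assume(implied)

cm = ConstraintMap()
cm.add_implication("output of eq? is t", "first input of eq? is 0")
cm.add_implication("first input of eq? is 0", "input of type is a fixnum")

# entering the branch body: assume the top of the stack is true
cm.assume("output of eq? is t")
```

Facts are plain strings here; in the real inferencer they are structured statements about values and classes, but the chaining works the same way.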

In the above example, type inference proceeds as follows. When the inferencer sees the call to type, it adds a bunch of constraint pairs to the map of constraints:
(the output of 'type' is equal to 0) -> (the input is a fixnum)
(the output of 'type' is equal to 1) -> (the input is a bignum)

Then, the inferencer sees the eq? word, and notices that one of the inputs is a literal. This forces a new constraint pair to be added to the constraint map:
(the output of 'eq?' is equal to t) -> (the first input of 'eq?' is equal to 0)

Next, the inferencer proceeds to infer types inside the when conditional. Upon entering the body of the conditional, the inferencer makes an assumption:
(the type of the top of the stack is t)

Upon making this assumption, it checks the constraint map for any additional facts that this assumption may imply. As it turns out, the constraints set up previously yield some information. It was recorded that the assumption that the top of the stack is true implies that (the first input of 'eq?' is equal to 0), which implies that (the input to 'type' is a fixnum).

So inside the branch, the type inferencer now knows that the top of the stack is a fixnum. The call to < is inlined and the type dispatch is removed, thus replacing it with fixnum<.

This is where the fun begins. If fixnum< returns true, then the first input is less than the second. This is completely obvious. However, the second input must be less than or equal to the most positive fixnum. So by transitivity, the first input must be strictly less than the most positive fixnum. The type inferencer knows this, and the fixnum< word defines a constraint to this effect.

There is a new type in Factor now:
PREDICATE: fixnum small 1+ fixnum? ;

The fixnum< word defines a constraint of the form:
(the output of 'fixnum<' is t) -> (the first input is small)

Once again, we have a branch, inside which the type inferencer assumes the top of the stack is t, which by a previous implication forces the type inferencer to assume the top of the stack is small.

At this point, the call to 1+ is reached, and the definition of the polymorphic 1+ is inlined and specialized to the case of a fixnum input. This replaces the call to 1+ with 1 fixnum+.

The type inference is almost done. However, there is still one thing to do. An optimizer hook for the fixnum+ word notices that one of its inputs is small and the other input is 1, and replaces the call to fixnum+ with fixnum+fast, which has no overflow check and compiles to a single instruction.

Here is an example of very pretty stack code defining the type inference rule for <:
! if a is less than b and b is a fixnum, then a is small
\ < [
[
small 0 `input class,
fixnum 0 `input class,
fixnum 1 `input class,
general-t 0 `output class,
] set-constraints
] "constraints" set-word-prop

It is worth noting that as long as your integers still have arbitrary precision, none of these optimizations would be any easier with a static type system. I'm sure GHC does something very similar to this.

Wednesday, January 10, 2007

Command-line editing in the TTY listener

If for whatever reason you don't want to run the Factor UI but still want input editing, history and completion, read the article on using rlwrap with Factor. The author of that article is a new Factor programmer, and he has also implemented the Fast Fourier Transform in Factor. It looks nice. I cleaned it up a bit and added it to darcs as demos/fft.factor.

Tuesday, January 09, 2007

Tool to tally contributor patch counts

I wrote a little Factor program which displays the number of patches submitted by each contributor to the Factor darcs repository. You can find it in demos/contributors.

It works as follows:
  • spawns a darcs process, asking it to emit a changelog in XML
  • parses the XML and extracts the author attributes of patch tags
  • computes the tally
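The steps above can be sketched in Python's standard library (the sample XML is a guess at the shape of darcs' changelog output, for illustration only):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def patch_counts(changelog_xml):
    """Tally patches per author attribute in a darcs XML changelog."""
    root = ET.fromstring(changelog_xml)
    authors = [p.get("author") for p in root.iter("patch")]
    return Counter(authors).most_common()

sample = """<changelog>
<patch author="slava@..."><name>fix foo</name></patch>
<patch author="slava@..."><name>fix bar</name></patch>
<patch author="erg@..."><name>add baz</name></patch>
</changelog>"""
```

patch_counts(sample) yields author/count pairs sorted by count, the same shape as the censored output below.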

This code uses the new hash-prune word which is only found in 0.88. It is like prune (which removes duplicates from a sequence) except that it does not retain order, and is much faster.

Here is the output, with actual e-mail addresses censored:
{ 1535 { "slava@..." } }
{ 270 { "chris.double@..." } }
{ 226 { "erg@t..." } }
{ 180 { "wayo.cavazos@..." } }
{ 50 { "matthew.willis@..." } }
{ 33 { "microdan@..." } }
{ 11 { "Benjamin Pollack <benjamin.pollack@...>" } }
{ 7 { "chapman.alex@..." } }
{ 4 { "Kevin Reid <kpreid@...>" } }
{ 2 { "lypanov@..." } }
{ 1 { "agl@..." } }

I'd like to do more with XML in the future, so I can hopefully suggest some new abstractions to Daniel, and help clean up the naming scheme of the XML processing words. I think Factor has the potential to simplify XML processing considerably over many other languages.

Here is the code:
REQUIRES: libs/process libs/xml ;
USING: memory io process sequences prettyprint kernel arrays
xml xml-utils ;
IN: contributors

: changelog ( -- xml )
image parent-dir cd
"darcs changes --xml-output" "r" <process-stream> read-xml ;

: authors ( xml -- seq )
children-tags [ "author" <name-tag> prop-name ] map ;

: patch-count ( authors author -- n )
swap [ = ] subset-with length ;

: patch-counts ( authors -- assoc )
dup hash-prune [ [ patch-count ] keep 2array ] map-with ;

: contributors ( -- )
changelog authors patch-counts sort-keys reverse . ;

PROVIDE: demos/contributors ;

MAIN: demos/contributors contributors ;

Monday, January 08, 2007

Warm fuzzies

Factor seems to be getting a fair bit of exposure lately. I can tell because people have started posting uninformed commentaries. While Factor is not the main subject of that article, the author mentions it a lot, and seems to argue that Haskell's referential transparency is good because stack languages, and in particular Factor, have no abstractions other than the stack (a non-sequitur?). I replied to the post in the most polite way I could, but perhaps I should not have bothered. In any case, Factor doesn't attract even 1% of the uninformed criticism that is written about Lisp, but progress is progress!

Two promising Haskell projects

Came across these two projects in the recent days:

Vital is the earlier effort, written in Java and no longer maintained. I mention it here because right now it is the most complete. Pivotal is a reimplementation in Haskell, compiled with GHC.

Both are document-based, graphical, interactive environments for programming Haskell based around functional reactive programming and direct manipulation. Basically a Self/Smalltalk-like interactive environment, for Haskell! Vital only implements a small subset of the language which is dynamically type checked; I'm not sure how far along Pivotal is, language-wise.

Pivotal is a project to watch. If it matures, then Haskell will be able to break free from the edit/compile/run cycle. (That is a big if, sadly; projects in academia have a tendency to die off as soon as the author finishes his thesis.) And of course, we would finally see a statically typed language with a nice interactive development environment! (You can say all you want about the feature lists of IDEA, Eclipse, etc., but they're not the type of development environment I have in mind when I say "development environment". They're more like fancy text editors.)

Wednesday, January 03, 2007

Factor 0.87 released

Factor 0.87 is now available. Check out the change log. The number of new and updated contributed libraries is huge. Props to all the people hacking around with Factor.

Monday, January 01, 2007

Showing the stack effect for words under the caret

Happy New Year to all the readers of my web log! Here is a new feature I implemented last night:

2007 should be a very interesting year for Factor.