Wednesday, February 28, 2007

Factor 0.88 released

Factor 0.88 is now available from the Factor web site. There is a change log. And last but not least, Eduardo Cavazos has prepared a new graphical demo called the "golden section":

To run it, you must first load Ed's experimental module system, which will actually become the default module system in Factor 0.89. In the UI, enter the following:
"libs/vocabs" require
USE: golden-section

Tuesday, February 27, 2007

Random maze generator

I couldn't resist, so I ported another one of Jon Harrop's demos to Factor. This time it is the Random Maze Generator.

The code:
! From
USING: sequences namespaces math opengl arrays gadgets
gadgets-theme kernel ;
IN: maze

: line-width 8 ;

SYMBOL: visited

: unvisited? ( cell -- ? ) first2 visited get ?nth ?nth ;

: ?set-nth ( elt i seq -- )
2dup bounds-check? [ set-nth ] [ 3drop ] if ;

: visit ( cell -- ) f swap first2 visited get ?nth ?set-nth ;

: choices ( cell -- seq )
{ { -1 0 } { 1 0 } { 0 -1 } { 0 1 } }
[ v+ ] map-with
[ unvisited? ] subset ;

: random-neighbour ( cell -- newcell ) choices random ;

: vertex ( pair -- )
first2 [ 0.5 + line-width * ] 2apply glVertex2d ;

: (draw-maze) ( cell -- )
dup vertex
glEnd
GL_POINTS [ dup vertex ] do-state
GL_LINE_STRIP glBegin
dup vertex
dup visit
dup random-neighbour dup [
(draw-maze) (draw-maze)
] [
2drop
] if ;

: draw-maze ( n -- )
line-width 2 - glLineWidth
line-width 2 - glPointSize
1.0 1.0 1.0 1.0 glColor4d
dup [ drop t <array> ] map-with visited set
GL_LINE_STRIP glBegin
{ 0 0 } dup vertex (draw-maze)
glEnd ;

TUPLE: maze dlist ;

C: maze ( -- gadget )
dup delegate>gadget
black <solid> over set-gadget-interior ;

: n ( gadget -- n ) rect-dim first2 min line-width /i ;

: delete-maze-dlist ( maze -- )
dup maze-dlist [ delete-dlist ] when*
f swap set-maze-dlist ;

: cache-maze-dlist ( maze -- dlist )
dup maze-dlist [ ] [
dup GL_COMPILE [ n draw-maze ] make-dlist
dup rot set-maze-dlist
] ?if ;

M: maze layout* delete-maze-dlist ;

M: maze ungraft* delete-maze-dlist ;

M: maze draw-gadget*
origin get [ cache-maze-dlist glCallList ] with-translation ;

M: maze pref-dim* drop { 400 400 } ;

: maze-window <maze> "Maze" open-window ;

PROVIDE: demos/maze ;

MAIN: demos/maze maze-window ;
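For readers who don't speak Factor, here is a sketch of the same recursive-backtracker algorithm in Python. It is hedged in a couple of ways: it shuffles all neighbours instead of repeatedly picking one random unvisited neighbour as the Factor code does, and it collects passages into a list instead of drawing with OpenGL.

```python
import random

def make_maze(n):
    """Carve a random spanning tree of an n x n grid by depth-first search."""
    visited = [[False] * n for _ in range(n)]
    edges = []  # passages between adjacent cells

    def visit(cell):
        x, y = cell
        visited[y][x] = True
        neighbours = [(x + dx, y + dy)
                      for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1))]
        random.shuffle(neighbours)
        for nx, ny in neighbours:
            if 0 <= nx < n and 0 <= ny < n and not visited[ny][nx]:
                edges.append((cell, (nx, ny)))
                visit((nx, ny))

    visit((0, 0))
    return edges

edges = make_maze(10)
# a spanning tree of the 10x10 grid always has exactly 99 passages
print(len(edges))  # 99
```

Whatever the random choices, the depth-first search reaches every cell of the connected grid, and each cell after the first contributes exactly one passage.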

Factor meets the Stanford Bunny

I ported another one of Jon D. Harrop's OCaml demos to Factor, this time the Stanford Bunny. From the page: "The Stanford bunny is a 3D mesh of triangles commonly used as a benchmark for computer graphics applications."


I found an FFI bug while working on this: on Mac OS X, Factor would spill floating point parameters to the C stack if there were more than 8 of them; however, this is only correct on Linux. The Mac OS X PowerPC ABI stipulates that the first 13 floating point parameters are passed in registers.

I also had a problem where the depth buffer was not working on Mac OS X. Thanks to Kevin P. Reid (kpreid on #concatenative) who asked:
kpreid: well, did you ask for a depth buffer? :)

Turns out I did not ask for a depth buffer, and I had to change the code which creates the Cocoa GLView to request one. This is not a good solution since now every UI window has a depth buffer, wasting memory. And of course at some point somebody will come along and want to do something with accumulation buffers, stereo display, and other odd stuff. So eventually I'll need a portable way to configure the GL context. For now, changing the Cocoa, X11 and Windows UI backends to always request a depth buffer will suffice.

I wanted to ship this demo with Factor, but the data file is 2.9 MB. So instead, the program downloads the data file (using libs/http-client) and saves it the first time it is run. Clean and elegant.
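The download-once pattern is simple; here is a hedged Python sketch of it (the function and argument names are illustrative, not Factor's actual API):

```python
import os
import urllib.request

def maybe_download(path, url):
    """Fetch url into path only if the file is not already on disk."""
    if not os.path.exists(path):
        print("Downloading bunny from", url)
        urllib.request.urlretrieve(url, path)
    return path
```

On the first run the file is fetched and saved; every later run finds it on disk and skips the download entirely.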

Source code - this time it's only 4 lines longer than the OCaml version:
! From
USING: alien-contrib arrays sequences math io kernel
matrices opengl shuffle gadgets http-client tools
vectors timers namespaces ;
IN: bunny

: numbers ( str -- seq )
" " split [ string>number ] map [ ] subset ;

: (parse-model) ( vs is -- vs is )
readln [
numbers {
{ [ dup length 5 = ] [ 3 head pick push ] }
{ [ dup first 3 = ] [ 1 tail over push ] }
{ [ t ] [ drop ] }
} cond (parse-model)
] when* ;

: parse-model ( stream -- vs is )
[ 100000 <vector> 100000 <vector> (parse-model) ] with-stream
[
over length # " vertices, " %
dup length # " triangles" %
] "" make print ;

: n ( vs triple -- n )
[ swap nth ] map-with
dup third over first v- >r dup second swap first v- r> cross
vneg normalize ;

: normal ( ns vs triple -- )
[ n ] keep [ rot [ v+ ] change-nth ] each-with2 ;

: normals ( vs is -- ns )
over length { 0.0 0.0 0.0 } <array> -rot
[ >r 2dup r> normal ] each drop
[ normalize ] map ;

: read-model ( stream -- model )
"Reading model" print flush [
<file-reader> parse-model [ normals ] 2keep 3array
] time ;

: model-path "demos/bunny/bun_zipper.ply" ;

: model-url "" ;

: maybe-download ( -- path )
model-path resource-path dup exists? [
"Downloading bunny from " write
model-url dup print flush
over download
] unless ;

: draw-triangle ( ns vs triple -- )
[
dup roll nth first3 glNormal3d
swap nth first3 glVertex3d
] each-with2 ;

: draw-bunny ( ns vs is -- )
GL_TRIANGLES [ [ draw-triangle ] each-with2 ] do-state ;

: cache-bunny ( triple -- displaylist )
GL_COMPILE [ first3 draw-bunny ] make-dlist ;

TUPLE: bunny-gadget dlist model ;

C: bunny-gadget ( model -- gadget )
<gadget> over set-gadget-delegate
[ set-bunny-gadget-model ] keep ;

M: bunny-gadget graft* 10 10 add-timer ;

M: bunny-gadget ungraft*
dup remove-timer
bunny-gadget-dlist [ delete-dlist ] when* ;

M: bunny-gadget tick relayout-1 ;

: aspect ( gadget -- x ) rect-dim first2 /f ;

: cache-bunny-dlist ( gadget -- dlist )
dup bunny-gadget-dlist [ ] [
dup bunny-gadget-model cache-bunny
dup rot set-bunny-gadget-dlist
] ?if ;

M: bunny-gadget draw-gadget*
1.0 glClearDepth
0.0 0.0 0.0 1.0 glClearColor
45.0 over aspect 0.1 1.0 gluPerspective
0.0 0.12 -0.25 0.0 0.1 0.0 0.0 1.0 0.0 gluLookAt
GL_LEQUAL glDepthFunc
GL_LIGHT0 glEnable
GL_LIGHT0 GL_POSITION { 1.0 -1.0 1.0 1.0 } >float-array glLightfv
millis 24000 mod 0.015 * 0.0 1.0 0.0 glRotated
0.6 0.5 0.5 1.0 glColor4d
cache-bunny-dlist glCallList ;

M: bunny-gadget pref-dim* drop { 400 300 } ;

: bunny-window ( -- )
maybe-download read-model <bunny-gadget>
"Bunny" open-window ;

Linux/PPC support and assorted FFI improvements

For the last two weeks I've mostly been working on some C library interface (FFI) improvements. FFI work involves dealing with a lot of platform-specific, arcane, poorly documented details, so it is not the most fun thing to code, but it has to be done. It is not too bad, actually, because I get to learn a lot about OS and CPU architecture. So here are the recent improvements:

  • Calling C functions returning structs by value is now supported

  • Factor callbacks can return structs by value too

  • Support for stdcall callbacks on Windows

  • The Linux/ARM port now has working FFI

  • The Linux/PPC port now has working FFI

Now with these out of the way, Factor 0.88 is ready for release.

The C library interface is quite complete now. Here is what it already supports:

  • Easy to use: no glue code has to be written in C, concise

  • Relatively efficient

  • Supports both stdcall and cdecl ABIs on Windows (and as of 0.88 this support is complete)

  • Integer and floating point parameters, return values

  • Automatic conversion for ASCII and UTF16 C string parameters, returns

  • Declaring C struct and union types, allocating/deallocating/mutating C data

  • Passing struct parameters by reference and value (value struct returns were missing until 0.88)

  • Calling C functions indirectly by function pointer

  • Callbacks from C to Factor

I'm quite proud of how well it works. There are only a few missing features, and I will be adding all of them in the near future:

  • "long long" return values work on 32-bit systems, but not "long long" parameters

  • Declaring structs and unions which contain fixed-size value arrays

  • Passing floating point parameters on ARM

Apart from these I can't really think of anything else I could add to the FFI. It really does seem quite complete and portable these days.

Monday, February 26, 2007

Sudoku solver in Factor

I ported this OCaml Sudoku solver to Factor.

The author boasts that his is a "19-line (769-byte) OCaml program"; however, it is actually 918 bytes. Perhaps he did not count whitespace. The Factor version is 55 lines (not counting the example at the end) and 1315 bytes. The line counts cannot be compared, because I like having lots of blank lines and short lines; but the larger byte count is a result of Factor being a bit more verbose than OCaml for this kind of thing, of my longer names ('board' vs. 'm', for example), and of my tendency to name factors instead of writing long definitions.

Also, I've noticed a few generally reusable words showing up when I write code that works on 2D arrays, like in this sudoku solver, libs/matrices/, and libs/levenshtein/. I think at some point I will take all the common code and try to come up with a library for 2-dimensional or multi-dimensional arrays.

Here is the code; it's in demos/sudoku now:
! Based on
USING: sequences namespaces kernel math io prettyprint ;
IN: sudoku

SYMBOL: solutions
SYMBOL: board

: pair+ swapd + >r + r> ;

: row board get nth ;
: board> row nth ;
: >board row set-nth ;
: f>board f -rot >board ;

: row-contains? ( n y -- ? ) row member? ;
: col-contains? ( n x -- ? ) board get swap <column> member? ;
: cell-contains? ( n x y i -- ? ) 3 /mod pair+ board> = ;

: box-contains? ( n x y -- ? )
[ 3 /i 3 * ] 2apply
9 [ >r 3dup r> cell-contains? ] contains?
>r 3drop r> ;

DEFER: search

: assume ( n x y -- )
[ >board ] 2keep [ >r 1+ r> search ] 2keep f>board ;

: attempt ( n x y -- ) {
{ [ 3dup nip row-contains? ] [ 3drop ] }
{ [ 3dup drop col-contains? ] [ 3drop ] }
{ [ 3dup box-contains? ] [ 3drop ] }
{ [ t ] [ assume ] }
} cond ;

: solve ( x y -- ) 9 [ 1+ pick pick attempt ] each 2drop ;

: solution. ( -- ) solutions inc "Solution:" print board get . ;

: search ( x y -- ) {
{ [ over 9 = ] [ >r drop 0 r> 1+ search ] }
{ [ over 0 = over 9 = and ] [ 2drop solution. ] }
{ [ 2dup board> ] [ >r 1+ r> search ] }
{ [ t ] [ solve ] }
} cond ;

: sudoku ( board -- ) [
"Puzzle:" print dup .

0 solutions set
[ clone ] map board set

0 0 search

solutions get number>string write " solutions." print
] with-scope ;

PROVIDE: demos/sudoku ;

MAIN: demos/sudoku {
{ f f 1 f f 5 3 f f }
{ f 5 f 4 9 f f f f }
{ f f f 1 f 2 f 6 4 }
{ f f f f f f 7 5 f }
{ 6 f f f f f f f 1 }
{ f 3 5 f f f f f f }
{ 4 6 f 9 f 3 f f f }
{ f f f f 2 4 f 9 f }
{ f f 3 6 f f 1 f f }
} sudoku ;
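The control flow is easier to see outside of stack code. Here is a hedged Python sketch of the same brute-force search — the cond-based scan over cells and digits above, with None standing in for f:

```python
def solve(board, x=0, y=0, solutions=None):
    """Backtracking search: board is a 9x9 list of lists, None = empty."""
    if solutions is None:
        solutions = []
    if x == 9:                     # end of row: move to the next one
        return solve(board, 0, y + 1, solutions)
    if y == 9:                     # past the last row: record a solution
        solutions.append([row[:] for row in board])
        return solutions
    if board[y][x] is not None:    # cell already filled in the puzzle
        return solve(board, x + 1, y, solutions)
    for n in range(1, 10):         # the attempt/assume words above
        in_row = n in board[y]
        in_col = any(row[x] == n for row in board)
        bx, by = x // 3 * 3, y // 3 * 3
        in_box = any(n in board[by + i][bx:bx + 3] for i in range(3))
        if not (in_row or in_col or in_box):
            board[y][x] = n
            solve(board, x + 1, y, solutions)
            board[y][x] = None     # undo the assumption, like f>board
    return solutions

# demo: a solved grid with three cells blanked has exactly one solution
solved = [[(i * 3 + i // 3 + j) % 9 + 1 for j in range(9)] for i in range(9)]
puzzle = [row[:] for row in solved]
puzzle[0][0] = puzzle[4][4] = puzzle[8][8] = None
print(len(solve(puzzle)))  # 1
```

Like the Factor version, it keeps searching after a solution is found, so it counts all solutions rather than stopping at the first.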

Thursday, February 22, 2007

Gram-Schmidt in Factor

Here is the Gram-Schmidt algorithm in Factor:
: proj ( v u -- w )
[ [ v. ] keep norm-sq / ] keep n*v ;

: (gram-schmidt) ( v seq -- newseq )
dupd [ proj v- ] each-with ;

: gram-schmidt ( seq -- orthogonal )
V{ } clone [ over (gram-schmidt) over push ] reduce ;

: norm-gram-schmidt ( seq -- orthonormal )
gram-schmidt [ normalize ] map ;
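For comparison, here is a hedged Python sketch of the same classical (not the numerically-stable modified) Gram-Schmidt:

```python
def proj(v, u):
    """Projection of v onto u: (v.u / u.u) * u, as in the proj word above."""
    scale = sum(a * b for a, b in zip(v, u)) / sum(a * a for a in u)
    return [scale * a for a in u]

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        for u in basis:            # subtract the projection onto each
            p = proj(v, u)         # previously orthogonalized vector
            v = [a - b for a, b in zip(v, p)]
        basis.append(v)
    return basis

print(gram_schmidt([[1.0, 1.0], [1.0, 0.0]]))  # [[1.0, 1.0], [0.5, -0.5]]
```

The two output vectors are orthogonal; normalizing each one afterwards gives the orthonormal basis, as norm-gram-schmidt does.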

I coded this because I thought I needed this for something I'm working on, and it turns out I did not, so I made a weblog entry from it instead. :-) However if this ends up being useful, I'll re-implement the numerically-stable variant.

Monday, February 19, 2007

Factor ported to OpenBSD/x86

Factor in darcs now runs on OpenBSD/x86. Porting to OpenBSD on PowerPC, ARM and other CPU architectures should be trivial, but it is not automatic: unlike on Linux, the ucontext_t structure, used in signal handlers, is CPU-specific.

You will need to install a gcc version more recent than 3.3 on your OpenBSD box. Upgrading gcc seems to be a non-trivial affair.

Also, at one time we had testers for Solaris/x86 and Linux/PPC. Not anymore, unfortunately, and recent changes to the runtime and FFI lead me to believe that Factor may no longer work on these platforms. I'm going to test Factor on Linux/PPC before releasing 0.88, but I don't have the time, inclination or hardware to maintain a Solaris/x86 install. If you're interested in seeing these platforms supported, please post to the mailing list or drop by #concatenative.

Monday, February 12, 2007

MySQL bindings

Berlin Brown has contributed bindings for the MySQL database client library, along with some Factorish wrappers. Support for prepared statements is in the works. The code is in the darcs repository in libs/mysql.

Sunday, February 11, 2007

Exciting bug

I just fixed a bug in Factor which had me stumped for several hours. The problem was that on PowerPC, Factor would crash on startup if the data heap was 2 gigabytes or larger. Preliminary investigation showed that this was only a problem if the library was compiled; bootstrapping with -no-compile first yielded an image which worked fine with a 2gb data heap. And indeed, compiling something as simple as the + word would cause a crash. Now + is called a lot, so if it is miscompiled, Factor doesn't survive long enough to read another line of input in the listener. So instead I used a trick: I compiled +, but put the compiled definition in another word. Testing didn't reveal any problems, though:
0 0 blah .
1 3 blah .
... etc, with various data types, everything worked

However as soon as I swapped in the definition of blah for +, Factor would instantly crash. Some further investigation revealed this:

0 0 blah 0 =
0 0 blah 0 > .
0 0 blah class .

So this is why my earlier testing didn't reveal the problem. The sum of 0 with 0 became a corrupted bignum, which is apparently larger than 0. But indeed, 0 plus 0 should be a fixnum and not overflow to a bignum at all, much less a corrupted one.

I looked at the disassembly for + specialized to fixnum arguments; my first suspicion was some kind of signed/unsigned integer issue in the VM, but the generated assembly was identical with a 2gb heap and a typical 64mb one:
0x0516d020: lwz   r3,-4(r14)
0x0516d024: lwz   r4,0(r14)
0x0516d028: mtxer r0
0x0516d02c: addo. r5,r4,r3
0x0516d030: bns-  0x516d094
... code to handle overflow, if there is one ...
0x0516d094: stw   r5,-4(r14)
0x0516d098: addi  r14,r14,-4

Single-stepping through this code in gdb revealed that with a 2gb heap, the overflow branch was always taken, even if there was no overflow. The exact same code behaved correctly with the default heap size.

Then it hit me like a ton of bricks. The intention of this instruction is to store a zero in the XER register:
0x0516d028: mtxer r0
But it actually stores the value of the register r0! I'm sure I knew this when I was writing the code, but for some reason I didn't notice I had made a mistake. Well, some two years (!) later, the bug manifests itself.

Why didn't this bug appear earlier? The reason is simple. The Factor code generator doesn't use r0 as a general-purpose scratch register, because it's not really general purpose; some instructions treat an operand of r0 as a literal zero (perhaps I thought mtxer behaved like this). So the code generator only ever uses r0 to store return addresses in the subroutine prologue/epilogue sequences. So when + was called with a pair of fixnums, some random return address was being stuffed into XER, not zero as intended! But amazingly, everything worked, until I thought to test Factor with a larger data heap. The reason it worked is also simple: the code heap is mapped in directly after the data heap, so if the code heap was mapped at a high enough address, storing a return address into XER had the effect of setting the overflow bit, causing the following BNS instruction to never take the branch and execution to always fall into the overflow code.
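To see why a stale XER mattered, here is an illustrative Python model (not Factor's actual implementation) of what compiled fixnum+ has to do: add, then check whether the result still fits in the tagged fixnum range, promoting to a bignum if not. The addo./bns pair above is exactly that check in hardware, and a bogus overflow bit sends even 0 + 0 down the bignum path.

```python
# assume 30-bit tagged fixnums for illustration, as on a 32-bit system
FIXNUM_MIN, FIXNUM_MAX = -(2 ** 29), 2 ** 29 - 1

def fixnum_add(a, b):
    r = a + b
    if FIXNUM_MIN <= r <= FIXNUM_MAX:
        return ("fixnum", r)        # the bns branch: no overflow
    return ("bignum", r)            # fall through into the overflow code

print(fixnum_add(0, 0))             # ('fixnum', 0) -- the sum that crashed
print(fixnum_add(FIXNUM_MAX, 1))    # ('bignum', 536870912)
```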

What an embarrassing bug! I hope the compiler implementation Gods can forgive me for forgetting to initialize a register.

Saturday, February 10, 2007

Automatically detecting SSE2, and other stuff

Bleeding edge Factor no longer has separate x86 boot images with and without SSE2 support. Instead, SSE2 is detected at the start of bootstrap. You can override this to produce an SSE2-less image on Pentium 4 chips with the -no-sse2 switch.

Basically we use the CPUID instruction, present on the original Pentium and above (and what if you want to run Factor on a 486? Sucks to be you! You'll need an FPU, at least, but it will work. Just pass the -no-sse2 switch to skip detection altogether). The code for this was written by Doug, with a bit of explanation on my part regarding some (presently undocumented) compiler internals. It serves as a simple demonstration of defining a new compiler intrinsic. Check it out.

... and what about the ARM port? There was a "day 4", but I didn't blog about it, because it was boring... all I did was fix some bugs and define some intrinsics. However I was doing all this in QEMU, because my Gumstix was borked (I corrupted the root fs). QEMU was too slow to be usable and it was quite frustrating to use, so I put the port aside. I ordered a serial daughterboard for my Gumstix, which arrived today, and I successfully reflashed it. So the porting work will continue, and in fact the only thing that remains is FFI support.

A major core feature I implemented in the last few days is an automatically growing data heap. You can still specify a heap size larger than the default on startup, but it's not necessary, because the heap will grow as needed.

The release of 0.88 is approaching fast. I just have to finish the ARM port and fix assorted bugs. I'm also hoping some cool upcoming libraries from contributors will be completed before the release. Chris has some mysterious uberframework which combines the Factor to JavaScript compiler with distributed concurrency and the HTTP server. Dan is working on Unicode 5.0 support. We also have AVL trees and binary heaps in the works, respectively from Alex and some other gentleman whose name escapes me right now, unfortunately.

Lots of hacking all around.

Thursday, February 01, 2007

ARM port day 3

The Factor compiler backend architecture has multiple tiers. There are several levels of compilation one can implement:
  • Basic stack shuffling, subroutine calls, branching, with most primitives being compiled as calls into C code
  • Intrinsics can be selectively defined to replace calls to primitives with open-coded assembly making use of register allocation and minimizing stack shuffling
  • C library interface and callback support

The last one is not really optional, since key components of Factor (non-blocking I/O, transcendental math functions, etc.) depend on the C library interface. However, I have traditionally done that part last when porting Factor to a new architecture, since it usually requires the most fiddling and is easier to test once everything else works. Open-coding certain primitives such as slot and fixnum+ can bring huge performance improvements, because it eliminates branching and memory transfers from inner loops, but this is optional.
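As an illustration of the first tier, here is a toy Python model of subroutine threading: the compiled word is nothing but a sequence of calls to primitives. The primitives and their names are invented for this sketch; real compiled code emits machine-code call instructions, not closures.

```python
# two 'primitives' operating on an explicit data stack
def prim_dup(stack):
    stack.append(stack[-1])

def prim_mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

def compile_word(primitives):
    """Tier-1 compilation: a word becomes a plain sequence of calls."""
    def compiled(stack):
        for prim in primitives:
            prim(stack)
    return compiled

square = compile_word([prim_dup, prim_mul])
stack = [7]
square(stack)
print(stack)  # [49]
```

The later tiers replace some of those calls with open-coded assembly, which is where the register allocator and intrinsics come in.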

Today I only got as far as the first level of support in the ARM backend. Basically, one has to implement various words which are called by the compiler to generate branches, generate calls, load stack values to registers, write registers to the stack, and so on. There is only a small number of such words. Also, one must describe the register file of the architecture for the register allocator. While the register allocator is not used much without intrinsics, it still plays a role in the compilation of stack shuffles (which never result in calls into C; they're always open-coded). So I worked on this one file for the most part.

Despite claiming to be a RISC architecture, ARM feels more like CISC, with its multiple load/store instructions (which I haven't used yet) and complex operand and addressing modes, including automatic pre- and post-increment on memory loads and stores. This makes ARM code a pleasure to write compared to, say, PowerPC. I also find it nicer than x86, because it doesn't have odd limitations and bizarre legacy cruft.

Getting relocation right was the hardest part, really. When the Factor compiler builds up a compiled code block corresponding to a word, it does not know the final address the code will be stored at; and indeed, even if it did, saving the image and restarting Factor may very well load the image at a different base address. So the compiler has to retain as much information as is needed for the VM to be able to "fix up" compiled code when the image is loaded. Relocation information is recorded in a table along with every compiled code block, and essentially consists of a list of instructions: "store the address of this word at this offset", "fix up this absolute address for the current code heap base", and so on. Since ARM instruction encoding is different from x86 and PowerPC, I had to add some new relocation instructions for updating offsets in packed ARM instruction fields.
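A toy model of that fix-up pass looks like this. The entry format is invented for illustration; the real VM patches packed instruction fields in machine code, not list slots.

```python
def relocate(code, relocations, code_heap_base):
    """Apply 'fix up this absolute address' entries to a loaded code block."""
    code = list(code)
    for offset, kind, value in relocations:
        if kind == "absolute-address":
            code[offset] = code_heap_base + value
    return code

# a placeholder 0 at offset 1 refers to a word at base + 0x40
code = ["insn", 0, "insn"]
relocs = [(1, "absolute-address", 0x40)]
print(relocate(code, relocs, 0x10000))  # ['insn', 65600, 'insn']
```

Because the relocation table survives alongside the code block, the same fix-up can be re-run every time the image is loaded at a new base.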

The annoying part came when I was coding support for compiling references to literals. On x86, an immediate operand can be 32 bits; on PowerPC, only 16 bits, so a load of a 32-bit value is synthesized as a pair of instructions (a load of the high halfword followed by an OR of the low halfword). On ARM, only 8-bit immediates with a 4-bit rotation are supported. Constructing larger integers and addresses from 8-bit chunks is painful; thankfully, ARM does support PC-relative addressing, so what compilers tend to do is deposit large integers close to the compiled code (close enough for the 12-bit displacement on a load instruction) and load them via PC-relative addressing.
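The encodability check an assembler has to perform can be sketched as follows: a 32-bit value is representable as an ARM data-processing immediate exactly when some even right-rotation of an 8-bit constant produces it.

```python
def arm_immediate(n):
    """Return (rot4, imm8) if n is encodable as imm8 ROR (2 * rot4)."""
    for rot in range(0, 32, 2):
        # rotating n LEFT by rot undoes a right rotation of rot
        imm = ((n << rot) | (n >> (32 - rot))) & 0xFFFFFFFF
        if imm < 256:
            return (rot // 2, imm)
    return None  # must be synthesized in pieces or loaded PC-relative

print(arm_immediate(0xFF000000))  # (4, 255): 255 rotated right by 8
print(arm_immediate(0x00000101))  # None: needs 9 significant bits
```

Values for which this returns None are exactly the ones that have to be parked near the code and fetched with a PC-relative load.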

A similar situation occurs when one wishes to call into the C VM. Calls between compiled words are not a problem, because a PC-relative branch can have up to a 26-bit displacement, which is enough for a pretty large code heap. However, there is no guarantee that the Linux kernel will map the C VM's text segment close enough to the mmapped code heap for a 26-bit branch to be sufficient. Again, on PowerPC, we use a pair of instructions to synthesize a 32-bit address, then jump there. On ARM, I decided to go with a simpler approach which doesn't require relocation at all: I just put the list of primitives in a global register variable, R8. A call to a primitive can then be done as a displaced load from R8, with the constant displacement being the primitive number.

I also found a bug in the ARM assembler, added a unit test, and fixed it. I like being able to unit test assembly code; I only wish I had the foresight to implement the post-0.85 assembler design from the very start, so that the x86 and PowerPC assemblers could have been unit tested while I was working on them. Nowadays they are relatively stable and rarely change, but at some point I might write unit tests for them anyway.

So after 3 days work, what I have is essentially a subroutine threaded compiler for ARM with minor optimizations. This is a rather trivial accomplishment in itself; however, tomorrow I will start writing intrinsics for the most critical primitives, and this will result in a noticeable performance boost, with the full force of the register allocator and dataflow optimizer coming into play.