Friday, December 19, 2008

Discussion of concatenative languages on various blogs

A couple of bloggers are exploring concatenative languages and have posted interesting articles, with more depth than the usual "dup * squares an integer" cursory glance:

Aaron Schaefer (elasticdog in #concatenative; he's been contributing to our math library and writing Project Euler solutions) has written a couple of introductory Factor articles too, and is working on a third installment.

Things have also been picking up in the #concatenative IRC channel on FreeNode, and usually there are 60-70 people online. Lots of interesting discussion going on there.

Finally, I'll take this opportunity to plug the concatenative languages wiki, and remind everyone that it is running on a web framework and server written in a concatenative language. Other than an issue with our O/R mapper that we fixed recently, it has been running without any problems. It has been serving between 10,000 and 24,000 hits a day; I'd like to see it serve 10 or 20 times that number to find out how it holds up.

Monday, December 15, 2008

Prevalence of shuffle words and dataflow combinators

Today on IRC, binarybandit wrote this word:
: pie-chart ( keys values -- url )
[ "|" join ] [ [ number>string ] map "," join ] bi*
"" sprintf ;

(Side note: it uses the recently-contributed printf vocabulary, which is a nice piece of work: it parses the format string using PEGs and compiles it to a quotation at compile time).
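For readers who don't speak Factor, here is a rough Python sketch of what pie-chart builds, assuming the old Google Chart API URL format (cht=p for a pie chart, t:-encoded data, chl for labels); the hostname and parameters are from that API, not from the Factor code above:

```python
def pie_chart(keys, values, size="400x200"):
    """Build a Google Chart API pie-chart URL from parallel
    sequences of labels and numeric values."""
    labels = "|".join(keys)                  # like [ "|" join ]
    data = ",".join(str(v) for v in values)  # like [ number>string ] map "," join
    return ("http://chart.apis.google.com/chart"
            f"?cht=p&chs={size}&chd=t:{data}&chl={labels}")

print(pie_chart(["dup", "drop", "swap"], [40, 35, 25]))
```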

This inspired me to make some pie charts showing the frequency of usages of shuffle words and dataflow combinators in the Factor library. So first, we need a word which takes a sequence of words and counts the number of usages of each, normalizing the results so that their sum is 1:
: usage-histogram ( words -- keys values )
[ [ name>> ] [ usage length ] bi ] { } map>assoc
dup values sum '[ _ /f ] assoc-map
sort-values unzip ;
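The normalization and sorting steps translate directly; here is a Python sketch, with a hypothetical dict of word names to usage counts standing in for the ( words -- keys values ) input and output:

```python
def usage_histogram(usage_counts):
    """Normalize raw usage counts so they sum to 1, then return
    (keys, values) sorted by ascending frequency."""
    total = sum(usage_counts.values())
    normalized = {k: v / total for k, v in usage_counts.items()}  # like '[ _ /f ] assoc-map
    pairs = sorted(normalized.items(), key=lambda kv: kv[1])      # like sort-values
    keys = [k for k, _ in pairs]                                  # like unzip
    values = [v for _, v in pairs]
    return keys, values

keys, values = usage_histogram({"dup": 30, "swap": 20, "rot": 10})
```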

Now we can use a meta-programming trick to extract the list of shufflers from the shuffle words help article:
"shuffle-words" >link uses [ word? ] filter usage-histogram pie-chart print

Here is the resulting chart:

Note that it counts the number of words using each shuffler, so a word that calls a shuffler more than once is only counted once, and a word which uses two distinct shufflers is counted twice. You can see that dup, drop and swap are by far the most popular, and the more complex patterns are rarely used. It is interesting that when beginners first start writing Factor they tend to write code which is heavily biased towards the less frequently-used shufflers. I think teaching dataflow combinators before shufflers can help here, because they make it easier to structure your code in a clean way.

For the cleave/spread/apply combinators, we have to do a bit more work to get a list of them programmatically, since they're spread over several help articles:
[ "cleave-combinators" "spread-combinators" "apply-combinators" ]
[ >link uses ] map concat { either? both? } diff
[ word? ] filter

Now we can make the pie chart as before. Here is the result:

Again, bi completely dominates.

I'm not sure what to make of these results, other than that the Google Charts API is pretty nifty, and that there is untapped potential for "code data mining" in Factor's reflection capabilities. We could use this to discover potential abstractions and patterns.

Thursday, December 04, 2008

Arrays of unboxed primitive values, and a faster Mersenne Twister

I added efficient arrays of integers and floats to Factor. We've had byte-arrays and float-arrays for a while; this code is essentially a generalization of these to every primitive C type: chars, shorts, ints, longs, long longs, unsigned versions of all of these, as well as floats and doubles.

Specialized arrays

For each primitive C type, there is a vocabulary named specialized-arrays.type, which defines a class named type-array, together with several related words: a constructor, and a word for converting existing sequences to specialized arrays of this type. As with all sequence types, the class implements methods for length, nth and set-nth; there is also a literal syntax for parsing and printing specialized arrays. An example:
int-array{ 584 -13 11 } [ 10 + ] map
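The closest analogue in Python is the standard array module, which also stores unboxed primitive values keyed by a per-C-type typecode; a rough equivalent of the example above:

```python
from array import array

# 'i' is the signed C int typecode; elements are stored unboxed,
# and are boxed only transiently while mapping over them.
ints = array("i", [584, -13, 11])
result = array("i", (x + 10 for x in ints))
print(list(result))  # -> [594, -3, 21]
```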

Specialized vectors

There are also resizeable versions of all of these, in specialized-vectors.type. These work like ordinary vectors; you can use words such as push and pop, accumulating a stream of values in a vector.

Direct arrays

A third variant is also provided. These are the "direct specialized arrays". Sometimes when working with a C library, the library hands you a pointer to some data which is to be viewed as an array of primitive values. Direct arrays accomplish this. For every primitive C type, the vocabulary provides a word named <direct-type-array> which takes an alien pointer and a length. It wraps them in a tuple which responds to the sequence protocol by reading from memory.
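Python's ctypes can express the same pointer-plus-length view, which makes for a reasonable sketch of what a direct array does; here the "foreign" pointer is just the address of a buffer we allocated ourselves:

```python
import ctypes

# Pretend some C library handed us this pointer and a length.
buf = (ctypes.c_int * 4)(5, 10, 15, 20)
ptr = ctypes.addressof(buf)
length = 4

# View the raw pointer as an array of C ints; indexing reads memory directly.
direct = (ctypes.c_int * length).from_address(ptr)
print(direct[2])
```

Writes through the view hit the same underlying memory, just as mutating a direct array mutates the C library's data.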

Memory-mapped files

Direct arrays are useful in another context, and that is when working with memory-mapped files. Again, for every primitive C type, there is a vocabulary, io.mmap.type. It defines two words, <mapped-type-array> and with-mapped-type-file. Here is an example usage of the latter: we're mapping in a file of floating point numbers and adding them up:
[ 0.0 [ + ] reduce ] with-mapped-float-file
We've had io.mmap for a while, but specialized arrays make this library a lot more useful.
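The same pattern is expressible with Python's mmap and struct modules; a sketch that maps a file of doubles and sums them (the file and its little-endian double format are made up for the example):

```python
import mmap
import os
import struct
import tempfile

# Write a few doubles to a temporary file so there is something to map.
values = [1.5, 2.25, 3.0]
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as f:
    path = f.name
    f.write(struct.pack(f"<{len(values)}d", *values))

# Map the file and reduce over it without slurping it into a list first.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        count = len(m) // struct.calcsize("<d")
        total = sum(struct.unpack_from("<d", m, i * 8)[0] for i in range(count))

print(total)
os.remove(path)
```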

This is all really powerful stuff. On one hand, we have high-level data types that behave like any other Factor sequence and can be used interchangeably with sequence utility words; on the other hand, not only do they offer space savings over heterogeneous arrays by avoiding boxing, they also have the potential to improve performance -- the low-level primitives they're built on are already optimized by the compiler, and it is possible to write code which operates on specialized float and integer arrays without any boxing at all.

Mersenne Twister

A good example of an algorithm which benefits from specialized arrays is the Mersenne Twister random number generator. The Factor implementation is very straightforward, and shorter than the C version. I changed it to use a uint-array instead of a normal array to store the random generator state. After making a few more tweaks, the time taken to generate a 100,000,000-element byte array of random bytes on my MacBook Pro went down from 37 seconds to 5 seconds!

Here is the main method of the Mersenne Twister, random-32*, and its surrounding utility words:
: mt-n 624 ; inline
: mt-m 397 ; inline
: mt-a HEX: 9908b0df ; inline

: mersenne-wrap ( n -- n' )
dup mt-n > [ mt-n - ] when ; inline

: wrap-nth ( n seq -- obj )
[ mersenne-wrap ] dip nth-unsafe ; inline

: set-wrap-nth ( obj n seq -- )
[ mersenne-wrap ] dip set-nth-unsafe ; inline

: calculate-y ( n seq -- y )
[ wrap-nth 31 mask-bit ]
[ [ 1+ ] [ wrap-nth ] bi* 31 bits ] 2bi bitor ; inline

: (mt-generate) ( n seq -- next-mt )
[
calculate-y
[ 2/ ] [ odd? mt-a 0 ? ] bi bitxor
] [
[ mt-m + ] [ wrap-nth ] bi*
] 2bi bitxor ; inline

: mt-generate ( mt -- )
[
mt-n swap seq>> '[
_ [ (mt-generate) ] [ set-wrap-nth ] 2bi
] each
] [ 0 >>i drop ] bi ; inline

: mt-temper ( y -- yt )
dup -11 shift bitxor
dup 7 shift HEX: 9d2c5680 bitand bitxor
dup 15 shift HEX: efc60000 bitand bitxor
dup -18 shift bitxor ; inline

: next-index ( mt -- i )
dup i>> dup mt-n < [ nip ] [ drop mt-generate 0 ] if ; inline

M: mersenne-twister random-32* ( mt -- r )
[ next-index ]
[ seq>> wrap-nth mt-temper ]
[ [ 1+ ] change-i drop ] tri ;
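As a cross-check, the mt-temper step written out in Python with the same shifts and masks (the constants are the standard MT19937 tempering parameters; a negative shift in the Factor code is a right shift here):

```python
def mt_temper(y):
    """MT19937 output tempering, the analogue of mt-temper above."""
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & 0xFFFFFFFF  # keep the result in 32 bits

print(hex(mt_temper(1)))
```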

A naive implementation of Factor would struggle with this code because of method dispatch overhead (all arithmetic and sequence access is generic), and because of overflow checks on integer arithmetic (integers have arbitrary precision in Factor). Thankfully, the Factor implementation is not naive. The high-level optimizer, applied to the above code, produces the following output -- note that the translation from high-level IR back to Factor code is imprecise, and some IR nodes cannot be expressed, hence the "COMPLEX SHUFFLE" strings that appear here:
dup >r dup 3 slot dup 624 fixnum<
[ nip ] [
drop dup >r 624 swap 2 slot
>r r> >r r> 0 -rot \ ( gensym ) [
pick pick fixnum< [
swap >r 2dup "COMPLEX SHUFFLE" >r r> 2dup
"COMPLEX SHUFFLE" >r r> 3 slot
swap 4 fixnum*fast
alien-unsigned-4 2147483648 fixnum-bitand
"COMPLEX SHUFFLE" >r 1 fixnum+fast
r> >r r> 3 slot swap 4 fixnum*fast
alien-unsigned-4 2147483647 fixnum-bitand
fixnum-bitor dup >r -1 fixnum-shift-fast
r> 1 fixnum-bitand 1 eq? 2567483615 0 rot
[ drop ] [ nip ] if fixnum-bitxor
"COMPLEX SHUFFLE" >r 397 fixnum+fast
r> >r dup 624 fixnum>
[ 624 fixnum-fast ] [ ] if r> 3 slot
swap 4 fixnum*fast
alien-unsigned-4 fixnum-bitxor
"COMPLEX SHUFFLE" >r r> 3 slot
swap 4 fixnum*fast
set-alien-unsigned-4 "COMPLEX SHUFFLE" r> swap
"COMPLEX SHUFFLE" 1 fixnum+fast
"COMPLEX SHUFFLE" ( gensym )
] [ 3drop ] if
] label r> >r 0 r> 3 set-slot 0
] if r> dup >r 2 slot >r r> 3 slot swap 4 fixnum*fast
alien-unsigned-4 dup -11 fixnum-shift-fast fixnum-bitxor
dup 7 fixnum-shift-fast 2636928640 fixnum-bitand
fixnum-bitxor dup 15 fixnum-shift-fast
4022730752 fixnum-bitand fixnum-bitxor
dup -18 fixnum-shift-fast fixnum-bitxor r> dup >r 3 slot
1 fixnum+fast r> swap swap >r r> 3 set-slot

Only primitive calls remain; everything else has evaporated after 11 distinct optimization passes. A lot of redundant stack shuffling still remains; the low-level optimizer eliminates that and produces a sequence of low-level IR instructions:
=== word: mersenne-twister=>random-32*, label: mersenne-twister=>random-32*

_label 0
_label 1
##inc-r 1
##inc-d 1
##peek V int-regs 1368412 D 1
##replace V int-regs 1368412 R 0
##slot-imm V int-regs 1368417 V int-regs 1368412 3 2
##replace V int-regs 1368417 D 0
_compare-imm-branch 3 V int-regs 1368417 4992 cc>=
_label 2
##inc-d -1
##peek V int-regs 1368425 D -1
##replace V int-regs 1368425 D 0
_branch 13
_label 3
##inc-r 1
##inc-d 1
##peek V int-regs 1368427 D 2
##replace V int-regs 1368427 R 0
##load-immediate V int-regs 1368429 4992
##slot-imm V int-regs 1368434 V int-regs 1368427 2 2
##load-immediate V int-regs 1368439 0
##replace V int-regs 1368434 D 0
##replace V int-regs 1368429 D 1
##replace V int-regs 1368439 D 2
_label 4
##peek V int-regs 1368443 D 2
_compare-imm-branch 11 V int-regs 1368443 4992 cc>=
_label 5
##inc-r 7
##peek V int-regs 1368453 D 1
##peek V int-regs 1368454 D 0
##replace V int-regs 1368453 R 6
##peek V int-regs 1368456 D 2
##replace V int-regs 1368454 R 4
##replace V int-regs 1368456 R 5
##replace V int-regs 1368454 R 2
##replace V int-regs 1368456 R 3
##replace V int-regs 1368454 R 0
##replace V int-regs 1368456 R 1
##slot-imm V int-regs 1368478 V int-regs 1368454 3 2
##copy V int-regs 1368483 V int-regs 1368456
##shl-imm V int-regs 1368483 V int-regs 1368483 2
##copy V int-regs 1368486 V int-regs 1368483
##sar-imm V int-regs 1368486 V int-regs 1368486 3
##add-imm V int-regs 1368487 V int-regs 1368478 13
##add V int-regs 1368489 V int-regs 1368486 V int-regs 1368487
##alien-unsigned-4 V int-regs 1368490 V int-regs 1368489
##copy V int-regs 1368491 V int-regs 1368490
##shl-imm V int-regs 1368491 V int-regs 1368491 3
##load-immediate V int-regs 1368492 17179869184
##copy V int-regs 1368495 V int-regs 1368491
##and V int-regs 1368495 V int-regs 1368495 V int-regs 1368492
##add-imm V int-regs 1368501 V int-regs 1368456 8
##copy V int-regs 1368512 V int-regs 1368501
##shl-imm V int-regs 1368512 V int-regs 1368512 2
##copy V int-regs 1368515 V int-regs 1368512
##sar-imm V int-regs 1368515 V int-regs 1368515 3
##add V int-regs 1368518 V int-regs 1368515 V int-regs 1368487
##alien-unsigned-4 V int-regs 1368519 V int-regs 1368518
##copy V int-regs 1368520 V int-regs 1368519
##shl-imm V int-regs 1368520 V int-regs 1368520 3
##load-immediate V int-regs 1368521 17179869176
##copy V int-regs 1368524 V int-regs 1368520
##and V int-regs 1368524 V int-regs 1368524 V int-regs 1368521
##copy V int-regs 1368527 V int-regs 1368495
##or V int-regs 1368527 V int-regs 1368527 V int-regs 1368524
##copy V int-regs 1368532 V int-regs 1368527
##sar-imm V int-regs 1368532 V int-regs 1368532 4
##copy V int-regs 1368533 V int-regs 1368532
##shl-imm V int-regs 1368533 V int-regs 1368533 3
##replace V int-regs 1368533 D 2
##copy V int-regs 1368537 V int-regs 1368527
##and-imm V int-regs 1368537 V int-regs 1368537 8
##load-immediate V int-regs 1368542 20539868920
##load-immediate V int-regs 1368543 0
##replace V int-regs 1368543 D 0
##replace V int-regs 1368542 D 1
_compare-imm-branch 7 V int-regs 1368537 8 cc/=
_label 6
##inc-d -1
_branch 8
_label 7
##inc-d -1
##peek V int-regs 1368550 D -1
##replace V int-regs 1368550 D 0
_label 8
##inc-r -1
##peek V int-regs 1368551 D 1
##peek V int-regs 1368552 D 0
##copy V int-regs 1368553 V int-regs 1368551
##xor V int-regs 1368553 V int-regs 1368553 V int-regs 1368552
##replace V int-regs 1368553 D 1
##peek V int-regs 1368554 R 0
##peek V int-regs 1368555 R -1
##add-imm V int-regs 1368559 V int-regs 1368554 3176
##replace V int-regs 1368555 R 0
##replace V int-regs 1368559 D 0
_compare-imm-branch 10 V int-regs 1368559 4992 cc<=
_label 9
##peek V int-regs 1368569 D 0
##sub-imm V int-regs 1368570 V int-regs 1368569 4992
##replace V int-regs 1368570 D 0
_label 10
##inc-r -6
##inc-d 1
##peek V int-regs 1368571 R -6
##slot-imm V int-regs 1368574 V int-regs 1368571 3 2
##peek V int-regs 1368575 D 1
##copy V int-regs 1368579 V int-regs 1368575
##shl-imm V int-regs 1368579 V int-regs 1368579 2
##copy V int-regs 1368582 V int-regs 1368579
##sar-imm V int-regs 1368582 V int-regs 1368582 3
##add-imm V int-regs 1368583 V int-regs 1368574 13
##add V int-regs 1368585 V int-regs 1368582 V int-regs 1368583
##alien-unsigned-4 V int-regs 1368586 V int-regs 1368585
##copy V int-regs 1368587 V int-regs 1368586
##shl-imm V int-regs 1368587 V int-regs 1368587 3
##peek V int-regs 1368588 D 2
##copy V int-regs 1368590 V int-regs 1368588
##xor V int-regs 1368590 V int-regs 1368590 V int-regs 1368587
##peek V int-regs 1368591 R -4
##peek V int-regs 1368592 R -5
##slot-imm V int-regs 1368597 V int-regs 1368592 3 2
##copy V int-regs 1368602 V int-regs 1368591
##shl-imm V int-regs 1368602 V int-regs 1368602 2
##copy V int-regs 1368605 V int-regs 1368602
##sar-imm V int-regs 1368605 V int-regs 1368605 3
##add-imm V int-regs 1368606 V int-regs 1368597 13
##add V int-regs 1368608 V int-regs 1368605 V int-regs 1368606
##copy V int-regs 1368610 V int-regs 1368590
##sar-imm V int-regs 1368610 V int-regs 1368610 3
##set-alien-integer-4 V int-regs 1368608 V int-regs 1368610
##peek V int-regs 1368611 R -2
##peek V int-regs 1368612 R -3
##peek V int-regs 1368613 R -1
##add-imm V int-regs 1368620 V int-regs 1368611 8
##replace V int-regs 1368620 D 2
##replace V int-regs 1368612 D 0
##replace V int-regs 1368613 D 1
_branch 4
_label 11
##inc-d -3
_label 12
##inc-r -1
##inc-d 1
##peek V int-regs 1368626 R -1
##load-immediate V int-regs 1368628 0
##set-slot-imm V int-regs 1368628 V int-regs 1368626 3 2
##replace V int-regs 1368628 D 0
_label 13
##inc-r -1
##peek V int-regs 1368634 R -1
##slot-imm V int-regs 1368639 V int-regs 1368634 2 2
##slot-imm V int-regs 1368644 V int-regs 1368639 3 2
##peek V int-regs 1368645 D 0
##copy V int-regs 1368649 V int-regs 1368645
##shl-imm V int-regs 1368649 V int-regs 1368649 2
##copy V int-regs 1368652 V int-regs 1368649
##sar-imm V int-regs 1368652 V int-regs 1368652 3
##add-imm V int-regs 1368653 V int-regs 1368644 13
##add V int-regs 1368655 V int-regs 1368652 V int-regs 1368653
##alien-unsigned-4 V int-regs 1368656 V int-regs 1368655
##copy V int-regs 1368657 V int-regs 1368656
##shl-imm V int-regs 1368657 V int-regs 1368657 3
##copy V int-regs 1368661 V int-regs 1368657
##sar-imm V int-regs 1368661 V int-regs 1368661 14
##copy V int-regs 1368662 V int-regs 1368661
##shl-imm V int-regs 1368662 V int-regs 1368662 3
##copy V int-regs 1368665 V int-regs 1368657
##xor V int-regs 1368665 V int-regs 1368665 V int-regs 1368662
##copy V int-regs 1368669 V int-regs 1368665
##shl-imm V int-regs 1368669 V int-regs 1368669 7
##load-immediate V int-regs 1368670 21095429120
##copy V int-regs 1368673 V int-regs 1368669
##and V int-regs 1368673 V int-regs 1368673 V int-regs 1368670
##copy V int-regs 1368676 V int-regs 1368665
##xor V int-regs 1368676 V int-regs 1368676 V int-regs 1368673
##copy V int-regs 1368680 V int-regs 1368676
##shl-imm V int-regs 1368680 V int-regs 1368680 15
##load-immediate V int-regs 1368681 32181846016
##copy V int-regs 1368684 V int-regs 1368680
##and V int-regs 1368684 V int-regs 1368684 V int-regs 1368681
##copy V int-regs 1368687 V int-regs 1368676
##xor V int-regs 1368687 V int-regs 1368687 V int-regs 1368684
##copy V int-regs 1368691 V int-regs 1368687
##sar-imm V int-regs 1368691 V int-regs 1368691 21
##copy V int-regs 1368692 V int-regs 1368691
##shl-imm V int-regs 1368692 V int-regs 1368692 3
##copy V int-regs 1368695 V int-regs 1368687
##xor V int-regs 1368695 V int-regs 1368695 V int-regs 1368692
##replace V int-regs 1368695 D 0
##slot-imm V int-regs 1368701 V int-regs 1368634 3 2
##add-imm V int-regs 1368704 V int-regs 1368701 8
##set-slot-imm V int-regs 1368704 V int-regs 1368634 3 2

Finally, after register allocation, these instructions map pretty much directly onto machine code:
0x0000000109901940: add    $0x8,%r15
0x0000000109901944: add $0x8,%r14
0x0000000109901948: mov -0x8(%r14),%rax
0x000000010990194c: mov %rax,(%r15)
0x000000010990194f: mov 0x16(%rax),%rcx
0x0000000109901953: mov %rcx,(%r14)
0x0000000109901956: cmp $0x1380,%rcx
0x000000010990195d: jge 0x109901973
0x0000000109901963: sub $0x8,%r14
0x0000000109901967: mov 0x8(%r14),%rcx
0x000000010990196b: mov %rcx,(%r14)
0x000000010990196e: jmpq 0x109901b5b
0x0000000109901973: add $0x8,%r15
0x0000000109901977: add $0x8,%r14
0x000000010990197b: mov -0x10(%r14),%rcx
0x000000010990197f: mov %rcx,(%r15)
0x0000000109901982: mov $0x1380,%rax
0x000000010990198c: mov 0xe(%rcx),%rdx
0x0000000109901990: mov $0x0,%rcx
0x000000010990199a: mov %rdx,(%r14)
0x000000010990199d: mov %rax,-0x8(%r14)
0x00000001099019a1: mov %rcx,-0x10(%r14)
0x00000001099019a5: nop
0x00000001099019a6: nop
0x00000001099019a7: nop
0x00000001099019a8: nop
0x00000001099019a9: nop
0x00000001099019aa: nop
0x00000001099019ab: nop
0x00000001099019ac: nop
0x00000001099019ad: nop
0x00000001099019ae: nop
0x00000001099019af: nop
0x00000001099019b0: mov -0x10(%r14),%rcx
0x00000001099019b4: cmp $0x1380,%rcx
0x00000001099019bb: jge 0x109901b3a
0x00000001099019c1: add $0x38,%r15
0x00000001099019c5: mov -0x8(%r14),%rcx
0x00000001099019c9: mov (%r14),%rdx
0x00000001099019cc: mov %rcx,-0x30(%r15)
0x00000001099019d0: mov -0x10(%r14),%rcx
0x00000001099019d4: mov %rdx,-0x20(%r15)
0x00000001099019d8: mov %rcx,-0x28(%r15)
0x00000001099019dc: mov %rdx,-0x10(%r15)
0x00000001099019e0: mov %rcx,-0x18(%r15)
0x00000001099019e4: mov %rdx,(%r15)
0x00000001099019e7: mov %rcx,-0x8(%r15)
0x00000001099019eb: mov 0x16(%rdx),%rax
0x00000001099019ef: mov %rcx,%rdx
0x00000001099019f2: shl $0x2,%rdx
0x00000001099019f6: sar $0x3,%rdx
0x00000001099019fa: lea 0xd(%rax),%rbx
0x00000001099019fe: lea (%rdx,%rbx,1),%rax
0x0000000109901a02: mov (%rax),%edx
0x0000000109901a04: shl $0x3,%rdx
0x0000000109901a08: mov $0x400000000,%rax
0x0000000109901a12: and %rax,%rdx
0x0000000109901a15: lea 0x8(%rcx),%rax
0x0000000109901a19: shl $0x2,%rax
0x0000000109901a1d: sar $0x3,%rax
0x0000000109901a21: lea (%rax,%rbx,1),%rcx
0x0000000109901a25: mov (%rcx),%eax
0x0000000109901a27: shl $0x3,%rax
0x0000000109901a2b: mov $0x3fffffff8,%rcx
0x0000000109901a35: and %rcx,%rax
0x0000000109901a38: or %rax,%rdx
0x0000000109901a3b: mov %rdx,%rax
0x0000000109901a3e: sar $0x4,%rax
0x0000000109901a42: shl $0x3,%rax
0x0000000109901a46: mov %rax,-0x10(%r14)
0x0000000109901a4a: and $0x8,%rdx
0x0000000109901a4e: mov $0x4c84586f8,%rax
0x0000000109901a58: mov $0x0,%rcx
0x0000000109901a62: mov %rcx,(%r14)
0x0000000109901a65: mov %rax,-0x8(%r14)
0x0000000109901a69: cmp $0x8,%rdx
0x0000000109901a6d: jne 0x109901a7c
0x0000000109901a73: sub $0x8,%r14
0x0000000109901a77: jmpq 0x109901a87
0x0000000109901a7c: sub $0x8,%r14
0x0000000109901a80: mov 0x8(%r14),%rcx
0x0000000109901a84: mov %rcx,(%r14)
0x0000000109901a87: sub $0x8,%r15
0x0000000109901a8b: mov -0x8(%r14),%rcx
0x0000000109901a8f: mov (%r14),%rax
0x0000000109901a92: xor %rax,%rcx
0x0000000109901a95: mov %rcx,-0x8(%r14)
0x0000000109901a99: mov (%r15),%rcx
0x0000000109901a9c: mov 0x8(%r15),%rax
0x0000000109901aa0: lea 0xc68(%rcx),%rdx
0x0000000109901aa7: mov %rax,(%r15)
0x0000000109901aaa: mov %rdx,(%r14)
0x0000000109901aad: cmp $0x1380,%rdx
0x0000000109901ab4: jle 0x109901ac7
0x0000000109901aba: mov (%r14),%rdx
0x0000000109901abd: lea -0x1380(%rdx),%rax
0x0000000109901ac4: mov %rax,(%r14)
0x0000000109901ac7: sub $0x30,%r15
0x0000000109901acb: add $0x8,%r14
0x0000000109901acf: mov 0x30(%r15),%rax
0x0000000109901ad3: mov 0x16(%rax),%rdx
0x0000000109901ad7: mov -0x8(%r14),%rax
0x0000000109901adb: shl $0x2,%rax
0x0000000109901adf: sar $0x3,%rax
0x0000000109901ae3: lea 0xd(%rdx),%rcx
0x0000000109901ae7: lea (%rax,%rcx,1),%rdx
0x0000000109901aeb: mov (%rdx),%ecx
0x0000000109901aed: shl $0x3,%rcx
0x0000000109901af1: mov -0x10(%r14),%rdx
0x0000000109901af5: xor %rcx,%rdx
0x0000000109901af8: mov 0x20(%r15),%rcx
0x0000000109901afc: mov 0x28(%r15),%rax
0x0000000109901b00: mov 0x16(%rax),%rbx
0x0000000109901b04: shl $0x2,%rcx
0x0000000109901b08: sar $0x3,%rcx
0x0000000109901b0c: lea 0xd(%rbx),%rax
0x0000000109901b10: lea (%rcx,%rax,1),%rbx
0x0000000109901b14: sar $0x3,%rdx
0x0000000109901b18: mov %edx,(%rbx)
0x0000000109901b1a: mov 0x10(%r15),%rdx
0x0000000109901b1e: mov 0x18(%r15),%rbx
0x0000000109901b22: mov 0x8(%r15),%rax
0x0000000109901b26: lea 0x8(%rdx),%rcx
0x0000000109901b2a: mov %rcx,-0x10(%r14)
0x0000000109901b2e: mov %rbx,(%r14)
0x0000000109901b31: mov %rax,-0x8(%r14)
0x0000000109901b35: jmpq 0x1099019b0
0x0000000109901b3a: sub $0x18,%r14
0x0000000109901b3e: sub $0x8,%r15
0x0000000109901b42: add $0x8,%r14
0x0000000109901b46: mov 0x8(%r15),%rcx
0x0000000109901b4a: mov $0x0,%rax
0x0000000109901b54: mov %rax,0x16(%rcx)
0x0000000109901b58: mov %rax,(%r14)
0x0000000109901b5b: sub $0x8,%r15
0x0000000109901b5f: mov 0x8(%r15),%rax
0x0000000109901b63: mov 0xe(%rax),%rcx
0x0000000109901b67: mov 0x16(%rcx),%rbx
0x0000000109901b6b: mov (%r14),%rcx
0x0000000109901b6e: shl $0x2,%rcx
0x0000000109901b72: sar $0x3,%rcx
0x0000000109901b76: lea 0xd(%rbx),%rdx
0x0000000109901b7a: lea (%rcx,%rdx,1),%rbx
0x0000000109901b7e: mov (%rbx),%edx
0x0000000109901b80: shl $0x3,%rdx
0x0000000109901b84: mov %rdx,%rbx
0x0000000109901b87: sar $0xe,%rbx
0x0000000109901b8b: shl $0x3,%rbx
0x0000000109901b8f: xor %rbx,%rdx
0x0000000109901b92: mov %rdx,%rbx
0x0000000109901b95: shl $0x7,%rbx
0x0000000109901b99: mov $0x4e962b400,%rcx
0x0000000109901ba3: and %rcx,%rbx
0x0000000109901ba6: xor %rbx,%rdx
0x0000000109901ba9: mov %rdx,%rbx
0x0000000109901bac: shl $0xf,%rbx
0x0000000109901bb0: mov $0x77e300000,%rcx
0x0000000109901bba: and %rcx,%rbx
0x0000000109901bbd: xor %rbx,%rdx
0x0000000109901bc0: mov %rdx,%rbx
0x0000000109901bc3: sar $0x15,%rbx
0x0000000109901bc7: shl $0x3,%rbx
0x0000000109901bcb: xor %rbx,%rdx
0x0000000109901bce: mov %rdx,(%r14)
0x0000000109901bd1: mov 0x16(%rax),%rdx
0x0000000109901bd5: lea 0x8(%rdx),%rbx
0x0000000109901bd9: mov %rbx,0x16(%rax)
0x0000000109901bdd: retq

That's pretty tight code for a dynamic language.

Implementation of specialized arrays

So there are 13 C types here: char uchar short ushort int uint long ulong longlong ulonglong float double void*. For each C type, we have three vocabularies: specialized-arrays.type, specialized-vectors.type, and io.mmap.type. Implementing all of this by hand would be a pain: even with the best abstractions in Factor's object system, at a minimum there would be 4 or 5 lines of code per type and per feature, and ideally you'd want to clone some methods once per implementation instead of abstracting everything out through dynamic dispatch, for performance reasons. So instead, I did what any self-respecting programmer would do in this situation: I made the computer do all the hard work. Factor lacked the abstraction necessary to express what I wanted, but Factor is extensible, so I just implemented the abstraction -- as a library. The feature we want here is essentially C++ templates. So yes, there is now a functors vocabulary which implements what amounts to C++ templates... in Factor. I called them "functors", even though they're more like templates than ML functors, because the term "template" has meaning in Factor's web framework and I didn't want to create confusion. I don't expect functors will see widespread use and I certainly don't want to promote this as a central feature of Factor. It is just another library that happens to extend the parser -- like many other Factor libraries do.
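The flavor of this is easy to approximate in a language where classes can be generated at runtime; here is a toy Python sketch of a "template" that stamps out one specialized array class per typecode (all the names here are invented for the example, not part of Factor or its functors):

```python
from array import array

def define_specialized_array(name, typecode):
    """Generate a specialized array class for one C type -- a toy
    analogue of instantiating a functor once per primitive type."""
    def __init__(self, values=()):
        self._data = array(typecode, values)
    def __getitem__(self, i):
        return self._data[i]
    def __len__(self):
        return len(self._data)
    return type(name, (), {
        "__init__": __init__,
        "__getitem__": __getitem__,
        "__len__": __len__,
        "typecode": typecode,
    })

# One call per C type, just as the functor is invoked once per type.
IntArray = define_specialized_array("IntArray", "i")
DoubleArray = define_specialized_array("DoubleArray", "d")
```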

Let's look at how functors are used. For example, here is the entire text of the vocabulary:

<< "int" define-direct-array >>

The key of course is that it's invoking the define-direct-array word, at parse time (hence the << and >>). This word is defined in the vocabulary, and its definition looks quite unusual:
FUNCTOR: define-direct-array ( T -- )

A' IS ${T}-array
>A' IS >${T}-array
<A'> IS <${A'}>

A DEFINES direct-${T}-array
<A> DEFINES <${A}>

NTH [ T dup c-getter array-accessor ]
SET-NTH [ T dup c-setter array-accessor ]

WHERE

TUPLE: A
{ underlying alien read-only }
{ length fixnum read-only } ;

: <A> ( alien len -- direct-array ) A boa ; inline

M: A length length>> ;
M: A nth-unsafe underlying>> NTH call ;
M: A set-nth-unsafe underlying>> SET-NTH call ;
M: A like drop dup A instance? [ >A' execute ] unless ;
M: A new-sequence drop <A'> execute ;

INSTANCE: A sequence

;FUNCTOR

This FUNCTOR: ... IS / DEFINES ... WHERE ... ;FUNCTOR business is not new syntax in the language; it's just parsing words defined in the functors vocabulary. Essentially we are defining a word, define-direct-array, which generates code, parameterized by a single input T, which in this case is a C type name. Note that what's going on here is more elaborate than text substitution; the functor body is parsed once, and converted into code which instantiates it, rather than simply expanding at usages like C preprocessor macros do.

Functors may seem like an elaborate new piece of functionality, comparable in complexity to the object system or locals, but in reality, the emperor has no clothes; they have a trivial desugaring.

First, inside a functor body, the TUPLE:, :, M: and INSTANCE: parsing words are shadowed by special words in the functors vocabulary named `TUPLE:, `:, and so on.

Next, IS and DEFINES desugar as follows.
A'      IS ${T}-array
expands into
A' [ [ I[ ${T}-array]I ] with-string-writer dup search [ ] [ no-word ] ?if ]
<A>     DEFINES <${A}>
expands into
<A> [ [ I[ <${A}>]I ] with-string-writer create-in ]

The I[ syntax is just the interpolate library, which turns this,
I[ Hello, ${name}.]I
into this,
"Hello, " write name write "." write
A functor definition of the form
FUNCTOR: foo ( inputs -- )
...names...
WHERE
...body...
;FUNCTOR
is really just syntax sugar for
:: foo ( inputs -- )
[let* | ...names... | ...body... ] ;

With the added caveat that inside the body, : is remapped to `: and so on.

As you can see, a seemingly powerful feature melts away as it becomes apparent it's only some rather trivial syntax sugar. Words such as `: desugar like so:
`: foo 2 2 + . ;
expands into
foo [ 2 2 + . ] define
Let's look at a functor and its desugaring. Here is a simple functor: it takes a tuple class name as input, and defines a constructor word <foo> which makes a new instance of this tuple class using new. So it is like the built-in parsing word C:, except it uses new instead of boa:
FUNCTOR: make-constructor ( name -- )

tuple IS ${name}
<tuple> DEFINES <${name}>

WHERE

: <tuple> tuple new ;

;FUNCTOR

Given the above, this functor is really equivalent to the following:
:: make-constructor ( name -- )
[let* | tuple [ name dup search [ ] [ no-word ] ?if ]
<tuple> [ [ "<" write name write ">" write ] with-string-writer create-in ] |
<tuple> [ tuple new ] define
] ;

We could keep going -- the :: word, which defines a word with named parameters, is itself a library word, in the locals vocabulary, which desugars down to stack code.

In this simple example, the hand-coded version is shorter than the functor, but the point of using a functor is that you can make the meta-code look like the code that it is generating. For the more complex functors that generate the various forms of specialized arrays, this is invaluable because it makes the code easier to maintain and debug.

Finally, the material described in this post, especially the implementation of functors, is pretty advanced stuff. I don't expect most people using Factor to know or care about how this works; after all, you can use specialized arrays without "peeking under the hood", so to speak. However, all of this meta-programming power is there when you are ready for it.

Monday, December 01, 2008

Factor now fully supported on 64-bit Windows

The 64-bit Windows porting effort is complete. All vocabularies load and all tests pass (except PostgreSQL doesn't have 64-bit client libraries). A binary package is available. This brings the total number of supported platforms to 13. Enjoy!

Sunday, November 30, 2008

New features in SMTP and SSL library allow sending e-mail through Gmail

Although I added SSL support to Factor's I/O library recently, it was missing a feature needed by SMTP-TLS: upgrading insecure connections to secure ones. Unlike HTTPS, where the client connects to a special port number and initiates a secure handshake as soon as the connection is established, SMTP uses a STARTTLS command, which is sent in plain text, and handshaking only begins after this command is sent.

To accommodate this use case, I added two new words to the vocabulary: send-secure-handshake and accept-secure-handshake. They are meant to be used by a client and a server, respectively, that wish to upgrade the current connection to a secure one.

I also updated the smtp vocabulary to support the STARTTLS and AUTH PLAIN features of the SMTP protocol. The end result is that the SMTP vocabulary can now send e-mail through Gmail's SMTP servers. An example is shown below. The key thing to notice here is that we set the smtp-auth and smtp-tls? variables before sending the e-mail. As you can see, the SMTP library is easy to use and the code for sending an e-mail is succinct; we create an e-mail object, fill in some slots, and send it off:

"" 587 <inet> smtp-server set
smtp-tls? on
"" "YourPassword" <plain-auth> smtp-auth set

<email>
"" >>from
{ "" } >>to
"We should probably beer" >>subject
"For teh win" >>body
send-email

The SMTP library is still missing some functionality, like support for character encodings, attachments and setting custom headers; I'll be adding those over time.
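For comparison, Python's smtplib follows the identical protocol sequence: connect in plain text, issue STARTTLS to upgrade the connection, then authenticate and send. A sketch (the addresses and password are placeholders; the network-touching function is defined but not called here):

```python
import smtplib
from email.message import EmailMessage

# Build the message -- the analogue of the <email> object above.
msg = EmailMessage()
msg["From"] = "me@example.com"      # placeholder address
msg["To"] = "you@example.com"       # placeholder address
msg["Subject"] = "We should probably beer"
msg.set_content("For teh win")

def send_via_gmail(message, password):
    # Plain-text connect, then upgrade the connection -- the STARTTLS
    # dance described above -- before authenticating and sending.
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(message["From"], password)
        server.send_message(message)
```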

Saturday, November 29, 2008

Faster generic arithmetic and integer overflow checking

While the Factor compiler does a pretty good job of inferring types and eliminating dispatch where possible, runtime polymorphism still ends up being used a lot. Integer arithmetic presents an additional challenge. Factor supports arbitrary-precision integers, and it tries to make this support transparent. However, the flipside of this is that even if the compiler can figure out that all of the values in an expression are integers, it may not be able to eliminate the overflow check, depending on whether or not the ranges of the values are known. Because of this, making runtime polymorphism efficient is still important. In the last few days, I worked on making integer arithmetic more efficient.

Consider a word like +. If compile-time type information is not available, this word has to look at the types of the top two stack items, and take appropriate action depending on whether they are fixnums, bignums, ratios, floats, or complex numbers. Until now, this dispatch was implemented by two indirect jumps. The type of the first object is used to key a dispatch table, where each entry then looks at the type of the second object and keys into a second-level dispatch table; the entries in the latter are the actual methods implementing the arithmetic operation.

In Factor, the generic word system generates Factor code for performing method dispatch. If you look at the definition of a word such as + in an older build, you'll see this two-level jump table pattern, implemented with Factor's dispatch combinator:
( scratchpad ) \ + def>> .
over tag {
dup tag {
[ fixnum=>+ ]
[ [ >bignum ] dip bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
dup tag {
[ >bignum bignum=>+ ]
[ bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
[ \ + no-math-method ]
[ \ + no-math-method ]
dup tag {
[ ratio=>+ ]
[ ratio=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
dup tag {
[ >float float=>+ ]
[ >float float=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ >float float=>+ ]
[ float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
dup tag {
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ complex=>+ ]
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
[ \ + no-math-method ]
} dispatch

The dispatch combinator compiles down to an indirect jump, and indirect jumps are expensive on modern CPUs. Also, note that this dispatch scheme did not favor any single numeric type over another. However, in practice, operations on bignums and ratios will always be slow, and dispatch overhead is negligible there; whereas code that works with floating point numbers and complex numbers can only be fast if the compiler can infer types and unbox values statically, in which case there is no runtime dispatch at all. So the performance of runtime arithmetic dispatch on types other than fixnums is largely irrelevant in Factor. Indeed, the overwhelming majority of calls to generic arithmetic operations involve fixnums, so we need some kind of trick to avoid two indirect jumps in this common case.

There are many solutions, but the one I settled on is pretty much the simplest one. In Factor, all values are tagged pointers (except for unboxed floats and such, which the compiler uses, but they cannot be "observed" in their unboxed state from the outside). A tagged pointer is a value where the low 3 bits encode a type, and the remaining bits encode a value. If the type tag is 0, the value is interpreted as an integer; otherwise, it is a pointer. All values in the heap are aligned on 8 byte boundaries, so if we mask off the 3 tag bits, we get a valid pointer.

So if we have two values, stored in registers x and y, then both have a tag of 0 if and only if their bitwise OR has a tag of zero. That is, we can check that both values are fixnums with a single conditional.
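A quick Python model of this trick, using the 3-bit tag layout described above (the specific pointer value is made up):

```python
# Model of the tagged-pointer check: the low 3 bits of every cell hold
# a type tag, and tag 0 means fixnum.
TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1   # 0b111

def make_fixnum(n):
    return n << TAG_BITS         # tag 0: payload in the high bits

def both_fixnums(x, y):
    # OR-ing the cells cannot clear a set tag bit, so the result has
    # tag 0 exactly when both inputs have tag 0 -- one test, one branch.
    return (x | y) & TAG_MASK == 0

heap_ptr = 0xCAFE8 | 2           # an 8-byte-aligned address with tag 2

assert both_fixnums(make_fixnum(42), make_fixnum(7))
assert not both_fixnums(make_fixnum(42), heap_ptr)
```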

Here is the new Factor code generated for the dispatch performed by +:
( scratchpad ) \ + def>> .
both-fixnums? [ { fixnum fixnum } declare fixnum=>+ ] [
over tag {
dup tag {
[ { fixnum fixnum } declare fixnum=>+ ]
[ { fixnum bignum } declare [ >bignum ] dip bignum=>+ ]
[ { fixnum tuple } declare \ + no-math-method ]
[ \ + no-math-method ]
[ { fixnum ratio } declare ratio=>+ ]
[ { fixnum float } declare [ >float ] dip float=>+ ]
[ { fixnum complex } declare complex=>+ ]
[ { fixnum POSTPONE: f } declare \ + no-math-method ]
} dispatch
dup tag {
[ { bignum fixnum } declare >bignum bignum=>+ ]
[ { bignum bignum } declare bignum=>+ ]
[ { bignum tuple } declare \ + no-math-method ]
[ \ + no-math-method ]
[ { bignum ratio } declare ratio=>+ ]
[ { bignum float } declare [ >float ] dip float=>+ ]
[ { bignum complex } declare complex=>+ ]
[ { bignum POSTPONE: f } declare \ + no-math-method ]
} dispatch
[ \ + no-math-method ]
[ \ + no-math-method ]
dup tag {
[ { ratio fixnum } declare ratio=>+ ]
[ { ratio bignum } declare ratio=>+ ]
[ { ratio tuple } declare \ + no-math-method ]
[ \ + no-math-method ]
[ { ratio ratio } declare ratio=>+ ]
[ { ratio float } declare [ >float ] dip float=>+ ]
[ { ratio complex } declare complex=>+ ]
[ { ratio POSTPONE: f } declare \ + no-math-method ]
} dispatch
dup tag {
[ { float fixnum } declare >float float=>+ ]
[ { float bignum } declare >float float=>+ ]
[ { float tuple } declare \ + no-math-method ]
[ \ + no-math-method ]
[ { float ratio } declare >float float=>+ ]
[ { float float } declare float=>+ ]
[ { float complex } declare complex=>+ ]
[ { float POSTPONE: f } declare \ + no-math-method ]
} dispatch
dup tag {
[ { complex fixnum } declare complex=>+ ]
[ { complex bignum } declare complex=>+ ]
[ { complex tuple } declare \ + no-math-method ]
[ \ + no-math-method ]
[ { complex ratio } declare complex=>+ ]
[ { complex float } declare complex=>+ ]
[ { complex complex } declare complex=>+ ]
[ { complex POSTPONE: f } declare \ + no-math-method ]
} dispatch
[ \ + no-math-method ]
} dispatch
] if

Notice that it is essentially the same as the old version, except the two-level jump table has been wrapped in a conditional. If both-fixnums? outputs true, we go directly to the fixnum case; otherwise we do the two-level dispatch.

Since both-fixnums? needs to bitwise OR pointers together, I had no choice but to make it a VM primitive. The non-optimizing compiler inlines a fixed code sequence for calls of both-fixnums?, and the optimizing compiler expands it into a series of low-level IR operations:
( scratchpad ) [ both-fixnums? [ "Fixnums!" ] [ "At least one is not a fixnum" ] if ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
_label 1
##peek V int-regs 692262 D 0
##peek V int-regs 692263 D 1
##copy V int-regs 692264 V int-regs 692262
##or V int-regs 692264 V int-regs 692264 V int-regs 692263
##copy V int-regs 692265 V int-regs 692264
##and-imm V int-regs 692265 V int-regs 692265 7
_compare-imm-branch 3 V int-regs 692265 0 cc/=
_label 2
##inc-d 1
##load-indirect V int-regs 692269 "Fixnums!"
##replace V int-regs 692269 D 0
_label 3
##inc-d 1
##load-indirect V int-regs 692270 "At least one is not a fixnum"
##replace V int-regs 692270 D 0
_label 4

So the machine code generated for a generic arithmetic word such as + now begins as follows:
    mov    (%r14),%rax
mov -0x8(%r14),%rcx
or %rcx,%rax
and $0x7,%rax
cmp $0x0,%rax
jne slow_dispatch
... handle fixnum+fixnum here ...
... do table dispatch here ...

We load two values from the data stack, or them together, mask off the low 3 bits, compare with zero, and jump. Indeed, those readers who know the ways of x86 assembly will recognize that there is a shorter code sequence which can accomplish this same task:
    mov    (%r14),%rax
mov -0x8(%r14),%rcx
or %rcx,%rax
test $0x7,%rax
jnz slow_dispatch
... handle fixnum+fixnum here ...
... do table dispatch here ...

Once I figure out a nice way to incorporate the test instruction into my code generator and value numbering passes, I will switch things up and eliminate the unnecessary instruction there.

This takes care of making dispatch faster. The next challenge was speeding up integer overflow handling. If the optimizing compiler's high-level optimizer can prove that a call to + has fixnum inputs, and in addition, the result will not overflow -- either because the ranges of the input values are constrained, or the output is passed to bitand -- it will replace it with a call to the fixnum+fast primitive. The low-level optimizer turns a call to fixnum+fast into a single low-level IR instruction, and this becomes a single machine instruction. Very efficient.

However, if the high-level optimizer cannot prove that the result will not overflow, but it does know the inputs are fixnums, it replaces the call to + with a call to fixnum+. This still avoids dispatch overhead; however, until now, the low-level optimizer did not do anything special for fixnum+: it would compile it as a subroutine call to a VM function which added two integers in C, checking for overflow. This was slow, because C does not have a way to test the CPU's overflow flag.

I made the low-level optimizer aware of the fixnum+ word; it converts it into a low-level instruction which can then participate in value numbering and register allocation. The CPU-specific backend is then responsible for generating an optimal machine code sequence for fixnum+. On x86 and PowerPC, this can use the CPU's overflow flag, which is more efficient than trying to simulate the same in C. If the overflow flag is set after the addition, fixnum+ performs a subroutine call into the VM; in the common case, where there is no overflow, there is no call and the arithmetic is done inline.
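The intended semantics can be sketched in Python, assuming a 64-bit build where a fixnum keeps 61 payload bits after the 3 tag bits (the exact width is an assumption here); Python's unbounded integers stand in for bignums:

```python
# Sketch of fixnum+ semantics: add, then check whether the result still
# fits in a fixnum. In machine code the range check is just the CPU's
# overflow flag; only the rare overflowing case calls into the VM.
FIXNUM_BITS = 61                      # assumed payload width (64 - 3 tag bits)
FIXNUM_MAX = (1 << (FIXNUM_BITS - 1)) - 1
FIXNUM_MIN = -(1 << (FIXNUM_BITS - 1))

def fixnum_add(x, y):
    result = x + y
    if FIXNUM_MIN <= result <= FIXNUM_MAX:
        return ("fixnum", result)     # common case: stays inline
    return ("bignum", result)         # overflow: subroutine call into the VM

assert fixnum_add(1, 2) == ("fixnum", 3)
assert fixnum_add(FIXNUM_MAX, 1)[0] == "bignum"
```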

Here is an example of machine code emitted for fixnum+:
    sub    $0x8,%r14
mov (%r14),%rax
mov 0x8(%r14),%rcx
add %rcx,%rax
jno no_overflow
... handle overflow ...
no_overflow:
mov %rax,(%r14)

I omitted the lengthy setup for the subroutine call into the VM, and register assignments will vary depending on the placement of fixnum+ in the code.

The results were very interesting. Some benchmarks, such as spectral-norm and mandelbrot, showed no improvement, because the compiler eliminates all the dispatch and overflow checks from the inner loops anyway. Other benchmarks showed an impressive gain. This is good because it means that I didn't waste my time with implementing the above, but it is also bad because it shows that the high-level optimizer could probably do a better job of inferring types in some of these benchmarks!

Here are the results; I ran each benchmark 5 times before and after the change, measuring the runtime in milliseconds:
Benchmark           Before                        After
fasta               9492 8985 8975 9310 9219      7841 7818 8115 7807 7777
reverse-complement  5815 5611 5628 5645 5645      5176 5152 5177 5173 5289
binary-trees        2911 2867 2870 2890 2894      2163 2142 2186 2168 2136
fib2                113 112 114 112 112           82 83 84 85 85
fib3                301 302 301 300 301           260 261 259 260 260
project-euler.134   427 427 427 426 424           314 313 315 318 313

Wednesday, November 26, 2008

Some recent UI rendering fixes

Over the last week or so I've fixed several OpenGL rendering problems in the Factor UI.

The first problem is that on Linux, with certain ATI video cards, calling glPolygonMode() to disable polygon fill does not work, and as a result the UI would render with no text visible; borders around buttons would be drawn as filled rectangles and this would overwrite all of the text. I fixed this and modernized the UI rendering code to use vertex arrays at the same time.

Then, I decided to address a long-standing issue: rectangles and lines were not rendered pixel-perfect on all systems. You see, with OpenGL, simply rendering a rectangle with integer co-ordinates doesn't give you the results you would expect. On my old Windows laptop, for example, rectangles would render with the bottom-right pixel missing. Getting this to work in a cross-platform manner is surprisingly hard, and it took me a lot of back-and-forth to get consistent rendering across the machines I tested on (an Intel iMac from a year or so ago, my MacBook Pro with NVidia graphics, Linux and Windows in VirtualBox, and an old Dell laptop with Windows XP and Intel integrated graphics). For the record, the "correct" way to draw an outlined rectangle is to offset the top-left pixel by 0.5,0.5, and the bottom-right pixel by -0.3,-0.3. Similar heroics were required for line rendering. In the end, I made my job semi-automated by writing a simple program, now in the ui.render.test vocabulary, that renders some 2D shapes with OpenGL.
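As a sketch of the corner-offset fix (plain geometry only; the function name and vertex layout are made up, not the actual UI code):

```python
# Compute GL_LINE_LOOP vertices for a pixel-perfect rectangle outline:
# nudge the top-left corner by (0.5, 0.5) and the bottom-right corner
# by (-0.3, -0.3), per the rule described above.
def outline_vertices(x, y, w, h):
    x0, y0 = x + 0.5, y + 0.5
    x1, y1 = x + w - 0.3, y + h - 0.3
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

vs = outline_vertices(0, 0, 10, 10)
assert vs[0] == (0.5, 0.5)
assert abs(vs[2][0] - 9.7) < 1e-9 and abs(vs[2][1] - 9.7) < 1e-9
```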

The vocabulary renders the shapes and then compares the results with a reference drawing. It reads the rendered pixels with glReadPixels() and loads the drawing with Doug's graphics.bitmap library.

Of course, it turned out that vertex arrays themselves have issues on at least one OpenGL implementation. I started to receive reports of random segfaults on older MacBooks with the Intel X3100 graphics chip. Chris Double then came across a mailing list post which explained the situation. Turns out that with this driver, rendering a GL_LINE_LOOP vertex array causes a segfault. The workaround is to render a GL_LINE_STRIP where the last vertex is a copy of the first vertex.

There are still a couple of OpenGL problems I haven't been able to resolve. A few users report random screen corruption on Windows machines with Intel graphics. There is also a problem where the Factor UI comes up black if used with Compiz on Linux, however this appears to affect any GL app -- even glxgears. The workaround is to enable indirect rendering with Compiz.

It has been three years since I first ported Factor's UI to OpenGL and I'm still finding new issues in drivers that need workarounds. It's a bit disappointing, but it does seem that in recent times, OpenGL driver quality is improving. I still think OpenGL is a good way to render graphics in a cross-platform manner; it has plenty of features and even runs on mobile platforms (OpenGL ES). Our UI only needs a tiny bit of platform-specific code for each platform: opening windows, creating GL contexts, receiving events, and working with the clipboard. Rendering is cross-platform.

A few people have suggested using Cairo instead, and while Factor has a Cairo binding now, I'm not keen on using it to render the entire UI. I worry that by switching from OpenGL to Cairo, we'll just trade one set of bugs and workarounds for another. Another problem is that Cairo would be another external dependency that deployed Factor applications would have to ship; Cairo is quite bloated these days, so that would be unacceptable. One other alternative is to use native APIs: perhaps Cairo rendering on Linux, GDI+ on Windows, and CoreGraphics on Mac. However, this would require too much effort on my part, so it's not something I'm willing to dive into at this point.

Monday, November 24, 2008

Comparing the performance of Factor and V8

V8 is, of course, Google's new, much-hyped JavaScript VM. It combines a simple JIT with inline caching, and along with SquirrelFish, it promises to take JavaScript performance to the next level.

Factor does not do inline caching (although this is planned), however the language itself is more static and the compiler performs more advanced optimizations than the V8 JIT. So I thought it would be interesting to see how these two approaches to dynamic language implementation stack up.

I compiled the V8 trunk today on my Intel Core 2 Duo. Factor is of course the latest Factor from git.

I chose the following benchmarks from the language shootout:
  • spectral-norm
  • binary-trees
  • fasta
  • nsieve
  • nsieve-bits
  • partial-sums

Unfortunately, I couldn't get reverse-complement or a few others to work in V8, because they use the readline() function which does not appear to exist. It would have been interesting to compare more benchmarks.

I got the JavaScript versions from the shootout website; the Factor versions are in the Factor repository.

I ran the V8 benchmarks from the command line, like so:
./shell foo.js
And I ran the Factor benchmarks from within the Factor environment.

Without further ado, and without any attempt at post-mortem or analysis, here are the results. All running times are in seconds:
Benchmark       Factor   V8
binary-trees    3.1      3.8
fasta           8.9      12
nsieve          0.6      1.3
nsieve-bits     1.5      6.9
partial-sums    2.6      2.3
spectral-norm   29       85

Sunday, November 23, 2008

Factor presentation at OTUG

I will be giving a talk about Factor at OTUG, The Object Technology User's Group of Minneapolis/St Paul on December 16th. If you're in the Twin Cities area and are interested in Factor, come check it out! There will be free pizza afterwards.

Tuesday, November 11, 2008

Factor port to Win64 in progress

A year ago I tried to get a working gcc setup on 64-bit Windows with zero success. This time around, I had better luck; the MinGW-Win64 project has a functional gcc port out and I was able to use it to compile the Factor VM.

The main hurdle to overcome is that Win64 uses a different calling convention from Linux/x86-64 and Mac OS X/x86-64. I got the non-optimizing compiler generating valid code, and set things up so that we now have two x86.64 boot images, boot.winnt-x86.64.image and boot.unix-x86.64.image, containing different non-optimizing compiler backends (most of the backend is shared, but some parameters are configured differently). After some messing around, I got it to load various modules, including the optimizing compiler.

The main thing that tripped me up at this stage is that I did not have a working gdb, so certain problems were very hard to debug. For example, on my first reading of the Win64 calling convention, I missed the fact that you must reserve 32 bytes in your stack frame for the callee to mess with; if you put non-volatile data there, then calling any function will clobber your stack frame. This manifested in a most bizarre fashion: Factor would work if the VM was compiled with -O3 and crash if compiled with -g. Usually we see the opposite, when we find gcc bugs; this time it was a bug in Factor, and the reason it would only crash with optimizations disabled is that the gcc optimizer would use registers to store values, instead of scribbling over the volatile portion of the caller's stack frame. A day of lost work later, I finally figured out the problem, and it was much smoother from there on.

Next, I started updating the optimizing compiler backend. Again, the bulk of the work was in figuring out the calling convention. Win64 passes and returns structures differently from the other x86.64 platforms, so I had to add some OS-specific hooks to the x86.64 optimizing compiler backend too.

Overall, Microsoft's calling convention seems simpler than the one used by Linux and Mac OS X, especially as far as passing structures by value is concerned. Other than the initial hurdle of figuring out how the reserved scratch space in the stack frame works, I was able to get things working fairly quickly.

The tests all pass now, however testing some real C library bindings still reveals problems (hence the FFI tests are not exhaustive enough and need to be amended). I also need to do a lot of code cleanup before any of this is in a releasable state, so don't get your hopes up yet -- it will be a few more days before you can build Factor from git on Win64. Binary packages will follow soon after; this is mostly conditional on getting various external packages installed, such as PostgreSQL, SQLite, OpenSSL, and gdb. This is due to the fact that the build farm tests our most important C library bindings, and doesn't release anything if the tests fail (which is quite nice; if a bug in the C library binding prevents something important like OpenSSL from working, you really don't want users to come along and download the result).

So while I work on finishing the port, here is a screenshot from when I got the FFI just far enough along to render the UI workspace and a couple of demos, in a very broken fashion (no text!).

Win64 seems rather neglected by the open source community; even gcc is only available in the experimental MinGW snapshots. However I think supporting it is important; moving to x86.64 can offer a performance boost from the additional registers and RIP-relative addressing mode, and the ability to address more than 4GB of RAM is going to be more and more important too. Very soon now Factor will be one of the few (or the only?) open source, natively-compiled, dynamic languages which can run on 64-bit Windows as well as 12 other platforms. Exciting times ahead.

Thursday, November 06, 2008

Further performance gains

One thing I've learned over the last few months is that minor improvements to performance, accumulated over time, are just as important as big architectural changes which result in big gains. With the new code generator in place, I spent some time working on a few tweaks I've been thinking about for a while.

Tuple dispatch overview

First up, I sped up method dispatch on tuples by eliminating some unnecessary conditional branches and memory accesses. In Factor, the object system is implemented in the language, and so all dispatch is done by generated Factor code. This code is generated at compile time. For example, we can define a tuple,
TUPLE: dog name breed ;

and look at the code generated for the breed accessor:
( scratchpad ) \ breed>> def>> .
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ dog eq?
[ drop { dog } declare dog=>breed>> ]
[ drop object=>breed>> ] if
] [ drop object=>breed>> ] if

As you can see above, the generated code uses various low-level primitives to accelerate dispatch. Usually you will never work with these primitives directly. The words named class=>generic are not real words in the dictionary; they are word objects corresponding to methods. So methods are essentially words, except they're not part of any vocabulary. Again, this is an implementation detail and you don't need to be aware of how methods are implemented behind the scenes. In the image where I defined this tuple, I had no other tuples with a slot named breed, so the generic accessor only has one method, and the code generated for a generic with one method is very simple. For generics with many methods, more complex code is generated; if you have enough methods, a hashtable dispatch is generated.

The general philosophy of method dispatch on tuples is as follows. To each tuple, we associate an integer which is the length of the superclass chain from the tuple up to the root class of tuples, tuple. I call this the tuple's "echelon". Every tuple instance in the heap holds a pointer to a tuple layout object in its first slot. The tuple layout object holds the echelon, the sequence of superclasses, and the tuple size.

The tuple class methods of a generic word are sorted into echelons, and dispatch proceeds by starting at the highest echelon and going down until a suitable method is found (or an error is thrown). At each echelon, the generic word looks at the corresponding entry in the superclass list of the tuple on the stack, and performs a hashtable lookup. So dispatch runs in O(n) time, where n is the maximum echelon number of the tuple classes participating in the generic word.

For example, suppose we have the following inheritance hierarchy -- it's contrived, because in practice you probably would not have square inherit from rectangle, but it demonstrates the point here:

Now, suppose we have a generic word:
GENERIC: draw ( shape -- )

M: text draw ... ;
M: rectangle draw ... ;
M: parallelogram draw ... ;
M: circle draw ... ;
M: triangle draw ... ;

Note that if we pass a square instance to draw, we want it to invoke the M: rectangle method. Here are the participating classes, sorted into echelons:
level 1: text
level 2: circle
level 3: triangle
level 4: rectangle, parallelogram

Method dispatch proceeds as follows:
  • If the tuple on the stack has echelon >= 4, we get the 4th element in its superclass chain, and check if it's rectangle or parallelogram. If so, we dispatch to that method. Note that the 4th element of the superclass chain of both rectangles and squares is rectangle.
  • If the tuple on the stack has echelon >= 3, we get the 3rd element and check if it's a triangle. If so, we dispatch to that method.
  • If the tuple on the stack has echelon >= 2, we get the 2nd element and check if it's a circle. If so, we dispatch to that method.
  • If the tuple on the stack has echelon >= 1, we get the 1st element and check if it's text. If so, we dispatch to that method.

For this generic word, a maximum of four tests must be performed because the inheritance hierarchy is very tall. This situation is rare, since for the most part inheritance is very flat.
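Here is a toy Python model of that walk for the draw generic. The superclass chains are hypothetical and omit tuple itself; only the echelon-by-echelon lookup order follows the description above.

```python
# Each class stores its superclass chain, ending at itself; the echelon
# is the chain's length. Dispatch scans echelons from highest to lowest,
# looking up the chain entry at that level in the method table.
CHAINS = {
    "text":          ["text"],
    "circle":        ["shape", "circle"],
    "triangle":      ["shape", "polygon", "triangle"],
    "rectangle":     ["shape", "polygon", "quad", "rectangle"],
    "square":        ["shape", "polygon", "quad", "rectangle", "square"],
    "parallelogram": ["shape", "polygon", "quad", "parallelogram"],
}

METHODS = {   # echelon -> { class -> method }
    4: {"rectangle": "M: rectangle draw", "parallelogram": "M: parallelogram draw"},
    3: {"triangle": "M: triangle draw"},
    2: {"circle": "M: circle draw"},
    1: {"text": "M: text draw"},
}

def dispatch(cls):
    chain = CHAINS[cls]
    for echelon in (4, 3, 2, 1):
        if len(chain) >= echelon:
            method = METHODS[echelon].get(chain[echelon - 1])
            if method is not None:
                return method
    raise TypeError("no method")

assert dispatch("square") == "M: rectangle draw"  # square's 4th entry is rectangle
assert dispatch("circle") == "M: circle draw"
```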

Method dispatch: removing conditionals and indirection

The first thing I realized is that every tuple has an echelon level of at least 1, since it must have itself and tuple in the superclass chain. So the conditional test on the final step is unnecessary. Furthermore, if all tuples have echelon 1, there is no need to even load the echelon of the tuple on the stack into a register.

Next, I decided to put the superclass list inside the tuple layout object itself, instead of having the tuple layout object reference an array. This removes one level of indirection, which reduces CPU cache pollution.

To illustrate these improvements, look at the code generated for the call word before these improvements:
dup tag dup 3 eq? [
drop dup 0 slot 8 fixnum-fast dup 6 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [
dup 2 eq? [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 5 slot dup 1 fixnum>= [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 4 slot
{ array } declare 3 slot { word } declare
dup \ curry eq?
[ drop { curry } declare curry=>call ] [
dup \ compose eq?
[ drop { compose } declare compose=>call ]
[ drop object=>call ] if
] if
] [ drop object=>call ] if
] [ drop object=>call ] if
] if

And after:
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ compose eq?
[ drop { compose } declare compose=>call ] [
dup \ curry eq?
[ drop { curry } declare curry=>call ]
[ drop object=>call ] if
] if
] [
dup 3 eq? [
drop dup { hi-tag } declare 0 slot dup 14 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [ drop object=>call ] if
] if

Method dispatch: faster hashcode lookup

If a generic word has more than 4 methods at the same echelon level, a hashtable dispatch is generated instead of a series of linear tests. Formerly, the generated code would indirect through the class word to look up the hashcode. Again, this is bad for locality, because it pulls the entire word object's cache line in for no good reason. So what I did was intersperse the hashcodes with the superclass chain in the tuple layout object. The tuple layout object will already be in the cache, since we just read one of the superclass chain entries, so reading the hashcode is essentially free.

Method dispatch: re-ordering tests

I noticed that the >fixnum generic word had conditionals in an essentially random order. The cut-off between a series of conditionals and a jump table is 4 methods, and >fixnum, which defines methods on every subtype of the reals, is the borderline case:
( scratchpad ) \ >fixnum def>> .
dup tag dup 5 eq?
[ drop { float } declare float=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if

For various reasons, most >fixnum calls are made with a fixnum on the stack. By sorting the methods by tag number before generating the generic, I was able to get it to look like this:
( scratchpad ) \ >fixnum def>> .
dup tag dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 5 eq?
[ drop { float } declare float=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if

This reduces the number of conditionals in the common case.

I/O system tweaks and string-nth intrinsic

I made some changes to io.ports and io.buffers, mostly adding inline declarations. I also added a hint to the push word for the sbuf class; stream-readln would push every character onto an sbuf, and adding this fast path eliminated some method dispatch. Finally, I made string-nth a compiler intrinsic rather than a call into the VM, which means that words where the compiler infers you're operating on strings will inline the intrinsic and perform fewer memory loads/stores and subroutine calls than before.

Eliminating useless branches with value numbering

Take a look at what the high-level optimizer does to the following quotation:
( scratchpad ) [ [ { + - * } member? ] contains? ] optimized.
dup length swap 0 -rot \ ( gensym ) [
pick pick < [
>r 2dup
>r nth-unsafe dup \ * eq? [ t ] [ f ] if
[ drop t ] [
dup \ - eq? [ t ] [ f ] if
[ drop t ]
[ \ + eq? [ t ] [ f ] if [ t ] [ f ] if ] if
] if r>
>r rot r>
[ 2drop ]
[ >r >r 1 +-integer-fixnum r> r> ( gensym ) ] if
] [ 3drop f ] if
] label dup [ ] [ ] if [ t ] [ f ] if

A lot of dispatch is eliminated and methods are inlined, but there is some pretty crappy control flow redundancy there. I extended the low-level optimizer to eliminate it when building the low-level IR.

One of the worst offenders is the = word, defined as follows:
: = ( obj1 obj2 -- ? )
2dup eq? [ 2drop t ] [ equal? ] if ; inline

If obj2 is a word, then equal? expands into 2drop f, so we get
2dup eq? [ 2drop t ] [ 2drop f ] if

Then, dead code elimination converts this to
eq? [ t ] [ f ] if

Ideally, we'd get rid of [ t ] [ f ] if altogether. Instead of doing this in the high-level optimizer, I decided to detect this code pattern, along with its negation [ f ] [ t ] if (which is the inlined definition of not) and emit a branchless comparison, instead:
( scratchpad ) [ \ + = ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
_label 1
##load-indirect V int-regs 582552 +
##peek V int-regs 582553 D 0
##compare V int-regs 582555 V int-regs 582553 V int-regs 582552 cc=
##replace V int-regs 582555 D 0

Now, suppose you have a code sequence like [ t ] [ f ] if [ 1 ] [ 2 ] if after inlining. Value numbering builds up an expression chain where one comparison depends on the result of another, and it is able to collapse them down to a single comparison.

Finally, the empty conditional, [ ] [ ] if, manifests in value numbering as a conditional branch instruction where both successors point to the same node. This is replaced with an unconditional branch.
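A minimal model of these two rewrites, treating the code as a flat list of opcode tuples (made-up tuples, much simpler than Factor's actual low-level IR):

```python
# Peephole pass: [ t ] [ f ] if just reproduces the boolean flag it
# consumes, so it is deleted; [ f ] [ t ] if is an inlined `not`, so it
# becomes a branchless negation. Everything else passes through.
def simplify(ir):
    out = []
    for op in ir:
        if op == ("if", "t", "f"):
            pass                  # flag is already on the stack: drop the branch
        elif op == ("if", "f", "t"):
            out.append(("not",))  # branchless negation
        else:
            out.append(op)
    return out

assert simplify([("eq?",), ("if", "t", "f")]) == [("eq?",)]
assert simplify([("eq?",), ("if", "f", "t")]) == [("eq?",), ("not",)]
```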


Here are the timings, in milliseconds, for 10 runs of the reverse-complement benchmark:
5309 5310 5342 5297 5305 5635 5344 5313 5338 5303

This represents a 35% improvement over last time.

Finally, bootstrap is faster too. On my laptop it now takes around 3 minutes 45 seconds, which is about 40 seconds faster than before.

Edit: some more minor optimizations improved reverse-complement runtime by another 500 milliseconds or so.