Monday, June 30, 2008

An interesting use-case for the spread combinator

The spread combinator is the least frequently used of any of the cleave family.

Recall that it takes a sequence of quotations, and applies the nth quotation to the nth stack element. So for example,
1 2 3 4 { [ 1 + ] [ 1 - ] [ 2 / ] [ 2 * ] } spread

Leaves us with the following stack:
2 1 1+1/2 8

It is easy to see why it is not used frequently; very rarely do we have 4 items on the stack, and then decide to manipulate each one individually. The bi* and tri* combinators special-case spread for 2 or 3 stack items, respectively; bi* is used pretty often, tri* not so much.

However, it turns out cleave is very useful when meta-programming. Consider this code from the FFI implementation:
: (param-prep-quot) ( parameters -- )
dup empty? [
drop
] [
unclip c-type c-type-unboxer-quot %
\ >r , (param-prep-quot) \ r> ,
] if ;

: param-prep-quot ( node -- quot )
parameters>> [ <reversed> (param-prep-quot) ] [ ] make ;

We traverse a sequence recursively, stopping if it is empty, otherwise we use unclip to break it down into the first element and the of the sequence. We surround the result of the recursive call in a >r/r> pair.

Given something like { "float" "int" "char*" }, this generates code like the following:
utf8 string>alien >r >r >float r> r>

This code was written before spread, and I just discovered that it is a perfect use-case for it:
: (param-prep-quot) ( parameters -- )
c-type c-type-unboxer-quot ;

: param-prep-quot ( node -- quot )
parameters>> [ (param-prep-quot) ] map [ spread ] curry ;

Much more elegant.

This pattern came up in the FFI twice, and I refactored both usages. It also came up in something I'm working on; type declarations for tuple slots. I will blog in detail when this is done, but the basic idea is that you can assert that a tuple slot has a certain type when declaring the tuple class. Storing into the slot will check the type at runtime; the compiler will elide these checks when necessary, and make assumptions about the slot value's type when the slot is read. The boa word, which makes a tuple and fills in the slots from the stack, needs to check that the values on the stack satisfy the type declartions. It does this using a similar technique to the above; mapping over a sequence to make an array of quotations, and then applies to to spread.

Sunday, June 29, 2008

Cedric's code challenge

Cedric Beust posed a coding challenge:

Here is an interesting coding challenge: write a counter function that counts from 1 to max but only returns numbers whose digits don't repeat.

...
Also:
  • Display the biggest jump (in the sequences above, it's 4: 98 -> 102)
  • Display the total count of numbers
  • Give these two values for max=10000

In the comments section of his blog, you can find many solutions, some are long, some are short but have lovely punctuation soup like !/(.).*\1/, 1..MAX; (that's from the Perl solution).

Here is a Factor version that is shorter than anything in the comments there:
USING: kernel math.parser math.ranges math.vectors sequences sets ;

1 10000 [a,b] [ number>string all-unique? ] filter
[ ] [ length ] [ dup 0 prefix v- supremum ] tri

It leaves three values on the stack (hence tri): a sequence of numbers with non-repeating digits, the length of this sequence, and the largest jump between any two numbers in the sequence.

Of course one could claim that this is cheating, since I'm using a lot of library words here. Fair enough, but those library words in the library for a reason: they get used a hell of a lot.

Friday, June 27, 2008

Windows handle inheritance

I was debugging the deploy tool yesterday. The problem was that after I changed the HTTP request/response parsers to use PEGs, the HTTP client stopped working in deployed applications. After fixing this, I added a regression test to ensure this won't be a problem in the future. The test passed on my Mac, I pushed my fixes, and moved on. However, soon I got a notification from the Windows build machine that the test failed on Windows.

After some further investigation, I noticed that if I start the HTTP server, and then spawn the deployed binary from the command line, the test works, but if I spawn the deployed binary from within Factor using run-process, it would fail. So this was not even specific to deployment, or the HTTP code; it would happen any time the following series of events occurred:
  • Factor instance A starts a TCP/IP server socket on port 1234
  • Factor intsance A spawns a Factor instance B
  • Factor instance B attempts to connect to port 1234
  • It times out while establishing the connection

I then realized that handle inheritance was the problem here. It turns out my understanding of handle inheritance was wrong, and this article set me straight. Turns out any handle you create is inheritance by default, so the Factor instance B was inheriting the server socket from A, so when B was attempting to connect to the server socket, it would hit the inherited copy, and not the copy in A which was actually being listened on.

Refactoring the Windows I/O code to set all handles to be non-inheritable by default, and only setting the inheritance flag on handles that are passed to subprocesses for I/O redirection, fixed the problem.

Just goes to show how important continuous integration is.

Wednesday, June 18, 2008

HTTPS support in http.server, some notes on HTTP request/response parsing

Generic servers in Factor


The io.servers.connection vocabulary supersedes the io.server vocabulary, adding support for a connection count limit, SSL, and more configuration flags.

The best way to explain how it works is with an example:
USING: accessors io io.servers.connection ;

<threaded-server>
"hello-world" >>name
100 >>max-connections
1080 >>insecure
1090 >>secure
<secure-config>
"cert/server.pem" >>key-file
>>secure-config
[ "Hello, world." print ] >>handler
start-server

This is a server which sends "Hello, world." to the client, then closes the connection. It accepts standard TCP/IP connections on port 1080, and SSL connections on port 1090. For SSL, it identifies itself to the client with the certificate in certs/server.pem. We set a limit of 100 concurrent connections, and connections are logged to resource:logs/hello-world/*.log.

Here is another example:
USING: accessors io io.servers.connection io.launcher ;

<threaded-server>
"telnet" >>name
10666 >>insecure
[
<process>
"/bin/sh -i" >>command
input-stream get >>stdin
output-stream get >>stdout
+stdout+ >>stderr
run-process
] >>handler
start-server

Please do not run this on a machine facing the Internet; it starts a server on port 10666 which spawns a bash shell, with I/O redirected to the client connection. This example won't work on Windows, or with SSL connections, because those types of streams cannot be stored in the stdin and stdout slots of a process tuple yet; however that will be fixed eventually, and then code such as the above might serve as a prototype of a Factor re-implementation of inetd.

So that's io.server.connections for you: a lot of functionality wrapped up in a small package. Eduardo Cavazos is working on a DNS server in Factor right now, and at some point we're going to isolate some UDP/IP abstractions from his code which are not specific to DNS, and move those to io.servers.packet.

HTTPS support in http.server


Using io.servers.connection, I implemented https support in the HTTP server. This was pretty easy, now that all the components are in place. The main changes I made were to the authentication system in the web framework. It now redirects you to the SSL port when a page requiring login is requested, however this can be disabled (although I would not encourage this).

For web applications not using the authentication framework, there is another useful utility: any HTTP responder can be wrapped in a <secure-only responder; this upgrades the connection SSL if a non-SSL request is made.

To improve performance, I added session resumption support to io.sockets.secure, for both client and server connections; this can be disabled in the <secure-config> instance passed to with-secure-context. The SSL library still needs support for client authentication and it also needs to run on Windows; once those two items are done, it will pretty much be feature-complete. I might write bindings to the OpenSSL cyphers allowing them to be used stand-alone, too, however this is low-priority since I don't need that for anything right now.

Parsing HTTP requests and responses


I also replaced the hand-coded code for parsing HTTP requests and responses with new code that uses Parser Expression Grammars (PEGs). While the new code was not a lot shorter, it does have a more declarative flavor, and is also more correct.

During the course of this exercise, I learned how far modern web browsers deviate from the specifications, particularly in the area of cookies. Cookies are specified by a pair of RFCs, RFC2109 and RFC2965, however the latter seems to be almost universally ignored, and browser and server developers seem to treat the former as more of a set of ideas than a formal spec.

Mark Nottingham blogged about this a little under two years ago, and seemingly, not much has changed since then. There is also this bug in Firefox.

After I mentioned these difficulties in the #concatenative IRC channel, Zed Shaw joined the ensuing discussion and noted that the RFCs themselves are quite vague and seemingly contradict themselves.

The general design of cookies makes it easy to expose vulnerabilities such as this one in the Tomcat application server. I hope my code is secure but I will need to review it later in more detail to be sure.

Saturday, June 14, 2008

What is a fluent interface?

I just saw this on reddit, and I have no clue. Whatever a "fluent interface" is though, seems Factor doesn't need one, because their 21 line example can be written a million times more clearly in 4 lines of Factor:
USING: sequences strings math.ranges io ;
{ "i" "can" "has" "cheezburger" "plz" "?" } [ " " append ] map concat "\n" append
CHAR: A CHAR: Z [a,b] reverse [ CHAR: M > ] filter [ 1string ] map "," join
append print

Tuesday, June 10, 2008

Dequeue protocol, and search dequeues

While debugging the compiler and fixing some performance regressions yesterday, I ran into a situation where I needed a data structure with efficient insertion and removal at both ends, and constant-time membership testing.

This is pretty easy to implement, by combining a double linked list with a hashtable, however I wanted this to have the same interface as a double linked list, so I made a new dequeues vocabulary which abstracts the operations on a double linked list, such as
  • push-front
  • push-back
  • pop-front
  • pop-back

Double linked lists in the dlists vocabulary now implement this protocol instead of exposing dlist-specific operations.

The dequeue protocol also has a dequeue-member? generic word. This word runs in linear time on dlists.

Search dequeues in the search-dequeues wrap an assoc and a dequeue and provide a sub-linear implementation of dequeue-member? (constant time if the assoc is a hashtable). Pushing elements adds them to both the dequeue and assoc; the membership test is then just a call to key? on the assoc.

Saturday, June 07, 2008

Syntax proposal for multi-methods

At the end of March, I outlined the future direction for Factor. In that list of major planned features, two are already done: inheritance and singletons. The new tuple syntax is relatively minor (except for updating existing code), so that leaves multi-methods. I really want to get multi-methods done by the end of June, or sometime in July perhaps, and then focus on the compiler, documentation, bug fixes, and libraries. I'm going to draw a line in the sand soon; no more language changes until 1.0.

While I've implemented a rough draft of multi-methods already, I'm not happy with the syntax. When I revisit multi-methods, which I will soon, I will implement a radically different syntax that Eduardo and I came up with today.

Code will have to be updated for multi-methods anyway; I didn't want to push out a lackluster syntax and have to change it later. I think now I've come up with a very good solution. There are still some minor details to work out, but I'm confident that they're solvable.

First, a bit of background. Here is a word definition in Factor:
: v- ( seq n -- seq' ) [ - ] curry map ;

Here is a generic word definition with the single-dispatch core generics:
GENERIC: present ( obj -- str )

M: string present ;

M: real present number>string ;

M: word present word-name ;

M: object present drop "an object" ;

Here is the equivalent with extra/multi-methods:
USE: multi-methods

GENERIC: present ( obj -- str )

METHOD: present { string } ;

METHOD: present { real } number>string ;

METHOD: present { word ] word-name ;

Here is the syntax we thought of and will most likely implement soon:
: present ( |string -- string ) ;

: present ( |real -- string ) number>string ;

: present ( |word -- string ) word-name ;

Whoa! Those look like ordinary colon definitions! Well, here's the kicker. All words would be generic (and the distinct notion of a "generic word" won't exist; it will just be the default and only type of word). Indeed, any entries in the stack effect which contain | are specializers, with the text after | being the class name to specialize on.

Right now, redefining the same word or method multiple times is an error. This won't change. The following will be invalid:
: write-content ( text -- ) ... ;

: write-content ( text -- ) ... ;

The following will be OK though:
: write-content ( text|string -- ) ... ;

: write-content ( text|xml -- ) ... ;

If you specify a method with no specializers, it is just the default method:
: write-content ( text -- ) ... ;

And if you don't care to make a word generic, you just use : ... ; as before; a generic word with one default method behaves exactly like a normal word.

All stack effects of methods will need to be 'congruent'; the following is not OK:
: combine ( s1|sequence s2|sequence -- result ) ... ;

: combine ( s1|sequence s2|sequence s3|sequence -- result ) ... ;

Now a big selling point of generic words is that you can put methods on generic words in other vocabularies; otherwise, they'd really be no different from pattern matching in ML. Right now, : always defines a word in the same vocabulary. But this will change, and you will be able to refer to qualified word names:
TUPLE: my-seq ... ;

: sequences#length ( s|my-seq -- len ) ... ;

: sequences#nth ( i|integer s|my-seq -- value ) ... ;

Right now, we already have a qualified library for disambiguating the search path with explicit vocabulary names. You first explicitly request qualified access to a vocabulary,
USE: qualified
QUALIFIED: sequences

Then proceed to refer to qualified names:
3 "hello world" sequences:nth

Well, I don't like the use of : there; it was hastily thought out in my opinion (no offence, Dan :-) because it clashes with our usage of : as a word definition operator. So instead I propose #; it has precedent in HTML (anchors in URLs) and Ocaml (method references).

Whereas with :, you will be able to define new words in the current vocabulary, and add methods to words in the current vocabulary, when referring to a word in another vocabulary you will only be allowed to refer to words that already exist. This is to catch mistakes like the following:
: seqences#lgenth ( i|integer s|my-seq -- value ) ... ;

We don't want people to pollute existing vocabularies with their typos, and we want a parse-time error if a generic word in another vocabulary is renamed. Within the same vocabulary, you will be able to make a mistake like the following, which you cannot do right now:
: sheeple ( s|integer -- ) ... ;

: shepele ( s|string -- ) ... ;

Creating two words instead of one word with two methods. The parser won't be able to catch this mistake anymore, because whereas before M: would complain if the generic word didn't exist, : creates the word if necessary. However, if qualified word creation is prohibited, this potential problem is limited in scope to one vocabulary and good unit testing will catch it.

Sometimes you want to define a generic word with no methods; the methods, and the concrete data types they specialize on, are defined in other vocabularies. Right now this is easy;
GENERIC: foo

How would you do this in the new system you ask? Well, since we're merging M: with :, we should merge GENERIC: with its closest non-generic equivalent, which happens to be DEFER:. The DEFER: word basically means, "create a word with this name, so that I can refer to it, but don't give it a definition; I'll supply the definition later". DEFER: is rarely used right now, its mostly just useful for mutually recursive words, which don't come up often. If all words become generic, then DEFER: basically means "give me a generic word, but don't put any methods on it yet". Neat, huh?

Another problem this solves is composable word definition types. Right now we have : for normal words, GENERIC: for generic words and M: for methods. In extra/, there is :: for locals, MEMO: for memoized words, and MACRO: for macros. But what if you want a macro with named parameters and locals? Or a memoized generic? Well, MACRO:: happens to exist, but not every combination is there, and in general there is a combinatorial explosion with the current system. We will still have ::, but it will subsume the old :: as well as M::. Furthermore, we won't need local variants of MEMO: and MACRO: at all. Instead, MEMO: foo ( a b -- c ) will be like DEFER:; it will create a generic word with no methods, but with the additional perk that it memoizes on its inputs.

This removes the double-up of defining words for variants with named parameters; we don't need MEMO::, MACRO::, M::, etc.

So it ends up looking a bit more verbose. Current Factor:
MEMO: fib ( n -- result )
dup 1 > [ [ 1 - fib ] [ 2 - fib ] bi + ] when ;

Proposed definition syntax:
MEMO: fib ( n -- result )

: fib ( n -- result )
dup 1 > [ [ 1 - fib ] [ 2 - fib ] bi + ] when ;

However, macros and memoized words don't come up that frequently, and this not only simplifies the existing implementation by removing defining words, it opens up new possibilities: generic macros, generic memoized words, a generic memoizing word where some methods use named parameters, etc. Cool stuff.

There's a new idiom that this enables here, and that is a word which conceptually isn't generic (even though all words are, you don't actually want to override every word necessarily), it simply type checks its arguments. For example, right now I have a create-method word which does this:
TUPLE: check-method class generic ;

: check-method ( class generic -- class generic )
over class? over generic? and [
\ check-method boa throw
] unless ; inline

: <method> ( class generic -- method )
check-method
[ method-word-props ] 2keep
method-word-name f <word>
[ set-word-props ] keep ;

That's pretty disgusting. Here is the new style:
: <method> ( |class |generic -- method )
[ method-word-props ] 2keep
method-word-name f <word>
[ set-word-props ] keep ;

Scrap the boilerplate!

This formalizes a pattern that's been coming up frequently, and also allows the compiler to perform better optimizations.

I also plan on allowing the type of the return values to be annotated. While this won't be used for dispatching (since its impossible unless you predict the future), the compiler will insert runtime checks to ensure that the returned values are indeed of the asserted type (optimizing them out if necessary). Furthermore, the compiler can assume the return type at call sites, which will enable further optimizations without inlining.

Finally, parts of the documentation can be auto-generated. Here is the help for the append word:
HELP: append
{ $values { "seq1" sequence } { "seq2" sequence } { "newseq" sequence } }
{ $description "Outputs a new sequence of the same type as " { $snippet "seq1" }
" consisting of the elements of " { $snippet "seq1" } " followed by " { $snippet
"seq2" } "." }
{ $errors "Throws an error if " { $snippet "seq2" } " contains elements not
permitted in sequences of the same class as " { $snippet "seq1" } "." } ;

If instead it were defined as
: append ( seq1|sequence seq2|sequence -- newseq|sequence )

We wouldn't need the $values element in the help at all in this case at least.

It would also make it easier to write generic code. You'd still have to think of good protocols to expose, and stuff like that; there's no silver bullet to make code automagically generic and reusable; but you'd pay less of a penalty for making code more flexible. Right now, the basic operations on sequences, such as nth and length are generic, but more complex ones, such as push, append, and so on, are not. This sucks if someone wants to implement a custom sequence type; a rope, for example; where append can be implemented more efficiently than the current implementation. While right now I could make append generic (along with all other sequence operations) in the future I won't need to. Instead you'll just be able to override any sequence operation, within reason. So this,
GENERIC: append ( seq1 seq2 -- newseq )

M: object append ... default implementation ... ;

Will collapse down to:
: append ( seq1 seq2 -- newseq ) ... default implementation ... ;


A final point to mention is hooks. Right now, core generics support a type of generic word known as a "hook". A hook is a generic word which dispatches on the value of a dynamically-scoped variable rather than a stack position. The multi-methods vocabulary generalizes this so that method definitions can be declared to dispatch on arbitrary variables. Well, when any word is able to be generic, you will be able to override any word with a new behavior that takes effect if a given variable has a certain type! This is insanely powerful and ripe for abuse, so the documentation will make it clear that this should only be done for words which are documented to have a contract that is compatible with hook overriding. In addition to being used where hooks are used now (I/O and UI backends, stuff like that) I believe this feature will be useful when unit testing code. If I'm working on a vocabulary, I don't need to explicitly make protocols for the sole purpose of mocking out some objects. I still need to write code in a style that's amenable to unit testing, but doing so becomes easier if the unit tests can specialize a word on a dynamic variable!

As far as efficiency goes, I don't anticipate any slowdowns there. The two generic word systems we have now already optimize the case where you have a generic word with a sole method on object. So the following two definitions compile to the exact same machine code:
! Version 1
: foo ... ;

! Version 2
GENERIC: foo

M: object foo ... ;

You only pay a price if you begin to specialize on different types. In addition, multiple dispatch will replace instances of hand-coded double dispatch; this will yield a performance boost. Finally, being able to declare input types on some words will not only be great for documentation purposes, it will allow the compiler's dispatch elimination machinery to kick it up a notch. Right now it only works within a single word, so the optimizer first has to inline a lot before it can pay off. This increases compile time. If it can optimize types across word boundaries, a whole new set of program analysis techniques are possible.

Let's put this another way. The following parsing words will be eliminated:
  • M:
  • GENERIC:
  • MEMO::
  • MACRO::
  • M::
  • GENERIC#
  • HOOK:

In return, you gain a huge amount of expressive power: multiple dispatch. Wow!

My main priority right now is still the web framework. One the new wiki, pastebin and blog are in good shape I will blog about them in detail and deploy them on factorcode.org. After things settle down with the new web site, I will return to language work and implement the above!

Cleave combinators and URL objects

Expressing intent with cleave combinators


Even though the cleave combinators (bi, tri, bi*, tri*, and various others) are equivalent to a chain of keeps, eg,
[ A ] [ B ] [ C ] tri == [ A ] keep [ B ] keep C

They shouldn't always be used as a substitute for keep and other forms of stack shuffling. This is because the cleave form expresses intention. In particular, they probably should not be used if the quotations take more than one input.

For example, here is a piece of code in cairo.gadgets:
M: png-gadget render*
path>> normalize-path cairo_image_surface_create_from_png
[ cairo_image_surface_get_width ]
[ cairo_image_surface_get_height 2array dup 2^-bounds ]
[ [ copy-surface ] curry copy-cairo ] tri
GL_BGRA render-bytes* ;

I would instead write it as follows:
M: png-gadget render*
path>> normalize-path cairo_image_surface_create_from_png
[
[ cairo_image_surface_get_width ]
[ cairo_image_surface_get_height ]
bi 2array dup 2^-bounds
]
[ [ copy-surface ] curry ] bi
GL_BGRA render-bytes* ;

While this is a bit longer I think it is easier to understand.

The reason I'm not interested in enforcing this with a type system, however, is because there are exceptions.

One exception is when the quotations passed to the cleaver have a 'pipeline' effect ( obj val -- obj ). Then I think it is pretty clear what is going on if you do this -- you're storing the same value into two slots:
[ >>a ] [ >> b ] bi

or this -- you're storing three values into slots:
[ >>a ] [ >>b ] [ >>c ] tri*

For example, suppose you have a word which take a string and parses it, outputting three values:
: parse-foo ( str -- a b c )

And you want to write a constructor for a tuple type:
: <bar> ( str d e -- bar )
bar new
swap >>e
swap >>d
swap parse-foo [ >>a ] [ >>b ] [ >>c ] tri* ;

A concrete example is the URL parsing code in the urls vocabulary. Here are a couple of pretty long (but hopefully straightforward) definitions which demonstrate this idiom:
: parse-host-part ( url protocol rest -- url string' )
[ >>protocol ] [
"//" ?head [ "Invalid URL" throw ] unless
"@" split1 [
[ ":" split1 [ >>username ] [ >>password ] bi* ] dip
] when*
"/" split1 [
parse-host [ >>host ] [ >>port ] bi*
] [ "/" prepend ] bi*
] bi* ;

M: string >url
<url> swap
":" split1 [ parse-host-part ] when*
"#" split1 [
"?" split1
[ url-decode >>path ]
[ [ query>assoc >>query ] when* ] bi*
]
[ url-decode >>anchor ] bi* ;

First-class URLs


Which brings me to my next topic, URLs. I'm a big fan of using the most specific data types possible for everything: timestamps instead of integers with the implicit convention that they store milliseconds since an epoch, time durations instead of integers (distinct from timestamps), symbols instead of magic strings and numbers, specialized arrays where possible, and so on. I also think that path names, money and dimensioned units should be first-class, instead of strings and numbers, respectively; while we have libraries for those latter three they're not used as much as the former (a shame, really).

While working on the routing and redirection part of Factor's web framework, I noticed a lot of boilerplate coming up that I couldn't easily abstract away. The natural solution seemed to be to promote URLs to a first-class data type instead of passing strings around.

URLs are very easy to use. First you have literal URLs:
URL" http://www.apple.com"

You can also construct them from strings at runtime:
"http://www.apple.com/foo.html#blah" >url

Or you can construct them from components:
<url>
"ftp" >>protocol
"ftp.cdrom.com" >>host
"/pub" >>path

Setting query parameters is supported, and the URL code takes care of URL encoding and decoding for you:
( scratchpad ) URL" http://search.yahooapis.com/WebSearchService/V1/webSearch?query=Hello+World"
( scratchpad ) "query" query-param .
"Hello World"

Likewise, you can set query parameters with set-query-param.

URLs support two main operations: converting back to a string, and deriving a URL from a base URL and a relative URL.

For converting URLs to a string, I decided to make a new generic word and put a method on that for URLs, instead of defining a word specific to URLs. The word is called present and it is in the present vocabulary. It has methods for the following types, so far:
  • strings
  • numbers
  • words
  • timestamps
  • URLs

Unlike unparse, which converts an object into source code, present outputs a string which is more human-readable, and the process is possibly lossy. For example, strings present as themselves, and numbers present as their string representation. Timestamps present in RFC822 format.

Deriving a URL from a base URL and a relative URL allows you to take a URL with some components possibly missing, and fill them in from a base URL. The web framework uses this to ensure that all HTTP redirects sent to the client are absolute.

I refactored Daniel Ehrenberg's yahoo library to use URL objects for constructing the search query. The code is really concise now; just build a URL, pass it to http-get, and parse the XML result to build an array of search result objects. Parsing the XML only takes 5 lines of code!

Sometime soon, Doug Coleman will update his FTP client library to use URL objects also, at which point we can define a protocol for reading and writing generic URLs, allowing code to be written which can deal with both FTP, HTTP, and HTTPS abstractly.

Friday, June 06, 2008

Persistent vectors

I've been working on the new HTTP server and web framework for what seems like ages now. I used it to implement a new pastebin, a wiki (webapps/wiki in the repository), and today I started working on a blog server. However, all this relational database CRUD can get pretty monotonous, so I decided to work on something a little more fun: porting Clojure's PersistentVector to Factor.

We have implementations of AVL trees and splay trees in the library which look pretty ugly, however they date back several years before we had some nice abstractions and idioms worked out, and the code could definitely be revisited and improved. Recently Eric Mertens implemented disjoint-sets; the code came out looking pretty nice, however though two words use locals. This got me interested in implementing data structures in Factor again.

The Java version weighs in at 210 lines, and the Factor version is 184 lines, with a few more blank lines and shorter lines in general. It is in the git repository now, and you can browse it online.

The code is pretty decent as far as Java code goes, however porting it to Factor presented two main challenges.

Some of the recursive functions used in the implementation take a large number of parameters. While Factor supports locals, I wanted to write this code entirely using the stack, to see how suitable stack languages are for implementing complex data structures.

I found that literal translations of the Java code looked pretty ugly, until I figured out a drastic simplification. The shift parameter, which is passed around in the Java code to store the current nesting level, can be stored in the tree node itself.

Here is the Java code for getting the ith element of a vector:
public Object nth(int i){
if(i >= 0 && i < cnt)
{
if(i >= tailoff())
return tail[i & 0x01f];
Object[] arr = root;
for(int level = shift; level > 0; level -= 5)
arr = (Object[]) arr[(i >>> level) & 0x01f];
return arr[i & 0x01f];
}
throw new IndexOutOfBoundsException();
}

In Factor, it looks like this:
: node-size 32 ; inline

: node-mask node-size mod ; inline

: node-shift -5 * shift ; inline

: node-nth ( i node -- obj )
[ node-mask ] [ children>> ] bi* nth ; inline

: body-nth ( i node -- i node' )
dup level>> [
dupd [ level>> node-shift ] keep node-nth
] times ; inline

: tail-offset ( pvec -- n )
[ count>> ] [ tail>> children>> length ] bi - ;

M: persistent-vector nth-unsafe
2dup tail-offset >=
[ tail>> ] [ root>> body-nth ] if
node-nth ;

It is a bit longer because of the utility words at the top - I chose to abstract out the node size instead of hard-coding it at 32 - and because nodes are pairs here, a sequence of children and an integer level.

The Java code also makes heavy use of early returns and out parameters. Factor supports early returns via continuations (recently, I added a with-return combinator and a return word to encapsulate this idiom) but I wanted to write the code in a functional style. And of course out parameters are possible too, but since we can just push two values on the stack before leaving a word, they make no sense. So I had to spend some time analyzing the code's control flow and thinking about how to express it in a more functional style.

For example, consider this Java code from Clojure:
public PersistentVector cons(Object val){
if(tail.length < 32)
{
Object[] newTail = new Object[tail.length + 1];
System.arraycopy(tail, 0, newTail, 0, tail.length);
newTail[tail.length] = val;
return new PersistentVector(meta(), cnt + 1, shift, root, newTail);
}
Box expansion = new Box(null);
Object[] newroot = pushTail(shift - 5, root, tail, expansion);
int newshift = shift;
if(expansion.val != null)
{
newroot = new Object[]{newroot, expansion.val};
newshift += 5;
}
return new PersistentVector(meta(), cnt + 1, newshift, newroot, new Object[]{val});
}

private Object[] pushTail(int level, Object[] arr, Object[] tailNode, Box expansion){
Object newchild;
if(level == 0)
{
newchild = tailNode;
}
else
{
newchild = pushTail(level - 5, (Object[]) arr[arr.length - 1], tailNode, expansion);
if(expansion.val == null)
{
Object[] ret = arr.clone();
ret[arr.length - 1] = newchild;
return ret;
}
else
newchild = expansion.val;
}
//expansion
if(arr.length == 32)
{
expansion.val = new Object[]{newchild};
return arr;
}
Object[] ret = new Object[arr.length + 1];
System.arraycopy(arr, 0, ret, 0, arr.length);
ret[arr.length] = newchild;
expansion.val = null;
return ret;
}

There are several early return statements in there, also an out parameter is used. Since the same Box instance is threaded through the recursion, a store into the box in an inner recursive call will propagate outwards. It took me a bit of thought to figure out the flow of data through this box.

The Factor code for the above operation:
: node-add ( obj node -- node' )
clone [ ppush ] change-children ;

: ppush-tail ( obj pvec -- pvec' )
[ node-add ] change-tail ;

: full? ( node -- ? )
children>> length node-size = ;

: 1node ( obj level -- node )
node new
swap >>level
swap 1array >>children ;

: 2node ( first second -- node )
[ 2array ] [ drop level>> 1+ ] 2bi node boa ;

: new-child ( new-child node -- node' expansion/f )
dup full? [ tuck level>> 1node ] [ node-add f ] if ;

: new-last ( val seq -- seq' )
[ length 1- ] keep new-nth ;

: node-set-last ( child node -- node' )
clone [ new-last ] change-children ;

: (ppush-new-tail) ( tail node -- node' expansion/f )
dup level>> 1 = [
new-child
] [
tuck children>> peek (ppush-new-tail)
[ swap new-child ] [ swap node-set-last f ] ?if
] if ;

: do-expansion ( pvec root expansion/f -- pvec )
[ 2node ] when* >>root ;

: ppush-new-tail ( obj pvec -- pvec' )
[ ] [ tail>> ] [ root>> ] tri
(ppush-new-tail) do-expansion
swap 0 1node >>tail ;

M: persistent-vector ppush ( obj pvec -- pvec' )
clone
dup tail>> full?
[ ppush-new-tail ] [ ppush-tail ] if
[ 1+ ] change-count ;

Again, here we have to deal with the fact that nodes are (sequence,level) pairs and not simple sequences.

One trick I'm using here is defining generic persistent sequence operations, ppush (like push but persistent), and new-nth (like set-nth but persistent). They are implemented naively on ordinary sequences, and the persistent vector implementation relies on the ordinary sequence implementation as the base case; when you ppush onto a persistent vector, eventually you end up ppushing onto the sequence of children stored in some node; this sequence is just a Factor array.

The Java code, which creates new constituents then constructs a new persistent vector at the end. This works well in Java. In Factor, I instead clone the persistent vector I am given (this is a shallow clone) and modify some of its slots. Same result, but simpler code because there are less values floating around.

I have a Factor word which outputs an empty persistent vector:
: pempty ( -- pvec )
T{ persistent-vector f 0 T{ node f { } 1 } T{ node f { } 0 } } ; inline

Using this word and ppush, we can convert any Factor sequence into a persistent vector:
pempty [ swap ppush ] reduce

We can then implement literal syntax for persistent vectors:
M: persistent-vector like
drop pempty [ swap ppush ] reduce ;

: >persistent-vector ( seq -- pvec ) pempty like ; inline

: PV{ \ } [ >persistent-vector ] parse-literal ; parsing

M: persistent-vector pprint-delims drop \ PV{ \ } ;

M: persistent-vector >pprint-sequence ;

Here is an example showing both reader syntax and prettyprinter output resulting from the above definitions:
( scratchpad ) PV{ "hello" "world" } ppop "Clojure" swap ppush .
PV{ "hello" "Clojure" }

Implementing and playing around with persistent vectors was definitely fun and educational. Most of the time I work with very basic data structures -- arrays, hashtables -- and implementing something a bit more complex was a good Factor coding exercise. I think the code could be cleaned up with some more abstractions, better factoring and new idioms. The cleave combinators we added recently definitely helped a lot here and without them the code would look a lot uglier than it does now. There is room for further improvement. I also plan on implementing Clojure's PersistentHashMaps and PersistentSortedMaps at some point to see what they look like in Factor.