Friday, June 06, 2008

Persistent vectors

I've been working on the new HTTP server and web framework for what seems like ages now. I used it to implement a new pastebin, a wiki (webapps/wiki in the repository), and today I started working on a blog server. However, all this relational database CRUD can get pretty monotonous, so I decided to work on something a little more fun: porting Clojure's PersistentVector to Factor.

We have implementations of AVL trees and splay trees in the library which look pretty ugly, however they date back several years before we had some nice abstractions and idioms worked out, and the code could definitely be revisited and improved. Recently Eric Mertens implemented disjoint-sets; the code came out looking pretty nice, however though two words use locals. This got me interested in implementing data structures in Factor again.

The Java version weighs in at 210 lines, and the Factor version is 184 lines, with a few more blank lines and shorter lines in general. It is in the git repository now, and you can browse it online.

The code is pretty decent as far as Java code goes, however porting it to Factor presented two main challenges.

Some of the recursive functions used in the implementation take a large number of parameters. While Factor supports locals, I wanted to write this code entirely using the stack, to see how suitable stack languages are for implementing complex data structures.

I found that literal translations of the Java code looked pretty ugly, until I figured out a drastic simplification. The shift parameter, which is passed around in the Java code to store the current nesting level, can be stored in the tree node itself.

Here is the Java code for getting the ith element of a vector:
public Object nth(int i){
if(i >= 0 && i < cnt)
{
if(i >= tailoff())
return tail[i & 0x01f];
Object[] arr = root;
for(int level = shift; level > 0; level -= 5)
arr = (Object[]) arr[(i >>> level) & 0x01f];
return arr[i & 0x01f];
}
throw new IndexOutOfBoundsException();
}

In Factor, it looks like this:
: node-size 32 ; inline

: node-mask node-size mod ; inline

: node-shift -5 * shift ; inline

: node-nth ( i node -- obj )
[ node-mask ] [ children>> ] bi* nth ; inline

: body-nth ( i node -- i node' )
dup level>> [
dupd [ level>> node-shift ] keep node-nth
] times ; inline

: tail-offset ( pvec -- n )
[ count>> ] [ tail>> children>> length ] bi - ;

M: persistent-vector nth-unsafe
2dup tail-offset >=
[ tail>> ] [ root>> body-nth ] if
node-nth ;

It is a bit longer because of the utility words at the top - I chose to abstract out the node size instead of hard-coding it at 32 - and because nodes are pairs here, a sequence of children and an integer level.

The Java code also makes heavy use of early returns and out parameters. Factor supports early returns via continuations (recently, I added a with-return combinator and a return word to encapsulate this idiom) but I wanted to write the code in a functional style. And of course out parameters are possible too, but since we can just push two values on the stack before leaving a word, they make no sense. So I had to spend some time analyzing the code's control flow and thinking about how to express it in a more functional style.

For example, consider this Java code from Clojure:
public PersistentVector cons(Object val){
if(tail.length < 32)
{
Object[] newTail = new Object[tail.length + 1];
System.arraycopy(tail, 0, newTail, 0, tail.length);
newTail[tail.length] = val;
return new PersistentVector(meta(), cnt + 1, shift, root, newTail);
}
Box expansion = new Box(null);
Object[] newroot = pushTail(shift - 5, root, tail, expansion);
int newshift = shift;
if(expansion.val != null)
{
newroot = new Object[]{newroot, expansion.val};
newshift += 5;
}
return new PersistentVector(meta(), cnt + 1, newshift, newroot, new Object[]{val});
}

private Object[] pushTail(int level, Object[] arr, Object[] tailNode, Box expansion){
Object newchild;
if(level == 0)
{
newchild = tailNode;
}
else
{
newchild = pushTail(level - 5, (Object[]) arr[arr.length - 1], tailNode, expansion);
if(expansion.val == null)
{
Object[] ret = arr.clone();
ret[arr.length - 1] = newchild;
return ret;
}
else
newchild = expansion.val;
}
//expansion
if(arr.length == 32)
{
expansion.val = new Object[]{newchild};
return arr;
}
Object[] ret = new Object[arr.length + 1];
System.arraycopy(arr, 0, ret, 0, arr.length);
ret[arr.length] = newchild;
expansion.val = null;
return ret;
}

There are several early return statements in there, also an out parameter is used. Since the same Box instance is threaded through the recursion, a store into the box in an inner recursive call will propagate outwards. It took me a bit of thought to figure out the flow of data through this box.

The Factor code for the above operation:
: node-add ( obj node -- node' )
clone [ ppush ] change-children ;

: ppush-tail ( obj pvec -- pvec' )
[ node-add ] change-tail ;

: full? ( node -- ? )
children>> length node-size = ;

: 1node ( obj level -- node )
node new
swap >>level
swap 1array >>children ;

: 2node ( first second -- node )
[ 2array ] [ drop level>> 1+ ] 2bi node boa ;

: new-child ( new-child node -- node' expansion/f )
dup full? [ tuck level>> 1node ] [ node-add f ] if ;

: new-last ( val seq -- seq' )
[ length 1- ] keep new-nth ;

: node-set-last ( child node -- node' )
clone [ new-last ] change-children ;

: (ppush-new-tail) ( tail node -- node' expansion/f )
dup level>> 1 = [
new-child
] [
tuck children>> peek (ppush-new-tail)
[ swap new-child ] [ swap node-set-last f ] ?if
] if ;

: do-expansion ( pvec root expansion/f -- pvec )
[ 2node ] when* >>root ;

: ppush-new-tail ( obj pvec -- pvec' )
[ ] [ tail>> ] [ root>> ] tri
(ppush-new-tail) do-expansion
swap 0 1node >>tail ;

M: persistent-vector ppush ( obj pvec -- pvec' )
clone
dup tail>> full?
[ ppush-new-tail ] [ ppush-tail ] if
[ 1+ ] change-count ;

Again, here we have to deal with the fact that nodes are (sequence,level) pairs and not simple sequences.

One trick I'm using here is defining generic persistent sequence operations, ppush (like push but persistent), and new-nth (like set-nth but persistent). They are implemented naively on ordinary sequences, and the persistent vector implementation relies on the ordinary sequence implementation as the base case; when you ppush onto a persistent vector, eventually you end up ppushing onto the sequence of children stored in some node; this sequence is just a Factor array.

The Java code, which creates new constituents then constructs a new persistent vector at the end. This works well in Java. In Factor, I instead clone the persistent vector I am given (this is a shallow clone) and modify some of its slots. Same result, but simpler code because there are less values floating around.

I have a Factor word which outputs an empty persistent vector:
: pempty ( -- pvec )
T{ persistent-vector f 0 T{ node f { } 1 } T{ node f { } 0 } } ; inline

Using this word and ppush, we can convert any Factor sequence into a persistent vector:
pempty [ swap ppush ] reduce

We can then implement literal syntax for persistent vectors:
M: persistent-vector like
drop pempty [ swap ppush ] reduce ;

: >persistent-vector ( seq -- pvec ) pempty like ; inline

: PV{ \ } [ >persistent-vector ] parse-literal ; parsing

M: persistent-vector pprint-delims drop \ PV{ \ } ;

M: persistent-vector >pprint-sequence ;

Here is an example showing both reader syntax and prettyprinter output resulting from the above definitions:
( scratchpad ) PV{ "hello" "world" } ppop "Clojure" swap ppush .
PV{ "hello" "Clojure" }

Implementing and playing around with persistent vectors was definitely fun and educational. Most of the time I work with very basic data structures -- arrays, hashtables -- and implementing something a bit more complex was a good Factor coding exercise. I think the code could be cleaned up with some more abstractions, better factoring and new idioms. The cleave combinators we added recently definitely helped a lot here and without them the code would look a lot uglier than it does now. There is room for further improvement. I also plan on implementing Clojure's PersistentHashMaps and PersistentSortedMaps at some point to see what they look like in Factor.

No comments: