Wednesday, August 06, 2008

Persistent hashtables

Long time no blog! I've been very busy (with Factor), and there has been a lot of activity lately. I've been working flat-out on the new compiler, which I will blog about later.

Yesterday and today, though, I worked on porting Clojure's PersistentHashMap to Factor. We already had persistent vectors; the hashtables are more complex. My implementation is pretty much a direct port of the Java code, and in some parts it uses locals heavily. I plan on revisiting it at some point and using more Factor idioms. Even though the code is not idiomatic, it is still pretty easy to follow, and Factor's direct support for multiple return values eliminated the need for a temporary 'box' object.
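For readers unfamiliar with the data structure, here is a rough sketch of the hash-trie idea behind PersistentHashMap, written in Python rather than Factor purely for illustration (this is a simplified, hypothetical version, not the actual port; full-hash collisions and node compression are omitted for brevity). Each trie level consumes 5 bits of the key's hash to pick one of 32 children, and updating copies only the nodes on the path to the change, so older versions stay valid:

```python
# Simplified persistent hash trie (illustrative sketch, not the real code).
# Each level branches 32 ways on 5 bits of the hash; 'assoc' copies only
# the path from the root to the changed slot (structural sharing).

class Leaf:
    __slots__ = ("key", "value")
    def __init__(self, key, value):
        self.key, self.value = key, value

class Node:
    __slots__ = ("children",)
    def __init__(self, children):
        self.children = children  # tuple of 32 slots: None, Leaf, or Node

EMPTY = Node((None,) * 32)

def assoc(node, key, value, shift=0):
    """Return a new trie with key -> value set; the old trie is unchanged."""
    idx = (hash(key) >> shift) & 0x1F
    child = node.children[idx]
    if child is None:
        new_child = Leaf(key, value)
    elif isinstance(child, Leaf):
        if child.key == key:
            new_child = Leaf(key, value)  # replace existing entry
        else:
            # Two keys share these hash bits: push both one level down.
            # (Keys with identical full hashes are not handled in this sketch.)
            sub = assoc(EMPTY, child.key, child.value, shift + 5)
            new_child = assoc(sub, key, value, shift + 5)
    else:
        new_child = assoc(child, key, value, shift + 5)
    children = list(node.children)
    children[idx] = new_child
    return Node(tuple(children))  # only the spine is copied

def lookup(node, key, shift=0):
    idx = (hash(key) >> shift) & 0x1F
    child = node.children[idx]
    if child is None:
        return None
    if isinstance(child, Leaf):
        return child.value if child.key == key else None
    return lookup(child, key, shift + 5)
```

Because `assoc` returns a new root while sharing all untouched subtrees, every older version of the map remains usable, which is exactly the property the compiler work below needed.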

My initial tests seem to indicate persistent hashtables are 2-4x slower than normal hashtables for common operations. This isn't so bad, since it's only a constant-factor slowdown, and you gain immutability, but it is still a bit disappointing. Perhaps they can be optimized further.

The reason I wanted to implement persistent hashtables in the first place is to use them in the new Factor compiler. Instead of cloning a large hashtable repeatedly in one place, which I suspected was leading to quadratic algorithmic complexity, I wanted to use persistent hashtables so that I could add entries in constant time while still retaining older versions for backtracking.
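To make the complexity difference concrete, here is an illustrative Python sketch (hypothetical code, not the compiler's actual data flow): snapshotting a growing mutable table by cloning copies all n entries each time, which is quadratic over n updates, while a persistent map makes each snapshot a constant-time reference:

```python
# Illustrative sketch of the cloning cost described above (not compiler code).

def snapshots_by_cloning(n):
    """n updates, keeping a snapshot before each one: O(n^2) total copying."""
    history, current = [], {}
    for i in range(n):
        history.append(dict(current))  # full O(len) copy at every step
        current[i] = i
    return history

# With a persistent hashtable, keeping a snapshot is just keeping a reference,
# since 'assoc' never disturbs older versions:
#     history.append(current)           # O(1): no copy
#     current = assoc(current, i, i)    # copies only the path to the change
```

The cloning version does 0 + 1 + 2 + ... + (n - 1) entry copies in total, which is the quadratic behavior suspected above; the persistent version replaces each clone with a single reference plus a logarithmic-depth path copy.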

It turns out I found another way to get the performance gain I wanted, and so persistent hashtables are unused in the new compiler for now.

However, they were fun to implement, and they're now part of the basis library, an intermediate layer between core and extra, in the persistent.hashtables vocabulary. They sit alongside persistent.vectors and persistent.heaps, the latter of which was implemented by Daniel Ehrenberg.

The new basis vocabulary root will contain libraries which are not fundamentally part of the language core, and are not required for bootstrap, but that anyone would expect a language implementation to provide. Examples include various data structures, I/O, XML parsing, database access, the GUI toolkit, the web framework, and so on. I also moved the optimizing compiler to basis, since it's not required for bootstrap either.

In order for a vocabulary to be moved to basis, it must implement some frequently used functionality, and have high-quality documentation, a robust and optimized implementation, and as many unit tests as possible. Over the next few months, I hope that more vocabularies are moved to basis and that the code quality there only goes up.


David R. MacIver said...

For what it's worth, my implementation of a persistent hash table in Scala is showing about the same degree of slowdown in comparison to mutable hash tables, and it's a completely different implementation. (Mine is essentially a mapping of hash codes to association lists, where the mapping is based on Okasaki and Gill's "Fast Mergeable Integer Maps"; essentially a specialised binary trie.)
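The shape David describes could be sketched like this in Python (a hypothetical simplification: Okasaki and Gill's structure is a Patricia trie that skips runs of common bits, whereas this version branches on every bit of the hash, with an association list at each leaf to absorb full-hash collisions):

```python
# Persistent binary trie keyed on hash bits, association lists at the leaves.
# Nodes are tuples: ("leaf", hash, pairs) or ("branch", left, right).
# Illustrative sketch only; not David's Scala implementation.

def insert(node, key, value, h=None, depth=0):
    """Return a new trie with key -> value set; the old trie is unchanged."""
    if h is None:
        h = hash(key)
    if node is None:
        return ("leaf", h, ((key, value),))
    if node[0] == "leaf":
        _, lh, pairs = node
        if lh == h:
            # Same hash code: replace or extend the association list.
            kept = tuple((k, v) for k, v in pairs if k != key)
            return ("leaf", h, kept + ((key, value),))
        # Different hash: push the old leaf down one level and retry.
        bit = (lh >> depth) & 1
        branch = ("branch", node, None) if bit == 0 else ("branch", None, node)
        return insert(branch, key, value, h, depth)
    _, left, right = node
    if (h >> depth) & 1 == 0:
        return ("branch", insert(left, key, value, h, depth + 1), right)
    return ("branch", left, insert(right, key, value, h, depth + 1))

def find(node, key, h=None, depth=0):
    if h is None:
        h = hash(key)
    if node is None:
        return None
    if node[0] == "leaf":
        _, lh, pairs = node
        if lh != h:
            return None
        for k, v in pairs:
            if k == key:
                return v
        return None
    _, left, right = node
    child = left if (h >> depth) & 1 == 0 else right
    return find(child, key, h, depth + 1)
```

Since every node is an immutable tuple, updates share all untouched subtrees, giving the same persistence property as the 32-way trie, just with a deeper, narrower tree.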

Some of it I don't think can be optimised further. In particular, I think updates are always going to have a slowdown of around that amount due to simple allocation bottlenecks. Lookup (get) performance can be improved by flattening the underlying trees, but that will result in a larger copying overhead every time you update. On the other hand, there were a couple of things I encountered in the course of implementing where I thought "Ok, this is as optimised as it's going to get" and half an hour later it was five times faster. :-)

One thing I've noticed, and you might too, is that bulk operations on the tree based immutable maps tend to be faster. In particular transformations, filters, iteration, etc. all seem to get a significant speedup. It's a little surprising.

David R. MacIver said...

Hm. On reading further, Clojure's approach is based on "Hashing a key and storing the key in a trie based on this hash value". That sounds awfully familiar after all. :-)