Monday, November 28, 2005

New hashtable implementation brings 400kb reduction in image size

I put Factoroids aside for now. I got it in a further state than shown in the screenshot, but I wanted to return to Factor hacking.

I reimplemented the hashtable to use open addressing instead of bucket chaining. With bucket chaining, the hashtable is backed by an array of buckets; keys are sorted by hashcode into buckets, where each bucket is an association list. With open addressing, the hashcode determines the starting index in the array; if the index is occupied by another key, the next index is searched, and so on.

On a 32-bit system, the old implementation, each key/value pair stored in a hashtable required two cons cells -- that's 16 bytes -- and each bucket required 4 bytes of storage. Since the number of buckets was kept at roughly the same as the number of key/value pairs, that is 20 bytes per hashtable entry. With the new implementation, a hashtable entry is just a pair of consecutive array elements -- 8 bytes. The underlying arrays are kept larger than before, so the end result is a savings of not quite 12 bytes per key/value pair, but enough to reduce the image size by 400Kb, and the time to do a full garbage collection by 10ms (on a 1.25 Ghz PowerPC G4).

There were some changes to the hashtable words as a result.

hash* ( key hash -- [[ key value ]] ) is now hash* ( key hash -- value ? ). Unlike hash ( key hash -- value ), this word can detect the difference between a value of f, and a key that is not present at all. However, the old stack effect was not only clumsy -- most callers would call cdr on it after checking if it is not f; but it makes no sense with the new implementation, and would require allocating a cons cell. So the new implementation returns the value, but also pushes a boolean that denotes if the key is in the hashtable.

Along similar lines, the hashtable combinators that would formely pass a cons holding a key/value pair to their quotation now pass the key and value separately on the stack:

Also, association lists are, for all intents and purposes, gone. An association list is a list of cons cells where the car is a key and the cdr is a value. They were used in a few places as "lighter weight" mappings than hashes. However, now there's no reason to use them anymore. This means the following words are now gone:

One final change is the literal syntax. Here is the new syntax:
{ "garlic bread" "appetizer" }
{ "steak" "main" }
{ "ice cream" "desert" }

The difference is that key/value pairs are entered with { ... } syntax, not [[ ... ]]. I always thought the old syntax looked kind of cluttered, and now the reason to use cons cells here is gone.

The other thing I've worked on lately is fixing some compiler bugs; one that was reported recently, and a handful of long-standing ones. All of them turned out to be related by a common thread: inlined recursive words. This is always tricky because the compiler must take care to note which values remain constant between recursive calls, and which do not. While the idea is simple there are a lot of tricky details, and we had a number of problems where values would be assumed to be literal when they were not, resulting in inappropriate optimizations being performed.

In other news, I now have an dual core AMD64 machine. I will begin porting Factor to this architecture very soon. I know a few people have been looking forward to running Factor on their AMD64 machines, and I hope to make them happy.

No comments: