Tuesday, March 11, 2008

Faster serialization

I figured out a nice trick that sped up Factor's serialization library. Because serialization is required to preserve circular and shared structure, it would previously store all objects serialized so far in a sequence. When a new object was encountered, it would scan the sequence for an object eq? to the one being serialized. This made serialization quadratic in the number of objects being serialized, which is unacceptable. While Factor's hashtables always check for equality using = and not eq?, I was able to simulate eq? hashtables by defining a new data type.

The new type is a tuple which holds a reference to a single object:
TUPLE: id obj ;

C: <id> id

The hashcode of an id is the hashcode of the underlying object:
M: id hashcode* obj>> hashcode* ;

The equality test, however, checks if the underlying objects are eq?:
M: id equal? over id? [ [ obj>> ] 2apply eq? ] [ 2drop f ] if ;

That is, two distinct instances of id are equal if their underlying objects are identical (got it?).

Now, we just have to replace two words to use a hashtable with id keys instead of a linear search of a sequence:
: add-object ( obj -- )
#! Add an object to the sequence of already serialized
#! objects.
serialized get [ assoc-size swap ] keep set-at ;

: object-id ( obj -- id )
#! Return the id of an already serialized object
serialized get at ;

True id hashtables will be added at some point in the future, but for now this will suffice.

1 comment:

Daniel Ehrenberg said...

Who needs "true" identity hashtables? We can make an "identity assoc" wrapper, which takes an exemplar and makes an identity assoc that hides the id underneath. This would work for not just hashtables but also BSTs and alists.