Sunday, January 21, 2007

Removing duplicates, boolean arrays, ARM

Comp Sci lesson learned: For some reason I was always under the impression that one cannot remove duplicates from an array while preserving order in O(n) time. This is why Factor had two words, prune and hash-prune.

The former preserved order and ran in quadratic time; it would loop over the input sequence, and see if each element was already present in an accumulator, if not, it would add it to the accumulator.

The latter did not preserve order and ran in linear time; it would add all elements as keys in a new hashtable, then output the sequence of keys.

Well it turns out there's a way to remove duplicates while keeping order in O(n) time. The idea is that as you inspect each element of the input sequence in turn, you check if it is present in a hashtable; if it is, you keep going, otherwise you push the element on an accumulator and add it to the hashtable. I guess this is obvious to most people but for some reason I was not aware of this algorithm and did not even think this is possible. The implementation is simple enough, and I changed the prune word to use this algorithm, and removed hash-prune:
: (prune) ( hash vec elt -- )
rot 2dup hash-member?
[ 3drop ] [ dupd dupd set-hash swap push ] if ; inline

: prune ( seq -- newseq )
dup length <hashtable> over length <vector>
rot [ >r 2dup r> (prune) ] each nip ;

A new data structure: The Unix native I/O code has had some words for managing a space-efficient bit array for quite some time. These words were not really useful outside of Unix I/O, and were not safe either. I cleaned them up and made them more useful; now Factor has a boolean-array class which supports the sequence protocol and behaves just like an array, except it can only store t or f, and it uses much less space. Dan asked for this to make the XML parser more efficient, and it was about time Factor got a boolean array type, anyway.

ARM: Doug Coleman got me a Netstix ARM computer as an early birthday present. Thanks Doug! I was going to buy one anyway when I felt like porting Factor to ARM, but I guess he really wanted an ARM port now :-) So I've decided to make this port a priority for 0.88. I'm still finishing some compiler changes and setting up the software on the Netstix, but once that is done I will start working on the port. Running on ARM will be nice, because it will open up the world of cellphones and handhelds to Factor. Especially because Factor is both high level and fast.


Anonymous said...

Just curious, but how much memory would a hashtable need if the data was 100 integers with values in the range 0 - 999, and is it possible that for small datasets the quadratic method would actually be faster?

Also, how do you calculate the complexity of the hashtable operations - does it make any guesses at the number of clashes.


Slava Pestov said...

On a 32 bit system, a hashtable with 100 integers consumes about 1600 bytes of memory. Every key/value pair uses 8 bytes, but a hashtable is sized at 50% capacity for optimal hashing.

I assume that hashtable lookup is O(1) with a good hashing function. Poor hashing will degrade performance back down to quadratic time.

Anonymous said...

That looks like not much code. I wonder what the code in C, C++ or Java would look like.

What could be the reason for this?

1. Multiple function calls per line.
2. No explicit specification of parameters to functions - outputs feed straight into inputs (harder to read coz you can't immediately see what's being passed to the function being called if it's not on screen?).
3. Loop constructs all on one line - less readable at first glance than C style code?
4. Maybe factor/forth lends itself to having lots of reusable little words - is this true ??