Saturday, May 24, 2008

Removing duplicates from a sequence

Albert van der Horst posted to comp.lang.forth about an "elementary but surprisingly difficult problem"; here's a snippet:

An area giving as (addr_start, addr_end) contains items
of length ``len''. I want to get of the duplicates.

Sorting the items is pretty easy, qsort can handle that.
Now eliminating the duplicates, we need only compare one
item to the next.
Something along compact (start, end, len -- end' )
The new end should be returned, of course.

Even using variables/locals, I find this difficult.


The problem with Forth is that while it gives you all the building blocks needed to make high-level abstractions, it doesn't actually give you any pre-packaged ones and there's no culture of code sharing or library development in the Forth community either, so everybody has to reinvent basic sequence processing operations. In contrast, Factor's collection library makes this problem trivial.

Suppose S is our sequence, already sorted. First of all, S[0] is always in the resulting sequence, R. For n > 1, the S[n] is in the result R only if S[n] != S[n-1]. Suppose we make a new sequence, T, such that T[n] is the pair (S[n],S[n+1]). Then we can get our R by removing all pairs from T such that both elements are equal, and then mapping over the result to take the second element form each pair, and finally prepending S[0] to this.

Here is the Factor implementation:
[ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix

No variables, locals or stack shuffling needed.

Here is how the code works:

2 clump takes a sequence on the stack and gives a new sequence of '2 element clumps':

( scratchpad ) { 1 2 2 3 4 4 4 5 6 } 2 clump .
{
{ 1 2 }
{ 2 2 }
{ 2 3 }
{ 3 4 }
{ 4 4 }
{ 4 4 }
{ 4 5 }
{ 5 6 }
}

Next, all-equal? checks if all elements in an array are equal. We want to iterate over the above sequence of pairs, discarding those where both elements are equal; this is what [ all-equal? not ] filter achieves:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter .
{ { 1 2 } { 2 3 } { 3 4 } { 4 5 } { 5 6 } }

Next, we want to take the second element of each pair, [ second ] map:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter [ second ] map .
{ 2 3 4 5 6 }

Finally, we use bi because we do two things to the array; we first find the second element of each non-duplicate clump, then we take the first element of the original array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi . .
1
{ 2 3 4 5 6 }

Now that we have two things on the stack, in the right order, we call prefix to add that element to the original array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ 1 2 3 4 5 6 }

Of course, this approach allocates some unnecessary intermediate structure. However, it took me about 20 seconds to write, and it worked the first time, I didn't have to mess around with variables, locals, or stack operations.

It works with other data types also:
( scratchpad ) { "a man" "a man" "a plan" "a canal" "panama" "panama" }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ "a man" "a plan" "a canal" "panama" }


Update: Daniel Ehrenberg came up with an even shorter version; figuring it out is an exercise for the reader: [ 2 clump [ = not ] assoc-filter values ] [ first ] bi prefix.

Now, removing duplicates is so common that Factor already has a word in the sets vocabulary named prune which does exactly this, but with a different algorithm:

If you sort then iterate, you get an O(n log n) algorithm, and also it depends on there being an intrinsic order among the elements. Instead, you can use a hashtable; if you have good hashcodes, then in theory you get an O(n) algorithm (but the constant factors might be worse). The idea is that instead you iterate over your sequence, and check if each element is in the hashtable. If it's not there, it's a new element, and you flag it as such then add it to the hashtable. If it's already in the hashtable, you skip it.

3 comments:

Anonymous said...

what do assoc-filter word do ?

Slava Pestov said...

\ assoc-filter help

Unknown said...

What if the incoming sequence is not sorted? That was the original problem description.