Monday, December 15, 2008

Prevalence of shuffle words and dataflow combinators

Today on IRC, binarybandit wrote this word:
: pie-chart ( keys values -- url )
    [ "|" join ] [ [ number>string ] map "," join ] bi*
    "http://chart.apis.google.com/chart?cht=p&chs=500x250&chl=%s&chd=t:%s" sprintf ;

(Side note: it uses the recently contributed printf vocabulary, which is a nice piece of work: it parses the format string using PEGs and compiles it to quotations at compile time.)

This inspired me to make some pie charts showing the frequency of usages of shuffle words and dataflow combinators in the Factor library. So first, we need a word which takes a sequence of words and counts the number of usages of each, normalizing the results so that their sum is 1:
: usage-histogram ( words -- keys values )
    [ [ name>> ] [ usage length ] bi ] { } map>assoc
    dup values sum '[ _ /f ] assoc-map
    sort-values unzip ;
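
To make the normalization step concrete, here is a minimal sketch that runs the last two lines of usage-histogram on a hand-written association list. The counts are made up for illustration and are not real usage numbers, and the USING: line assumes a reasonably current Factor image (vocabulary names may differ in older versions).

```factor
USING: assocs fry kernel math prettyprint sequences sorting ;

! Hypothetical counts, chosen only for illustration.
{ { "dup" 6 } { "drop" 3 } { "swap" 1 } }
dup values sum '[ _ /f ] assoc-map   ! divide each count by the total (10)
sort-values unzip                    ! sort ascending by share, split into keys and values
[ . ] bi@                            ! print the keys, then the values
```

With these counts the keys come out ordered swap, drop, dup, and the shares are 0.1, 0.3 and 0.6, which sum to 1.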

Now we can use a meta-programming trick to extract the list of shufflers from the shuffle words help article:
"shuffle-words" >link uses [ word? ] filter usage-histogram pie-chart print

Here is the resulting chart:

Note that it counts the number of words that use each shuffler, so a word that calls a shuffler more than once is counted only once, and a word which uses two distinct shufflers is counted twice. You can see that dup, drop and swap are by far the most popular, and the more complex patterns are rarely used. It is interesting that when beginners first start writing Factor, they tend to write code which is heavily biased towards the less frequently used shufflers. I think teaching dataflow combinators before shufflers can help here, because they make it easier to structure your code in a clean way.
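
The counting rule can be checked with a small sketch. The word fourth-power below is hypothetical and exists only to call dup twice; the sketch also assumes that usage lives in the tools.crossref vocabulary, as it does in recent Factor releases.

```factor
USING: kernel math prettyprint sequences tools.crossref ;

! Hypothetical word that calls dup twice.
: fourth-power ( x -- y ) dup * dup * ;

! Count how many times fourth-power appears among the users of dup.
\ dup usage [ \ fourth-power = ] count .
```

If usage reports each calling word once, as described above, the count is 1 rather than 2.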

For cleave/spread/apply combinators, we have to do a bit more work to get a list of them programmatically, since they're spread over several help articles:
{ "cleave-combinators" "spread-combinators" "apply-combinators" }
[ >link uses ] map concat { either? both? } diff
[ word? ] filter

Now we can make the pie chart as before. Here is the result:

Again, bi completely dominates.

I'm not sure what to make of these results, other than that the Google Charts API is pretty nifty, and that there is untapped potential for "code data mining" in Factor's reflection capabilities. We could use this to discover potential abstractions and patterns.
