Saturday, June 07, 2008

Cleave combinators and URL objects

Expressing intent with cleave combinators

Even though the cleave combinators (bi, tri, bi*, tri*, and various others) are equivalent to a chain of keeps, e.g.,

[ A ] [ B ] [ C ] tri == [ A ] keep [ B ] keep C

they shouldn't always be used as a substitute for keep and other forms of stack shuffling. This is because the cleave form expresses intention. In particular, cleave combinators probably should not be used if the quotations take more than one input.
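As a small illustration (the quotations here are made up for the example), both of these lines take a single number and leave the same three results:

5 [ 1 + ] [ 2 * ] [ 3 - ] tri      ! leaves 6 10 2
5 [ 1 + ] keep [ 2 * ] keep 3 -    ! leaves 6 10 2

The tri form makes it immediately obvious that three independent computations are being applied to one value; the keep chain has to be traced through mentally.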

For example, here is a piece of code in cairo.gadgets:
M: png-gadget render*
    path>> normalize-path cairo_image_surface_create_from_png
    [ cairo_image_surface_get_width ]
    [ cairo_image_surface_get_height 2array dup 2^-bounds ]
    [ [ copy-surface ] curry copy-cairo ] tri
    GL_BGRA render-bytes* ;

I would instead write it as follows:
M: png-gadget render*
    path>> normalize-path cairo_image_surface_create_from_png
    [
        [ cairo_image_surface_get_width ]
        [ cairo_image_surface_get_height ] bi
        2array dup 2^-bounds
    ] [ [ copy-surface ] curry copy-cairo ] bi
    GL_BGRA render-bytes* ;

While this is a bit longer, I think it is easier to understand: now every quotation takes exactly one input.

The reason I'm not interested in enforcing this with a type system, however, is that there are exceptions.

One exception is when the quotations passed to the cleaver have a 'pipeline' stack effect ( obj val -- obj ). Then I think it is pretty clear what is going on if you do this -- you're storing the same value into two slots:

[ >>a ] [ >>b ] bi

or this -- you're storing three values into slots:
[ >>a ] [ >>b ] [ >>c ] tri*

For example, suppose you have a word which takes a string and parses it, outputting three values:
: parse-foo ( str -- a b c )

And you want to write a constructor for a tuple type:

: <bar> ( str d e -- bar )
    bar new
    swap >>e
    swap >>d
    swap parse-foo [ >>a ] [ >>b ] [ >>c ] tri* ;

A concrete example is the URL parsing code in the urls vocabulary. Here are a couple of pretty long (but hopefully straightforward) definitions which demonstrate this idiom:
: parse-host-part ( url protocol rest -- url string' )
    [ >>protocol ] [
        "//" ?head [ "Invalid URL" throw ] unless
        "@" split1 [
            [ ":" split1 [ >>username ] [ >>password ] bi* ] dip
        ] when*
        "/" split1 [
            parse-host [ >>host ] [ >>port ] bi*
        ] [ "/" prepend ] bi*
    ] bi* ;

M: string >url
    <url> swap
    ":" split1 [ parse-host-part ] when*
    "#" split1 [
        "?" split1
        [ url-decode >>path ]
        [ [ query>assoc >>query ] when* ] bi*
    ] [ url-decode >>anchor ] bi* ;
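To see what these definitions produce, here is a hypothetical session (the URL is made up for the example):

( scratchpad ) "http://www.example.com/index.html" >url host>> .
"www.example.com"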

First-class URLs

Which brings me to my next topic, URLs. I'm a big fan of using the most specific data types possible for everything: timestamps instead of integers with the implicit convention that they store milliseconds since an epoch, time durations instead of integers (and distinct from timestamps), symbols instead of magic strings and numbers, specialized arrays where possible, and so on. I also think that path names, money, and dimensioned units should be first-class types instead of plain strings and numbers; while we have libraries for those latter three, they're not used as much as the former (a shame, really).

While working on the routing and redirection part of Factor's web framework, I noticed a lot of boilerplate coming up that I couldn't easily abstract away. The natural solution seemed to be to promote URLs to a first-class data type instead of passing strings around.

URLs are very easy to use. First you have literal URLs:

URL" http://www.example.com/"

You can also construct them from strings at runtime:

"http://www.example.com/" >url

Or you can construct them from components:

<url>
    "ftp" >>protocol
    "ftp.example.com" >>host
    "/pub" >>path

Setting query parameters is supported, and the URL code takes care of URL encoding and decoding for you:

( scratchpad ) URL" http://www.example.com/?query=Hello%20World"
( scratchpad ) "query" query-param .
"Hello World"

Likewise, you can set query parameters with set-query-param.
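Something along these lines (the URL is made up, and I'm assuming set-query-param takes the new value and then the key, leaving the URL on the stack):

( scratchpad ) URL" http://www.example.com/search"
( scratchpad ) "hello world" "query" set-query-param
( scratchpad ) "query" query-param .
"hello world"

Note that the value round-trips through the URL's query assoc without you ever having to think about percent-encoding.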

URLs support two main operations: converting back to a string, and deriving a URL from a base URL and a relative URL.

For converting URLs to a string, I decided to make a new generic word and add a method to it for URLs, instead of defining a word specific to URLs. The word is called present and it lives in the present vocabulary. So far, it has methods for the following types:
  • strings
  • numbers
  • words
  • timestamps
  • URLs

Unlike unparse, which converts an object into source code, present outputs a string which is more human-readable, and the process is possibly lossy. For example, strings present as themselves, and numbers present as their string representation. Timestamps present in RFC822 format.
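For instance, the string and number cases from the list above look like this in a listener session:

( scratchpad ) 123 present .
"123"
( scratchpad ) "hello" present .
"hello"

Compare unparse, which would have produced "\"hello\"" for the string -- source code rather than something you'd show a user.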

Deriving a URL from a base URL and a relative URL allows you to take a URL with some components possibly missing, and fill them in from a base URL. The web framework uses this to ensure that all HTTP redirects sent to the client are absolute.
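In code, that looks something like this (assuming the deriving word is called derive-url and takes the base URL first; the URLs themselves are made up):

( scratchpad ) URL" http://www.example.com/foo" URL" /bar" derive-url
! yields a URL whose protocol and host are filled in from the base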

I refactored Daniel Ehrenberg's yahoo library to use URL objects for constructing the search query. The code is really concise now; just build a URL, pass it to http-get, and parse the XML result to build an array of search result objects. Parsing the XML only takes 5 lines of code!

Sometime soon, Doug Coleman will update his FTP client library to use URL objects as well, at which point we can define a protocol for reading and writing generic URLs, allowing code to be written which deals with FTP, HTTP, and HTTPS abstractly.
