Phil Dawes contributed a CSV parser to Factor a few days ago, and I thought this would be the perfect use-case to test out the parser.
I added the library to
extra/usa-cities
. It begins with the usual boilerplate:USING: io.files io.encodings.ascii sequences sequences.lib
math.parser combinators kernel memoize csv symbols inspector
words accessors math.order sorting ;
IN: usa-cities
Then, we define some singleton types for the various states of the union. While this isn't strictly necessary, it allows us to write generic words which dispatch on states; for example, I'm sure Doug's
taxes
library could use this:SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN
KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY ;
: states ( -- seq )
{
AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
} ; inline
ERROR: no-such-state name ;
M: no-such-state summary drop "No such state" ;
MEMO: string>state ( string -- state )
dup states [ word-name = ] with find nip
[ ] [ no-such-state ] ?if ;
Next up, we define a data type storing rows from the CSV database:
TUPLE: city
first-zip name state latitude longitude gmt-offset dst-offset ;
Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type:
MEMO: cities ( -- seq )
"resource:extra/usa-cities/zipcode.csv" ascii <file-reader>
csv rest-slice [
7 firstn {
[ string>number ]
[ ]
[ string>state ]
[ string>number ]
[ string>number ]
[ string>number ]
[ string>number ]
} spread city boa
] map ;
This word is tricky; some notes on its workings:
- We begin by opening a stream for reading from the CSV file with ASCII encoding.
- The
csv
word reads CSV data from a stream. - The first line of the file consists of column headings and not actual data, so we ignore it by using the non-copying variant of
rest
,rest-slice
(recall that the primary sequence type in Factor is an array, so removing the first element makes a copy; a slice is a virtual sequence presenting a view of a subsequence of an array). - The
spread
combinator takes a sequence of quotations,Q1,...,QN
, and n values from the stack,X1,...XN
, and outputsQ1[X1],...,QN[XN]
. In this case we're taking the first 7 elements of each row of CSV data (each row has exactly 7 columns so in effect we're pushing every element of the sequence on the stack), then usingspread
to convert some columns from their initial text format into something more useful to us; a state singleton or a number. - Finally, we use
city boa
to construct a city tuple "by order of arguments"; this slurps the 7 stack values and stores them into a new instance ofcity
(note that the definition of thecity
type has exactly 7 slots and they are defined in the same order as the columns of the file). - Finally, we
map
over the sequence of rows to perform the above steps on each row of the file. The result is a sequence ofcity
instances.
The word is memoized so of course it will only load the database once.
We can now define words to query it:
MEMO: cities-named ( name -- cities )
cities [ name>> = ] with filter ;
MEMO: cities-named-in ( name state -- cities )
cities [
tuck [ name>> = ] [ state>> = ] 2bi* and
] with with filter ;
: find-zip-code ( code -- city )
cities [ first-zip>> <=> ] binsearch* ;
Now, let's look at some examples.
First, let's look up my Zip code:
( scratchpad ) 55406 find-zip-code .
T{ city f 55406 "Minneapolis" MN 44.938615 -93.22082 -6 1 }
And another famous Zip code:
( scratchpad ) 90210 find-zip-code .
T{ city f 90210 "Beverly Hills" CA 34.088808 -118.40612 -8 1 }
How many states have a city named "Minneapolis"?
( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune .
V{ NC MN KS }
What is the possible range of Zip codes for Austin?
( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi
73301
78972
There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city.
No comments:
Post a Comment