Tuesday, April 29, 2008

USA Zip code database

Recently someone posted a freely available Zip code database on reddit. The database is in CSV format.

Phil Dawes contributed a CSV parser to Factor a few days ago, and I thought this would be the perfect use-case to test out the parser.

I added the library to extra/usa-cities. It begins with the usual boilerplate:
USING: io.files io.encodings.ascii sequences sequences.lib
math.parser combinators kernel memoize csv symbols inspector
words accessors math.order sorting ;
IN: usa-cities

Then, we define some singleton types for the various states of the union. While this isn't strictly necessary, it allows us to write generic words which dispatch on states; for example, I'm sure Doug's taxes library could use this:
SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN
KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY ;

: states ( -- seq )
{
AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
} ; inline

ERROR: no-such-state name ;

M: no-such-state summary drop "No such state" ;

MEMO: string>state ( string -- state )
dup states [ word-name = ] with find nip
[ ] [ no-such-state ] ?if ;

Next up, we define a data type storing rows from the CSV database:
TUPLE: city
first-zip name state latitude longitude gmt-offset dst-offset ;

Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type:
MEMO: cities ( -- seq )
"resource:extra/usa-cities/zipcode.csv" ascii <file-reader>
csv rest-slice [
7 firstn {
[ string>number ]
[ ]
[ string>state ]
[ string>number ]
[ string>number ]
[ string>number ]
[ string>number ]
} spread city boa
] map ;

This word is tricky; some notes on its workings:
  • We begin by opening a stream for reading from the CSV file with ASCII encoding.
  • The csv word reads CSV data from a stream.
  • The first line of the file consists of column headings and not actual data, so we ignore it by using the non-copying variant of rest, rest-slice (recall that the primary sequence type in Factor is an array, so removing the first element makes a copy; a slice is a virtual sequence presenting a view of a subsequence of an array).
  • The spread combinator takes a sequence of quotations, Q1,...,QN, and n values from the stack, X1,...XN, and outputs Q1[X1],...,QN[XN]. In this case we're taking the first 7 elements of each row of CSV data (each row has exactly 7 columns so in effect we're pushing every element of the sequence on the stack), then using spread to convert some columns from their initial text format into something more useful to us; a state singleton or a number.
  • Finally, we use city boa to construct a city tuple "by order of arguments"; this slurps the 7 stack values and stores them into a new instance of city (note that the definition of the city type has exactly 7 slots and they are defined in the same order as the columns of the file).
  • Finally, we map over the sequence of rows to perform the above steps on each row of the file. The result is a sequence of city instances.

The word is memoized so of course it will only load the database once.

We can now define words to query it:
MEMO: cities-named ( name -- cities )
cities [ name>> = ] with filter ;

MEMO: cities-named-in ( name state -- cities )
cities [
tuck [ name>> = ] [ state>> = ] 2bi* and
] with with filter ;

: find-zip-code ( code -- city )
cities [ first-zip>> <=> ] binsearch* ;

Now, let's look at some examples.

First, let's look up my Zip code:
( scratchpad ) 55406 find-zip-code .
T{ city f 55406 "Minneapolis" MN 44.938615 -93.22082 -6 1 }

And another famous Zip code:
( scratchpad ) 90210 find-zip-code .
T{ city f 90210 "Beverly Hills" CA 34.088808 -118.40612 -8 1 }

How many states have a city named "Minneapolis"?
( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune .
V{ NC MN KS }

What is the possible range of Zip codes for Austin?
( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi
73301
78972

There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city.

No comments: