Saturday, May 10, 2008

Garbage collection throughput improvements

The problem


About a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.

Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.

Turns out that every minor GC was taking on the order of a few milliseconds at the minimum, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and 64 byte card size, that's a million array entries to check.

Faster card scanning


The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because its a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);

for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
if(*quad_ptr & quad_mask)
{
F_CARD *ptr = (F_CARD *)quad_ptr;

int card;
for(card = 0; card < 4; card++)
{
if(ptr[card] & mask)
{
collect_card(&ptr[card],gen,here);
ptr[card] &= ~unmask;
}
}
}
}

This improved card scan performance by a factor of 3, which makes sense, because we do four times less loop iterations, but at some point, you still have to hit memory, which dampens things out a bit.

This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked.

On the other hard, a pointer recording approach where all stores are written into a buffer is also unacceptable, because in code with a high rate of mutation, you pay a hefty price every time the buffer overflows. So I had to find a compromise.

Introducing card decks


My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.

This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).

These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects, however on benchmarks which allocated longer-lived objects the improvements were still major but not so drastic. It turns out that if objects survives long enough to be copied from the nursery into aging space, then it is possible for this copying to begin to dominate running time.

Low-level optimizations


The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
CELL pointer = *handle;

if(!immediate_p(pointer) && should_copy(pointer))
*handle = copy_object(pointer);
}

CELL collect_next(CELL scan)
{
do_slots(scan,copy_handle);

if(collecting_gen == TENURED)
do_code_slots(scan);

return scan + untagged_object_size(scan);
}

This is very elegant code; do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all, after all it is a C compiler not an ML compiler, and it isn't Sufficiently Smart Either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which did not depend on its input parameters, only global variables whose value was invariant for the duration of the collection:
/* test if the pointer is in generation being collected, or a younger one. */
INLINE bool should_copy(CELL untagged)
{
if(in_zone(newspace,untagged))
return false;
if(collecting_gen == TENURED)
return true;
else if(HAVE_AGING_P && collecting_gen == AGING)
return !in_zone(&data_heap->generations[TENURED],untagged);
else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
return in_zone(&nursery,untagged);
else
{
critical_error("Bug in should_copy",untagged);
return false;
}
}

So again, I did a bit of hand-optimization. If foo() does not depend on the loop variable or the side effects of the body, then the following are equivalent:
while(A)
{
if(foo()) B else C;
}

if(foo)
{
while(A) B;
}
else
{
while(A) C;
}

In the optimizing compiler literature, this is known as loop unswitching.

The new collect_next() is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
CELL *obj = (CELL *)scan;
CELL *end = (CELL *)(scan + binary_payload_start(scan));

obj++;

CELL newspace_start = newspace->start;
CELL newspace_end = newspace->end;

if(HAVE_NURSERY_P && collecting_gen == NURSERY)
{
CELL nursery_start = nursery.start;
CELL nursery_end = nursery.end;

for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& (pointer >= nursery_start && pointer < nursery_end))
*obj = copy_object(pointer);
}
}
else if(HAVE_AGING_P && collecting_gen == AGING)
{
F_ZONE *tenured = &data_heap->generations[TENURED];

CELL tenured_start = tenured->start;
CELL tenured_end = tenured->end;

for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end)
&& !(pointer >= tenured_start && pointer < tenured_end))
*obj = copy_object(pointer);
}
}
else if(collecting_gen == TENURED)
{
for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end))
*obj = copy_object(pointer);
}

do_code_slots(scan);
}
else
critical_error("Bug in collect_next",0);

return scan + untagged_object_size(scan);
}

That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.

Benchmarks


The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;

: garbage-loop 150000000 [ garbage drop ] times ;

[ garbage-loop ] time

Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
BeforeAfter
1mb nursery:89084461
2mb nursery:64554451
4mb nursery:53364582
8mb nursery:47794582

The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;

: garbage-loop 15000000 [ garbage drop ] times ;

[ garbage-loop ] time

The results are similar:
BeforeAfter
1mb nursery:104182158
2mb nursery:6364 2178
4mb nursery:4966 3223
8mb nursery:4249 3294

An interesting phenomenon occurs where after the GC changes, a larger nursery actually begins to hurt performance. Dan and I hypothesized that this is due to the larger nursery hurting locality, with poor algorithms masking this effect with the older GC.

The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;

: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;

[ garbage-loop ] time

The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
BeforeAfter
1mb nursery/2mb aging:86935525
2mb nursery/2mb aging:73185255
4mb nursery/2mb aging:39392628
8mb nursery/2mb aging:23141526
1mb nursery/8mb aging:34551493
4mb nursery/4mb aging:30241894
4mb nursery/4mb aging:31571577
8mb nursery/8mb aging:15331003

The above benchmarks are somewhat unrealistic, however other benchmarks showed speedups too, some more drastic than others:


BenchmarkBeforeAfter
SHA1121987282
Random1898914459
Sort1568513476
Raytracer2770012966
Mandelbrot46331847

Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only until a month ago that GC time began to dominate benchmarks like that. However I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput in constrained-memory conditions.

Looking ahead


The next big change I will be making to the garbage collector is switching to using mark/sweep for the old generation. This will reduce memory consumption by almost 50%.

The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.

Monday, May 05, 2008

I/O changes, and process pipeline support

I made some improvements to the I/O system today.

Default stream variables


The stdio variable has been replaced by input-stream and output-stream, and there are four new words:

The first two close the stream after, the latter do not. The with-stream and with-stream* words are still around, they expect a duplex stream, unpack it, and bind both variables.

I've changed many usages of with-stream to use one of the unidirectional variants instead. This means that you can now write code like the following:
"foo.txt" utf8 [
"blah.txt" utf8 [
... read from the first file, write to the second file,
using read and write ...
] with-file-writer
] with-file-reader

Before you had to use this really ugly "design pattern":
"foo.txt" utf8 <file-reader> [
"blah.txt" utf8 <file-writer> [
<duplex-stream> [
...
] with-stream
] with-disposal
] with-disposal

Speaking of duplex streams, because they're not used by anything in the core anymore I have moved them to extra. They are still used by <process-stream> and <client>. I added a <process-reader> and <process-writer> word for those cases where you only want a pipe in one direction; they express intention better.

Pipes


The <process-stream> word has been around for a while, and this word used pipes internally, but they were not exposed in a nice, cross-platform way, until now.

The io.pipes vocabulary contains two words:

Appending process output to files


The io.launcher vocabulary supports the full array of input and output redirection features, and now, pipelines. There was one missing component: redirecting process output to a file opened for appending. Now this is possible. The following Factor code:
<process>
"do-stuff" >>command
"log.txt" <appender> >>stderr
run-process

Corresponds to this shell command:
do-stuff 2>> do-stuff.txt

Of course, shell script is a DSL for launching processes so it is more concise than Factor. However, Factor's io.launcher library now supports all of the features that the shell does, and its pretty easy to build a shell command parser using Chris Double's PEG library, which translates shell commands to sequences of Factor process descriptors in a pipeline.

And now, I present a concise illustration of the difference between the Unix philosophy and the Windows philosophy. Here we have two pieces of code, which do the exact same thing: create a new pipe, open both ends, return a pair of handles.

Unix:
USING: system alien.c-types kernel unix math sequences
qualified io.unix.backend io.nonblocking ;
IN: io.unix.pipes
QUALIFIED: io.pipes

M: unix io.pipes:(pipe) ( -- pair )
2 "int" <c-array>
dup pipe io-error
2 c-int-array> first2
[ [ init-handle ] bi@ ] [ io.pipes:pipe boa ] 2bi ;


Windows:
USING: alien alien.c-types arrays destructors io io.windows libc
windows.types math.bitfields windows.kernel32 windows namespaces
kernel sequences windows.errors assocs math.parser system random
combinators accessors io.pipes io.nonblocking ;
IN: io.windows.nt.pipes

: create-named-pipe ( name -- handle )
{ PIPE_ACCESS_INBOUND FILE_FLAG_OVERLAPPED } flags
PIPE_TYPE_BYTE
1
4096
4096
0
security-attributes-inherit
CreateNamedPipe
dup win32-error=0/f
dup add-completion
f <win32-file> ;

: open-other-end ( name -- handle )
GENERIC_WRITE
{ FILE_SHARE_READ FILE_SHARE_WRITE } flags
security-attributes-inherit
OPEN_EXISTING
FILE_FLAG_OVERLAPPED
f
CreateFile
dup win32-error=0/f
dup add-completion
f <win32-file> ;

: unique-pipe-name ( -- string )
[
"\\\\.\\pipe\\factor-" %
pipe counter #
"-" %
32 random-bits #
"-" %
millis #
] "" make ;

M: winnt (pipe) ( -- pipe )
[
unique-pipe-name
[ create-named-pipe dup close-later ]
[ open-other-end dup close-later ]
bi pipe boa
] with-destructors ;

The Windows API makes things difficult for no reason. Anonymous pipes do not support overlapped I/O, so you have to open a named pipe with a randomly-generated name (I'm not making this up, many other frameworks do the same thing and Microsoft even recommends this approach).

On the flipside, the nice thing about supporting both Unix and Windows is that I get to come up with true high-level abstractions that make sense, instead of being able to get away with having thin wrappers over Unix system calls as so many other language implementations do. For example, Ocaml's idea of "high-level I/O" is some basic POSIX bindings, together with incomplete emulation of Unix semantics on Windows. Look forward to writing a page of code if you want to map a file into memory or run a process and read its output into a string. And of course Java doesn't support pipes, I/O redirection for launching processes, or file system change monitoring, at all.

Tuesday, April 29, 2008

USA Zip code database

Recently someone posted a freely available Zip code database on reddit. The database is in CSV format.

Phil Dawes contributed a CSV parser to Factor a few days ago, and I thought this would be the perfect use-case to test out the parser.

I added the library to extra/usa-cities. It begins with the usual boilerplate:
USING: io.files io.encodings.ascii sequences sequences.lib
math.parser combinators kernel memoize csv symbols inspector
words accessors math.order sorting ;
IN: usa-cities

Then, we define some singleton types for the various states of the union. While this isn't strictly necessary, it allows us to write generic words which dispatch on states; for example, I'm sure Doug's taxes library could use this:
SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN
KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY ;

: states ( -- seq )
{
AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
} ; inline

ERROR: no-such-state name ;

M: no-such-state summary drop "No such state" ;

MEMO: string>state ( string -- state )
dup states [ word-name = ] with find nip
[ ] [ no-such-state ] ?if ;

Next up, we define a data type storing rows from the CSV database:
TUPLE: city
first-zip name state latitude longitude gmt-offset dst-offset ;

Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type:
MEMO: cities ( -- seq )
"resource:extra/usa-cities/zipcode.csv" ascii <file-reader>
csv rest-slice [
7 firstn {
[ string>number ]
[ ]
[ string>state ]
[ string>number ]
[ string>number ]
[ string>number ]
[ string>number ]
} spread city boa
] map ;

This word is tricky; some notes on its workings:

The word is memoized so of course it will only load the database once.

We can now define words to query it:
MEMO: cities-named ( name -- cities )
cities [ name>> = ] with filter ;

MEMO: cities-named-in ( name state -- cities )
cities [
tuck [ name>> = ] [ state>> = ] 2bi* and
] with with filter ;

: find-zip-code ( code -- city )
cities [ first-zip>> <=> ] binsearch* ;

Now, let's look at some examples.

First, let's look up my Zip code:
( scratchpad ) 55406 find-zip-code .
T{ city f 55406 "Minneapolis" MN 44.938615 -93.22082 -6 1 }

And another famous Zip code:
( scratchpad ) 90210 find-zip-code .
T{ city f 90210 "Beverly Hills" CA 34.088808 -118.40612 -8 1 }

How many states have a city named "Minneapolis"?
( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune .
V{ NC MN KS }

What is the possible range of Zip codes for Austin?
( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi
73301
78972

There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city.

An addendum to "The new HTTP server, part 2"

If you run the web app presented in the last blog post verbatim, you will get a "500 Internal server error" with no further indication of what's going wrong. This is because the code I presented has a minor omission.

The opaque error message is intentional: if your web app crashes, you don't necessarily want to expose internal details to every user that comes along (one famous case was reddit.com, which leaked a portion of their Python codebase inside a stack trace at some point). However, if you set the development-mode global variable to a true value, the behavior of the HTTP server changes in two respects:

If we enable development mode, we see the real error message, "No such table: SESSIONS". This is because I didn't mention that one must initialize the database by creating the table for storing sessions first:
"counter.db" sqlite-db [ init-sessions-table ] with-db

In the next installment of this series, which hopefully won't take as long as the second one did, I will discuss form validation and the templating system.

Write once, run anywhere

Apple finally released Java 6 for Mac OS X, about a year late, and it only runs on 64-bit Intel Macs. It also requires Leopard, which isn't such a big deal because anyone can upgrade. I sold my G5 but a lot of people out there still use PowerPC Macs and I guess they're just screwed. It's a shame too because Java 6 is the first release that's starting to approach true usability for desktop applications.

The new HTTP server, part 2

It's been a month and a half since the first part of this series. Why the long delay? I've been busy with other things. I implemented inheritance, various compiler optimizations, and many other things. In the last couple of weeks I've been working on the web framework again, tying up some loose ends and porting more existing web applications over (namely, the pastebin and planet factor).

In this entry I will talk about session management. Session management was one of the first things I implemented in the new framework when I started working on it, but recently I gave the code an overhaul.

Session management


The basic idea behind session management is that while HTTP is a stateless protocol, we can simulate state by sending a token to the client -- either in the form of a hidden element on the page, or a cookie, which the client sends back to the server with a later request. This token is associated with an object on the server and the object holds state between requests.

Another approach for session management is to store state entirely on the client; instead of sending the client a session ID identifying an object on the server, you send the session data itself to the client. Traditionally this approach has only been used for user preferences and such where security is immaterial, but it can even be used for more sensitive data by encrypting it with a private key only known to the server. The client receives an opaque blob of binary data which cannot be inspected or tampered with (unless the public key encryption algorithm being used is compromised).

Currently Factor's session manager does not support client-side sessions, but it will soon, using Doug Coleman's public-key encryption code. Server-side sessions are supported, however.

The session manager uses two main strategies to pass state to the client:

The idea is to strike a balance between security and convenience; we don't want to add a session ID to every link and start a new session if the user navigates to the site by directly entering a URL, but on the other hand we don't want potentially destructive POST requests to be accepted unless they were sent by a form generated from within the session itself.

In Factor, a session is simply a hashtable where values can be stored. Keys are known as "session variables" and values can be read and written with the sget and sset words, there's also a schange combinator which applies a quotation is applied to an existing session variable to yield a new value. This all entirely analogous to the get/set/change words for dynamic variables.

Session namespaces are serialized and stored in a database using Doug's db.tuples O/R mapper. I originally supported pluggable "session storage" backends, with database storage and in-memory storage as the two options, however I decided to simplify the code and hardcode database storage. This has the side-effect that you'll need to set up a database to use the session management feature, however SQLite presents a lightweight option which requires no configuration, so I don't think this is a big deal at all.

I will show a small example of a 'counter' web application, much like the counter example for the Seaside framework.

We start off with a vocabulary search path:
USING: math kernel accessors http.server http.server.actions
http.server.sessions http.server.templating.fhtml locals ;
IN: webapps.counter

Now, we define a symbol used to key a session variable:
SYMBOL: count

Next, we define a pair of actions which increment the counter value, using the schange combinator. The display slot of an action contains code to be executed upon a GET request; it is expected to output a response object. In our case, the word outputs an action which applies the quotation to the current counter value; the action outputs a response which redirects back to the main page:
:: <counter-action> ( quot -- action )
<action> [
count quot schange
"" f <standard-redirect>
] >>display ;

The action to decrement the counter is entirely analogous:
: <dec-action> ( -- action )
<action> [ count [ 1- ] schange f ] >>display ;

Note that this word constructs actions, instead of invoking them. This approach is more flexible than the old "furnace" web framework, where actions were mapped directly to word execution, because it allows one to write "higher-order actions" parametrized by values more easily.

Here is the default action; it displays the counter value using a template:
: <counter-action> ( -- action )
<action> [
"resource:extra/webapps/counter/counter.fhtml" <fhtml>
] >>display ;

Finally we put everything together in a dispatcher:
: <counter-app> ( -- responder )
counter-app new-dispatcher
[ 1+ ] <counter-action> "inc" add-responder
[ 1- ] <counter-action> "dec" add-responder
<display-action> "" add-responder
<sessions> ;

We create a dispatcher, add instances of our actions to it, and wrap the whole thing in a session manager.

Now, the template:
<% USING: io math.parser http.server.sessions webapps.counter ; %>

<html>
<body>
<h1><% count sget number>string write %></h1>

<a href="inc">++</a>
<a href="dec">--</a>
</body>
</html>

Finally, once we have all the parts, we can create the counter responder and start the HTTP server:
<counter-app> "test.db" sqlite-db <db-persistence> main-responder set
8888 httpd

Note that here we wrap the counter responder in another layer of indirection, this time for database persistence; while the counter web app doesn't use persistence the session manager does, and we chose to use SQLite since it requires no configuration or external services.

Navigating over to http://localhost:8888/ should now display the counter app, and clicking the increment and decrement links should have an effect on the displayed value. Sessions persist between server restarts and time out after 20 minutes of inactivity by default. Looking at your web browser's cookie manager will show that a factorsessid cookie has been set.

As an aside, the Seaside version uses continuations to maintain state. The Factor version explicitly maintains state. Even though I ported Chris Double's modal web framework over to the new HTTP server, I'm avoiding continuations in favor of explicit state for now. I am building up a form component framework with validation, easy persistence, and user authentication without resorting to continuations, and I plan on building a state-machine model with a page flow DSL, much like jBPM, to handle more complex multi-page flows such as shopping carts. While this will result in more work for me, I believe the benefits include transparent support for load-balancing and fail-over, readable URLs, and ultimately, simpler and more reusable web application code because page flow can be decoupled from logic and expressed in a custom DSL intended for that purpose.

The Seaside version is also somewhat shorter; it is easy to express with idiomatic Seaside (transparent session management, presentation logic mixed in with web app code). I will add better abstractions to make up for some of the difference, and for larger applications there should be no difference in code size; in fact since the scope of Factor's framework is wider than Seaside (it covers persistence, authentication and validation, and soon, versioning of persistent entities) you might even need less code to accomplish the same thing.

Virtual hosting


The other topic I promised to cover last time was virtual hosting. Virtual hosting is done with dispatchers, much like nested directory structure is. You create a virtual host dispatcher with <vhost-dispatcher> and add responders for various virtual hosts using add-responder; the >>default slot can be used to set the default virtual host. The key difference between the new approach and the old HTTP server virtual hosting implementation, which relied on a global hashtable mapping virtual host names to responders, is flexibility; the virtual host dispatcher does not necessarily have to be your top-level responder.

For example, the <boilerplate> responder gives you a way of enforcing a common look and feel across a set of web apps, by adding common headers and footers to every page. While I will describe boilerplate responders and the template system in more detail in a later post, for now here is an example:
<vhost-dispatcher>
<online-store> "store.acme.com" add-responder
<support-site> "support.acme.com" add-responder
<main-site> "acme.com" add-responder
<boilerplate>
"acme-site.xml" >>template
acme-db <db-persistence>
<sessions> main-responder set

Here, all virtual hosts share the same session management, database persistence, and common theme, and the virtual host dispatch only happens after the request filters through the mentioned layers of functionality. This would not be possible with the old HTTP server without duplicating code.

Cookies


Finally I promised to talk about cookies. The session management support is great but sometimes you just want to get and set cookies directly. This can be done by reading the cookies slot of the request object, and writing the cookies slot of the response object. The slot contains a sequence of cookie objects, which are parsed and unparsed from their HTTP representation for you. A cookie object contains a series of slots, such as name, value, expiration date (as a Factor timestamp object), max-age (as a Factor duration object), path, and host. While the expiration date is deprecated as of HTTP/1.1, most sites still use it in favor of max-age because older browsers don't support max-age. Factor's HTTP server sets the date header on each response so that expiration dates can work correctly.

Here is an example of using the HTTP client (which shares the cookie code with the server) to look at Google's ridiculously long-lived cookies:
( scratchpad ) "http://www.google.com" http-get-stream drop cookies>> first describe
cookie instance
"delegate" f
"name" "pref"
"value" "ID=c0f4c074cd87502e:TM=1209466656:LM=1209466656:S=_6gGEKtuTgP..."
"path" "/"
"domain" ".google.com"
"expires" T{ timestamp f 2010 4 29 10 57 36 ~duration~ }
"max-age" f
"http-only" f

Saturday, April 19, 2008

Performance improvements

Over the last three days I spent some time improving Factor's compiler. I made the following improvements:

These changes have let to some performance improvements on the two benchmarks I was working with. My computer here is a MacBook Pro with a 2.4 GHz Intel Core Duo 2.

The project-euler.150 benchmark saw its runtime decrease from 87 seconds to 22 seconds.

The benchmark.recursive benchmark, which is a Factor implementation of the recursive benchmark from the Computer Language Shootout, saw its runtime decrease from 27 seconds to 9 seconds.

For comparison, SBCL runs the recursive benchmark in 3.5 seconds, and Java runs it in 1.6 seconds.

Using the java -Xint result of 22 seconds, I guesstimated from the results for all languages on the shootout that Factor at around the performance of the Erlang HIPE JIT, and slightly faster than the Python Psyco JIT and GForth. By anybody's standard, this is anywhere between "not very fast" and "bloody slow", but its slowly improving.

Now I'll end with a rant about the language shootout.

The only so-called "dynamic" language implementations which come close to the performance of Java and C on this benchmark are SBCL, Ikarus Scheme, and Chicken Scheme. However, all these benchmarks are actually static programs in disguise, peppered with type declarations and even unsafe low-level features. The Ikarus and Chicken versions even implement the same functions twice, once for integer inputs and using integer arithmetic primitives, and another time for float inputs and using float arithmetic primitives. Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest.

The Haskell benchmarks suffer from this also. Many benchmarks seem to use malloc and pointer arithmetic, and exclamation points (strictness annotations) abound. The n-body benchmark contains this gem:
planets :: Ptr Double
planets = unsafePerformIO $ mallocBytes (7 * nbodies * 8) -- sizeOf double = 8

I would expect better from the Haskell people than hardcoding the size of a double to do pointer arithmetic on manually-managed memory!

Right now I have no idea how idiomatic Haskell performs in practice; from looking at these benchmarks, it is clear to me that if one writes C in Haskell, pretty decent performance is possible, but that's not saying much. You can write C in any language.

The runtime of the Factor recursive benchmark further improves by two-fold if I manually replace generic arithmetic with unsafe machine arithmetic, however I'm not interested in writing such code. I don't want to expose low-level primitives to the user, document their use as necessary for performance critical code, and declare that my job is done.

My top 8 shell commands

I wrote a short Factor script to check my bash history and tally up the most frequently occurring commands.
    git
./factor
bf
cdf
ls
cd
fc
push

There are some aliases here:

It seems that the only thing I use my computer for is running Factor!

Friday, April 11, 2008

Multi-touch gestures in the Factor UI

Recently I acquired a MacBook Pro. The new models come with a multi-touch keypad and I've found it extremely useful in Safari to be able to go back, go forward and zoom with the mouse. However, I missed the ability to do this in other applications, especially the Factor UI. The Factor UI already supports vertical and horizontal scrolling gestures, and I wanted to be able to use the other gestures as well.

Apparently, Apple's official stance is that no multitouch API will be made public until 10.6. I was slightly discouraged by this but some more Googling turned up a blog post by Elliott Harris detailing the undocumented API for receiving these events.

While normally I shy away from relying on undocumented functionality in this case the API is dead-simple and it is almost an oversight on Apple's part not to document it. And if they break it, well, it will be easy to update Factor too.

Here are the changes I had to make. First, I added some new UI gestures to extra/ui/gestures/gestures.factor; these are cross-platform and theoretically the Windows and X11 UI backends could produce them too, perhaps as a result of button presses on those "Internet" keyboards:
TUPLE: left-action ;        C: <left-action> left-action
TUPLE: right-action ; C: <right-action> right-action
TUPLE: up-action ; C: <up-action> up-action
TUPLE: down-action ; C: <down-action> down-action

TUPLE: zoom-in-action ; C: <zoom-in-action> zoom-in-action
TUPLE: zoom-out-action ; C: <zoom-out-action> zoom-out-action

Next, I edited extra/ui/cocoa/views/views.factor with some new methods for handling the new multitouch gestures, and translating them to Factor gestures:
{ "magnifyWithEvent:" "void" { "id" "SEL" "id" }
[
nip
dup -> deltaZ sgn {
{ 1 [ T{ zoom-in-action } send-action$ ] }
{ -1 [ T{ zoom-out-action } send-action$ ] }
{ 0 [ 2drop ] }
} case
]
}

{ "swipeWithEvent:" "void" { "id" "SEL" "id" }
[
nip
dup -> deltaX sgn {
{ 1 [ T{ left-action } send-action$ ] }
{ -1 [ T{ right-action } send-action$ ] }
{ 0
[
dup -> deltaY sgn {
{ 1 [ T{ up-action } send-action$ ] }
{ -1 [ T{ down-action } send-action$ ] }
{ 0 [ 2drop ] }
} case
]
}
} case
]
}

Note that I'm throwing away useful information for the sake of simplicity; the zoom gesture gives you a precise zoom amount, not just +1/-1. The swipe gestures seem to be completely discrete though. I haven't implemented rotation gestures yet because I haven't figured out what to use them for.

With the above code written, the UI now sends multi-touch gestures to gadgets however no gadgets used them yet. I fired up the gesture logger, "gesture-logger" run, and tested that the new actions actually get sent. They were, except the first time I messed up the code and got the signs the wrong way round.

With that fixed, I could proceed to add gesture handlers to the various UI tools. I edited various source files:
browser-gadget "multi-touch" f {
{ T{ left-action } com-back }
{ T{ right-action } com-forward }
} define-command-map

inspector-gadget "multi-touch" f {
{ T{ left-action } &back }
} define-command-map

workspace "multi-touch" f {
{ T{ zoom-out-action } com-listener }
{ T{ up-action } refresh-all }
} define-command-map

I think its pretty clear from the above what the default gesture assignments are:

15 minutes of Google, 20 minutes of hacking, and now Factor supports the fanciest feature of Apple's latest hardware.

Improvements to io.monitors; faster refresh-all

Factor's io.monitors library previously supported Mac OS X, Windows and Linux. Now it also supports BSD, but in a much more restricted fashion than the other platforms. Basically you cannot monitor directories, just individual files. This is because kqueue() only provides very limited functionality in this regard. However, having something is better than nothing, and the functionality provided on BSDs is still useful for monitoring log files and such.

On Linux, inotify doesn't directly support monitoring recursive directory hierarchies so Factor's monitors didn't support recursive monitoring, but a mailing list post by Robert Love discusses how to solve this issue in user-space. I used his solution to implement recursive monitors on Linux.

Another oddity relating to inotify is that if you add the same directory twice to the same inotify, you get the same watch ID both times, and events are only reported once. This means that the previous implementation where there was one global inotify instance shared by all monitors wasn't really as general as one would hope, because you couldn't run two programs that monitor overlapping portions of the file system. I thought of several possible fixes but in the end just changed the monitors API to accommodate this case. All monitor operations must now be wrapped in a with-monitors combinator. On Linux, it creates an inotify instance and stores it in a dynamically-scoped variable, so that subsequent calls to <monitor use this inotify. Independent inotifies in different threads don't interact at all. On Mac OS X, BSD and WIndows, with-monitors just calls the quotation without doing any special setup.

Another issue I fixed was that on Mac OS X, monitors would only work when used from the UI because no run loop was running otherwise. I made a run loop run all of the time and this allows monitors to work in the terminal listener.

Now that monitors are working better, I was able to use them to make refresh-all. This word finds all changed source files in the vocabulary roots and reloads them. It does this by comparing cached CRC32 checksums with the actual checksum of the file. Previously it would also compare modification times, but I took that code out because filesystem meta-data queries got moved out of the VM and into the native I/O code, which isn't available during bootstrap. A side-effect of this is that refresh-all became much slower, because it had to read all files. Using monitors I was able to make this faster than it has ever been. A thread waiting on a monitor is started on startup. Then, the source tree only has to be checksummed in its entirety the first time refresh-all is used in a session. Subsequently, only files for which the monitor reported changes have to be scanned. So refresh-all runs instantly if there are no changes, and so on.

Tuesday, April 08, 2008

The golden rule of writing comments

This is something I've wanted to say for a while, and I think many (most?) programmers don't realize it:

Never comment out code.

Comments are for natural-language descriptions of code, or pseudocode maybe.

If you comment out some code, then the parser isn't checking the syntax, the compiler isn't checking semantics, and the unit tests are not unit testing it.

So the code may as well not work. Why have non-working code in your program, especially if it's not called anywhere?

But perhaps the code is there so that you can see the what the program used to do. In that case, just fire up your favorite graphical GIT/SVN/P4/whatever frontend and check out an older revision.

Multi-methods and hooks

For a while now Factor has had 'hooks', which are generic words dispatching on a dynamically scoped variable. Hooks can be used with variables which are essentially global: the current OS, current CPU, UI backend, etc -- or variables which are truly context-specific, such as the current database connection.

I added support for hooks to the extra/multi-methods library, which is going in the core soon. While doing this I was able to significantly generalize the concept. Suppose we want to define a piece of functionality which depends on both the operating system and processor type.

We can begin with defining an ordinary generic word:
GENERIC: cache-size ( -- l1 l2 )

Notice that it is defined as taking no inputs from the stack.

I don't really know the APIs involved here, but suppose that Linux gives us a way to get this info that works across all CPUs. So we define a method specializing on the os dynamic variable (which is really treated like global):
METHOD: cache-size { { os linux } } ... read something from /proc ... ;

Now suppose on Mac OS X we use different APIs per CPU:
METHOD: cache-size { { os macosx } { cpu ppc } } ... ;

METHOD: cache-size { { os macosx } { cpu x86.32 } } ... ;

METHOD: cache-size { { os macosx } { cpu x86.64 } } ... ;

But perhaps on ARM CPUs, we just use an instruction to read the cache size without any OS-specific calls at all:
METHOD: cache-size { { cpu arm } } ... ;

Now in this case, you have an issue where if you're on Linux and ARM, the method that ends up being called depends on the order in which they were defined. If you wanted to explicitly resolve this ambiguity, you would define a new method on { { os linux } { cpu arm } }; because it is more specific than the other two, it is always called first.

The powerful thing about this new implementation of hooks is that not only can you dispatch on multiple variables, but you can add methods to any old generic which dispatches on a variable and the original designer of the generic does not have to explicitly take this into account.

For example, Factor's compiler currently has a large number of hooks dispatching on the CPU type (as an aside, Phil Dawes wrote an excellent introduction to the Factor compiler recently). If those hooks need to be further refined by OS, as is often the case with FFI-related components, the method implementation on the hook needs to perform its own dispatch; this is the "double dispatch" pattern and design patterns are something to be avoided if one wants to write quality code. When multi-methods go in the core, the compiler will simply define a series of generic words taking no inputs from the stack, and each method will specialize on the CPU, and maybe an OS too.

Another new capability is dispatching off stack values and variables in the same method. Among other things, this will be useful in eliminating a case of double dispatch in the core right now, where the <client> word for opening a client socket has to dispatch off the address type on the stack, and then call another hook which dispatches on the I/O backend stored in a variable. This can be combined into a single generic word where some methods dispatch on stack values, and others dispatch on the I/O backend variable.

The other nice thing about this is that the multi-method GENERIC: word unifies and generalizes four words in the core, GENERIC:, GENERIC# for dispatching on a value further down on the stack, HOOK: for dispatching on a variable, and MATH: which performs double dispatch on numbers only.

One of my goals for Factor 1.0 is to get the object system "right", and with the new-style slot accessors, inheritance, and singletons, we're almost there. All that remains to be done is to merge the multi-methods code. The code is still not quite ready to go in the core, though. The only feature that single dispatch has and multiple-dispatch lacks is call-next-method, which is easy to implement. A bigger hurdle to clear is performance; right now multi-methods are implemented in a naive way, where the dispatch time is O(mn) with m methods and n stack positions and variables to dispatch on. This can be improved significantly and I will find time reading the literature on optimizing method dispatch over the next few weeks.

Saturday, April 05, 2008

GC fixes and improvements

With some advice from Daniel Ehrenberg, I have made a few fixes and improvements to Factor's garbage collector.

First of all, there was a potential memory leak situation I overlooked. Suppose that there is a compiled definition in the code heap which references a very large object in the data heap, and neither the compiled definition or the large object is referenced from anywhere else. Because the code heap was only ever collected when it filled up, it was possible that this large object would never be reclaimed, and it would incur unnecessary collection cycles and memory pressure as a result. This could even result in unbounded heap growth if this was done in a loop. For example, the following test case would crash Factor even though it should have run in constant space:
: leak-step 800000 f <array> 1quotation call drop ;

: leak-loop 1000 [ leak-step ] times ;

The fix is to always collect the code heap when collecting the oldest generation. Only collecting the code heap when it fills up is simply unsound because the code heap can reference objects in the data heap. If this was not the case then collecting them independently would be okay, but it isn't.

The next thing I did was improve how allocation of large objects is handled. Previously, if a new object was too large to fit in the nursery, the entire heap would grow, and every time the heap grew it would increase the size of all generations. This is generally not what you want, because if the nursery is too large, we lose locality, and if the accumulation space is too large we waste time copying objects back and forth that should really be in tenured space.

Now, the nursery and accumulation space have a fixed size that can be changed on startup with command line switches, but never changes while Factor is running. If an attempt is made to allocate an object larger than the nursery, it is directly allocated in tenured space. Presumably, if you're making a 2 megabyte array, you're going to do something with it, and hold on top it for at least a few collection cycles, rather than discard it immediately, so it makes sense to avoid the copying altogether and stick it in tenured space. On retarded microbenchmarks this change might result in worse performance, but on more realistic workloads it should be better; having a small nursery helps with locality and avoiding the inevitable copying of the large array from the nursery to accumulation space and then to tenured space is good too.

Future improvements to the GC will include:
Daniel and I will be working on all of this over the summer.

This page is powered by Blogger. Isn't yours?