Saturday, May 24, 2008

Removing duplicates from a sequence

Albert van der Horst posted to comp.lang.forth about an "elementary but surprisingly difficult problem"; here's a snippet:

An area given as (addr_start, addr_end) contains items
of length ``len''. I want to get rid of the duplicates.

Sorting the items is pretty easy, qsort can handle that.
Now eliminating the duplicates, we need only compare one
item to the next.
Something along compact (start, end, len -- end' )
The new end should be returned, of course.

Even using variables/locals, I find this difficult.


The problem with Forth is that while it gives you all the building blocks needed to construct high-level abstractions, it doesn't actually give you any pre-packaged ones, and there's no culture of code sharing or library development in the Forth community either, so everybody has to reinvent basic sequence-processing operations. In contrast, Factor's collection library makes this problem trivial.

Suppose S is our sequence, already sorted. First of all, S[0] is always in the resulting sequence, R. For n >= 1, S[n] is in the result R only if S[n] != S[n-1]. Suppose we make a new sequence, T, such that T[n] is the pair (S[n],S[n+1]). Then we can get R by removing all pairs from T in which both elements are equal, mapping over the result to take the second element from each pair, and finally prepending S[0] to this.

Here is the Factor implementation:
[ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix

No variables, locals or stack shuffling needed.

Here is how the code works:

2 clump takes a sequence on the stack and gives a new sequence of '2 element clumps':

( scratchpad ) { 1 2 2 3 4 4 4 5 6 } 2 clump .
{
    { 1 2 }
    { 2 2 }
    { 2 3 }
    { 3 4 }
    { 4 4 }
    { 4 4 }
    { 4 5 }
    { 5 6 }
}

Next, all-equal? checks if all elements in an array are equal. We want to iterate over the above sequence of pairs, discarding those where both elements are equal; this is what [ all-equal? not ] filter achieves:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter .
{ { 1 2 } { 2 3 } { 3 4 } { 4 5 } { 5 6 } }

Next, we want to take the second element of each pair, [ second ] map:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter [ second ] map .
{ 2 3 4 5 6 }

Finally, we use bi because we do two things to the array; we first find the second element of each non-duplicate clump, then we take the first element of the original array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi . .
1
{ 2 3 4 5 6 }

Now that we have two things on the stack, in the right order, we call prefix to add the first element onto the front of the deduplicated array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ 1 2 3 4 5 6 }

Of course, this approach allocates some unnecessary intermediate structure. However, it took me about 20 seconds to write, and it worked the first time; I didn't have to mess around with variables, locals, or stack operations.

It works with other data types also:
( scratchpad ) { "a man" "a man" "a plan" "a canal" "panama" "panama" }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ "a man" "a plan" "a canal" "panama" }


Update: Daniel Ehrenberg came up with an even shorter version; figuring it out is an exercise for the reader: [ 2 clump [ = not ] assoc-filter values ] [ first ] bi prefix.

Now, removing duplicates is so common that Factor already has a word in the sets vocabulary named prune which does exactly this, but with a different algorithm:

If you sort and then iterate, you get an O(n log n) algorithm, and it also depends on there being an intrinsic order among the elements. Instead, you can use a hashtable; if you have good hashcodes, then in theory you get an O(n) algorithm (but the constant factors might be worse). The idea is that you iterate over your sequence and check whether each element is in the hashtable. If it's not there, it's a new element, so you flag it as such and add it to the hashtable. If it's already in the hashtable, you skip it.
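
For example, on the sequence we used above, prune gives the same answer as the clump version, and since it hashes elements rather than comparing neighbors, the input doesn't have to be sorted first. A quick listener session (a sketch, assuming the sets vocabulary is loaded):
( scratchpad ) USE: sets
( scratchpad ) { 1 2 2 3 4 4 4 5 6 } prune .
{ 1 2 3 4 5 6 }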

Thursday, May 22, 2008

SSL/TLS support added

Some time last year, Elie Chaftari implemented an OpenSSL binding in Factor (and by the way, Elie, good luck with Medioset!). I took this binding and extended it into a high-level SSL library which integrates with Factor's streams and sockets. The effort involved was non-trivial, because OpenSSL's nonblocking I/O behavior is somewhat confusing and not very well documented. So far it only works on Unix. Integration with the Windows overlapped I/O system is still pending.
I will start by presenting a simple client/server application that uses standard sockets, then convert it to use SSL.
We will implement an application which sends the current time of day, formatted as an RFC822 string, to the user.
First of all, how do we print such a string? We get the current time with now, which pushes a timestamp object on the stack, then convert it to an RFC822 string with timestamp>rfc822:
( scratchpad ) USING: calendar calendar.format ;
( scratchpad ) now timestamp>rfc822 print
Thu, 22 May 2008 01:25:18 -0500

We want to send this to every client that connects, so we use the with-server combinator from the io.server vocabulary. It takes four parameters:
  • A sequence of addresses to listen on. We create these addresses by passing a port number to the internet-server word. It gives us back a sequence of addresses:
    ( scratchpad ) 9670 internet-server .
    {
        T{ inet6 f "0:0:0:0:0:0:0:0" 9670 }
        T{ inet4 f "0.0.0.0" 9670 }
    }
  • A service name for logging purposes. We pass "daytime", which means that connections will be logged to the logs/daytime/1.log file in the Factor directory. We can then invoke rotate-logs to move 1.log to 2.log, and so on. Log analysis tools are available too. I've discussed this in a prior blog post.
  • An I/O encoding specifier for client connections. We only care about ASCII so we pass ascii.
  • Finally, a quotation which is run for each client connection, in a new thread. Our quotation just prints the current time:
    [ now timestamp>rfc822 print ]

Here is a daytime-server word, with the complete with-server form:
USING: calendar.format calendar io io.server io.encodings.ascii ;

: daytime-server ( -- )
    9670 internet-server
    "daytime"
    ascii
    [ now timestamp>rfc822 print ]
    with-server ;

We can start our server as follows:
USE: threads

[ daytime-server ] in-thread

Now, let's test it with telnet:
slava$ telnet localhost 9670
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Thu, 22 May 2008 01:32:17 -0500
Connection closed by foreign host.

Works fine. The log file logs/daytime/1.log will now contain something like the following:
[2008-05-22T01:32:17-05:00] NOTICE accepted-connection: { T{ inet6 f "0:0:0:0:0:0:0:1" 50916 } }

Now, let's write a client for this service. We'll write the client using the with-client combinator from io.sockets. It takes three parameters:
  • An address specifier. This is the server to connect to. We're going to connect to port 9670 on localhost, so we pass "localhost" 9670 <inet>.
  • An encoding specifier. Again, we're only interested in 7-bit ASCII, so we pass ascii.
  • The final parameter is a quotation. We wish to read a line of input from the server and leave it on the stack, so we pass [ readln ].

Here is the complete with-client form:
USING: io io.sockets io.encodings.ascii ;

"localhost" 9670 <inet> ascii [ readln ] with-client

We're not quite done though; we should really parse the timestamp from its RFC822 string representation back into a timestamp object. We can do this with the rfc822>timestamp word. Here is the complete daytime-client word, refactored to take a host name from the stack:
USING: calendar.format io io.sockets io.encodings.ascii ;

: daytime-client ( hostname -- timestamp )
    9670 <inet> ascii [ readln ] with-client
    rfc822>timestamp ;

Let's test our program, using the describe word to get a verbose description of the timestamp returned:
( scratchpad ) "localhost" daytime-client describe
timestamp instance
"delegate" f
"year" 2008
"month" 5
"day" 22
"hour" 1
"minute" 37
"second" 47
"gmt-offset" T{ duration f 0 0 0 -5 0 0 }

So we're done! Here is the complete program:
USING: calendar.format calendar
io io.server io.sockets
io.encodings.ascii threads ;
IN: daytime

: daytime-server ( -- )
    9670 internet-server
    "daytime"
    ascii
    [ now timestamp>rfc822 print ]
    with-server ;

: start-daytime-server ( -- )
    [ daytime-server ] in-thread ;

: daytime-client ( hostname -- timestamp )
    9670 <inet> ascii [ readln ] with-client
    rfc822>timestamp ;

You can put this in a file named work/daytime/daytime.factor in the Factor directory, then enter the following in the listener:
( scratchpad ) USE: daytime
Loading P" resource:work/daytime/daytime.factor"
( scratchpad ) start-daytime-server
( scratchpad ) "localhost" daytime-client describe

So we have a simple client/server application with support for multiple connections and logging.

Let's add SSL support! There are four things to change.

  • We need to add io.sockets.secure to our USING: list:
    USING: calendar.format calendar
    io io.server io.sockets io.sockets.secure
    io.encodings.ascii threads ;

  • We change daytime-server to accept SSL connections by replacing internet-server with secure-server; we also change the port number:
    : daytime-server ( -- )
        9671 secure-server
        "daytime"
        ascii
        [ now timestamp>rfc822 print ]
        with-server ;

    The secure-server word is much like internet-server in that it takes a port number and outputs a list of addresses, except this time the addresses are secure addresses, which means that a server socket listening on one will accept SSL connections:
    ( scratchpad ) 9671 secure-server .
    {
        T{ secure f T{ inet6 f "0:0:0:0:0:0:0:0" 9671 } }
        T{ secure f T{ inet4 f "0.0.0.0" 9671 } }
    }


  • Next, we change start-daytime-server to establish SSL configuration parameters. We need two things here: Diffie-Hellman key exchange parameters, and a certificate for the server.
    Diffie-Hellman key exchange parameters can be generated with a command-line tool:
    openssl dhparam -out dh1024.pem 1024

    There are several ways to go about obtaining a server certificate. You can either fork out some money to a CA such as VeriSign or GoDaddy (prices range from $15 to $3000), or you can generate a self-signed certificate, for example by following these directions. Once you have the two files, dh1024.pem and server.pem, place them inside work/daytime/, and modify start-daytime-server to wrap the daytime-server call with the with-secure-context combinator. This combinator takes a secure-config object from the stack, and we fill in the slots of this object with the pathnames of the two files we just generated:
    : start-daytime-server ( -- )
        [
            <secure-config>
            "resource:work/daytime/dh1024.pem" >>dh-file
            "resource:work/daytime/server.pem" >>key-file
            [ daytime-server ] with-secure-context
        ] in-thread ;

    Note that you must nest with-secure-context inside in-thread, not vice versa, because in-thread returns immediately, and with-secure-context destroys the context after its quotation returns.

  • Finally, we change daytime-client to establish SSL connections. It suffices to change <inet> to <inet> <secure>, and use the new port number we defined in daytime-server. This makes with-client establish an SSL connection:
    : daytime-client ( hostname -- timestamp )
        9671 <inet> <secure> ascii [ readln ] with-client
        rfc822>timestamp ;

    However, by default, the client will validate the server's certificate, and unless you're using a certificate signed by a root CA, this will fail.
    To disable verification, we can wrap client calls in an SSL configuration with verification disabled:
    <secure-config> f >>verify [ "localhost" daytime-client ] with-secure-context



Here is our complete daytime client/server vocabulary which uses SSL:
USING: calendar.format calendar
io io.server io.sockets io.sockets.secure
io.encodings.ascii threads ;
IN: daytime

: daytime-server ( -- )
    9671 secure-server
    "daytime"
    ascii
    [ now timestamp>rfc822 print ]
    with-server ;

: start-daytime-server ( -- )
    [
        <secure-config>
        "resource:work/daytime/dh1024.pem" >>dh-file
        "resource:work/daytime/server.pem" >>key-file
        [ daytime-server ] with-secure-context
    ] in-thread ;

: daytime-client ( hostname -- timestamp )
    9671 <inet> <secure> ascii [ readln ] with-client
    rfc822>timestamp ;

: daytime-client-no-verify ( hostname -- timestamp )
    #! Use this if your server has a self-signed certificate
    <secure-config> f >>verify
    [ daytime-client ] with-secure-context ;

That's it.
There is one more topic to cover, and that is mixed secure/insecure servers. Often you want to listen for insecure connections on one port, and secure connections on another. We can achieve this easily enough by concatenating the results of internet-server and secure-server:
( scratchpad ) 9670 internet-server 9671 secure-server append .
{
    T{ inet6 f "0:0:0:0:0:0:0:0" 9670 }
    T{ inet4 f "0.0.0.0" 9670 }
    T{ secure f T{ inet6 f "0:0:0:0:0:0:0:0" 9671 } }
    T{ secure f T{ inet4 f "0.0.0.0" 9671 } }
}

The server will now listen on all four addresses, and the quotation can differentiate between insecure and secure clients by checking whether the value of the remote-address variable is an instance of secure, using the secure? word.
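
For example, a mixed daytime server could tell each client what kind of connection it made. This is just a sketch combining the words described above (remote-address, secure?) with the kernel word ? to choose between two strings:
9670 internet-server 9671 secure-server append
"daytime"
ascii
[
    remote-address get secure? "secure" "insecure" ? print
    now timestamp>rfc822 print
] with-server
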
So what is missing? Several things:
  • As I mentioned at the beginning, SSL sockets don't yet work on Windows. This will be addressed soon.
  • Session resumption is not supported, although this is relatively easy to implement; I just wanted to get everything else working first.
  • Client-side authentication isn't supported either. Again, this is pretty easy, and I'll address it soon.
  • While it is very easy to create client and server SSL sockets (just wrap your addresses with <secure>) there is currently no way to upgrade a socket that has already been opened to an SSL socket. This is needed for SMTP/TLS support. I haven't thought of a nice API for this, as soon as I do I will implement it.
  • Better control is needed over cipher suites and certificate validation.

Tuesday, May 20, 2008

What's up with Digg.com?

I was testing Factor's http.client library with digg.com, and I was just getting timeouts.

Connecting with telnet now:
$ telnet digg.com 80
Trying 64.191.203.30...
Connected to digg.com.
Escape character is '^]'.
GET / HTTP/1.1
host: digg.com

Connection closed by foreign host.

So telnet knows the connection is being dropped, but Factor doesn't detect this. I thought this might be a bug in Factor, but with all other sites I've tried, Factor correctly detects a closed connection, whereas with Digg, read() on the socket always returns -1 with errno set to EAGAIN. I'm not even sure how telnet detects the connection is being dropped.

To make matters even more confusing, Java doesn't detect the socket was closed either; the following testcase hangs:
import java.net.*;
import java.io.*;

public class DiggTest
{
    public static void main(String[] args) throws IOException
    {
        Socket s = new Socket("digg.com",80);

        Writer w = new OutputStreamWriter(s.getOutputStream());
        Reader r = new InputStreamReader(s.getInputStream());

        w.write("GET / HTTP/1.1\r\nHost: www.digg.com\r\n\r\n");
        w.flush();

        int ch = r.read();
    }
}

It turns out a TCP RESET packet is being sent, but for whatever reason, Factor and Java don't receive any notification of it, whereas telnet does.

I'm completely stumped at this point. Anybody have an idea of what's going on?

Update: As Jason points out in the comments, adding a User-Agent makes it work. I was aware of this already, but that's not the interesting part; what puzzles me is why Factor (and Java) don't pick up on the socket being closed.

Wednesday, May 14, 2008

Improved destructors merged into core; working on SSL binding

I don't have anything earth-shattering to blog about at this stage, so this post is mostly just a disorganized progress report for the last week or so.

Last September, Doug Coleman implemented a destructors abstraction for Factor. At the start of this year, I implemented a generic resource disposal protocol. I combined these two concepts into a single vocabulary and moved it into the core.

Over the last few months, destructors have proved their worth over and over again when dealing with C libraries. Doug's post above describes them in great detail.

Here I will skim over the general concept, together with an overview of the new word names.

Suppose you need to call a C function which takes an array of three structures. The array itself can be allocated in the Factor heap as a byte array; however, the structures must be allocated in the malloc heap, because the garbage collector doesn't know about pointers inside byte arrays. So we can start by writing the following code:
FUNCTION: int our_c_function ( SHAPE* shapes ) ;

"SHAPE" malloc-object
"SHAPE" malloc-object
"SHAPE" malloc-object
3array >c-void*-array
our_c_function 0 < [ "Error return from C function" throw ] when

This is useless because it leaks memory every time it is called.

Let's keep the structures around and deallocate them when we're done:
"SHAPE" malloc-object
"SHAPE" malloc-object
"SHAPE" malloc-object
3array [
    >c-void*-array our_c_function
    0 < [ "Error return from C function" throw ] when
] keep
[ free ] each

This is still problematic. Any one of the malloc-object calls can fail, and we're throwing an exception too. The solution is to use destructors:
[
    "SHAPE" malloc-object &free
    "SHAPE" malloc-object &free
    "SHAPE" malloc-object &free
    3array >c-void*-array our_c_function
    0 < [ "Error return from C function" throw ] when
] with-destructors

The &free word schedules an allocated block of memory for deallocation when the with-destructors word returns, whether or not an error was thrown.

The more general word is &dispose, which can take any object implementing the generic resource disposal protocol; this is where destructors and disposal meet. The with-disposal combinator is now just shorthand for with-destructors:
[ X ] with-disposal == [ &dispose X ] with-destructors
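
For instance, any disposable object can be scheduled for cleanup the same way. Here is a sketch that reads the first line of a file, with the stream disposed of even if reading throws:
[
    "foo.txt" utf8 <file-reader> &dispose
    stream-readln .
] with-destructors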

Recent language changes, such as destructors, the new slot accessors, inheritance, and the thread refactoring, have really cleaned up the Windows I/O code in particular. It used to be absolutely hideous, with too much stack shuffling and repetition, but after my next set of patches are pushed out, it will be pretty close to exemplary Win32 API usage -- much cleaner than anything you'd write in C, which I think is a pretty good accomplishment, given that Win32 is designed around C. I will blog about the Windows I/O code as soon as I finish up the next set of changes.

What motivated moving destructors into the core was my work on the OpenSSL binding. The OpenSSL API is pretty complex, especially when doing non-blocking I/O, and I've had to overhaul the Unix I/O code to be able to plug it in. While overhauling the code I noticed a few places where my error handling logic was convoluted, and even a resource leak, so I used destructors to clean them up.

I will blog about the new io.sockets.secure vocabulary when the OpenSSL binding is done. For now, I can show off OpenSSL checksum support. A few weeks ago I removed some duplication between our CRC32, MD5, SHA1 and SHA2 implementations by implementing a generic checksum protocol. Since then Doug has also added an Adler32 implementation. The idea behind the protocol is that you have three words, checksum-bytes, checksum-stream, and checksum-file, each taking some kind of object and a singleton representing the checksum algorithm. For example,
"factor.image" md5 checksum-file .
=> B{ 177 55 229 87 205 67 139 188 136 33 111 38 3 0 32 23 }

However, if you try the Factor MD5 implementation you will notice it is rather slow. While it will become faster over time, OpenSSL already has a fast implementation. Thanks to the checksum protocol and some FFI fun, you can use it:
"factor.image" openssl-md5 checksum-file .
=> B{ 177 55 229 87 205 67 139 188 136 33 111 38 3 0 32 23 }

There is also openssl-sha1. In fact any checksum supported by OpenSSL's EVP_* functions can be used, as follows:
"factor.image" "SHA256" <openssl-checksum> checksum-file

The code is in checksums.openssl.

Saturday, May 10, 2008

Garbage collection throughput improvements

The problem


About a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.

Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.

It turns out that every minor GC was taking at least a few milliseconds, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and a 64 byte card size, that's a million array entries to check.

Faster card scanning


The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because it's a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);

for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
    if(*quad_ptr & quad_mask)
    {
        F_CARD *ptr = (F_CARD *)quad_ptr;

        int card;
        for(card = 0; card < 4; card++)
        {
            if(ptr[card] & mask)
            {
                collect_card(&ptr[card],gen,here);
                ptr[card] &= ~unmask;
            }
        }
    }
}

This improved card scan performance by a factor of 3, which makes sense, because we do a quarter as many loop iterations, but at some point you still have to hit memory, which dampens things out a bit.

This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked.

On the other hand, a pointer recording approach, where all stores are written into a buffer, is also unacceptable, because in code with a high rate of mutation you pay a hefty price every time the buffer overflows. So I had to find a compromise.

Introducing card decks


My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.

This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).
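
To make the new write barrier concrete, here is a sketch of what it amounts to. This is illustrative only, not the actual VM code: the card and deck granularities come from the discussion above, but the identifiers and the mask value are invented, and the offset of the heap base is ignored:
typedef unsigned long CELL;
typedef unsigned char F_CARD;

#define CARD_BITS 8          /* 256 byte cards */
#define DECK_BITS 18         /* 256 kilobyte decks, i.e. 1024 cards each */
#define CARD_MARK_MASK 0xc0  /* both mark bits; placeholder value */

F_CARD *cards; /* one byte per 256 bytes of heap */
F_CARD *decks; /* one byte per 1024 cards */

/* After the simplification, the write barrier is two blind stores;
   it never reads the previous card contents, since the allot
   markers now live in a separate array. */
void write_barrier(CELL address)
{
    cards[address >> CARD_BITS] = CARD_MARK_MASK;
    decks[address >> DECK_BITS] = CARD_MARK_MASK;
}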

These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects; on benchmarks which allocated longer-lived objects the improvements were still major, but not so drastic. It turns out that if objects survive long enough to be copied from the nursery into aging space, then this copying can begin to dominate running time.

Low-level optimizations


The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
    CELL pointer = *handle;

    if(!immediate_p(pointer) && should_copy(pointer))
        *handle = copy_object(pointer);
}

CELL collect_next(CELL scan)
{
    do_slots(scan,copy_handle);

    if(collecting_gen == TENURED)
        do_code_slots(scan);

    return scan + untagged_object_size(scan);
}

This is very elegant code; do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all; after all, it is a C compiler, not an ML compiler, and it isn't Sufficiently Smart either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which do not depend on its input parameter, only on global variables whose values are invariant for the duration of the collection:
/* test if the pointer is in the generation being collected, or a younger one. */
INLINE bool should_copy(CELL untagged)
{
    if(in_zone(newspace,untagged))
        return false;
    if(collecting_gen == TENURED)
        return true;
    else if(HAVE_AGING_P && collecting_gen == AGING)
        return !in_zone(&data_heap->generations[TENURED],untagged);
    else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
        return in_zone(&nursery,untagged);
    else
    {
        critical_error("Bug in should_copy",untagged);
        return false;
    }
}

So again, I did a bit of hand-optimization. If foo() does not depend on the loop variable or the side effects of the body, then the following are equivalent:
while(A)
{
    if(foo()) B; else C;
}

if(foo())
{
    while(A) B;
}
else
{
    while(A) C;
}

In the optimizing compiler literature, this is known as loop unswitching.

The new collect_next() is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
    CELL *obj = (CELL *)scan;
    CELL *end = (CELL *)(scan + binary_payload_start(scan));

    obj++;

    CELL newspace_start = newspace->start;
    CELL newspace_end = newspace->end;

    if(HAVE_NURSERY_P && collecting_gen == NURSERY)
    {
        CELL nursery_start = nursery.start;
        CELL nursery_end = nursery.end;

        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && (pointer >= nursery_start && pointer < nursery_end))
                *obj = copy_object(pointer);
        }
    }
    else if(HAVE_AGING_P && collecting_gen == AGING)
    {
        F_ZONE *tenured = &data_heap->generations[TENURED];

        CELL tenured_start = tenured->start;
        CELL tenured_end = tenured->end;

        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && !(pointer >= newspace_start && pointer < newspace_end)
                && !(pointer >= tenured_start && pointer < tenured_end))
                *obj = copy_object(pointer);
        }
    }
    else if(collecting_gen == TENURED)
    {
        for(; obj < end; obj++)
        {
            CELL pointer = *obj;

            if(!immediate_p(pointer)
                && !(pointer >= newspace_start && pointer < newspace_end))
                *obj = copy_object(pointer);
        }

        do_code_slots(scan);
    }
    else
        critical_error("Bug in collect_next",0);

    return scan + untagged_object_size(scan);
}

That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.

Benchmarks


The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;

: garbage-loop 150000000 [ garbage drop ] times ;

[ garbage-loop ] time

Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
                 Before    After
1mb nursery:       8908     4461
2mb nursery:       6455     4451
4mb nursery:       5336     4582
8mb nursery:       4779     4582

The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;

: garbage-loop 15000000 [ garbage drop ] times ;

[ garbage-loop ] time

The results are similar:
                 Before    After
1mb nursery:      10418     2158
2mb nursery:       6364     2178
4mb nursery:       4966     3223
8mb nursery:       4249     3294

An interesting phenomenon is that, after the GC changes, a larger nursery actually begins to hurt performance. Dan and I hypothesized that this is due to the larger nursery hurting locality, and that with the older GC, poor algorithms masked this effect.

The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;

: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;

[ garbage-loop ] time

The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
                          Before    After
1mb nursery/2mb aging:      8693     5525
2mb nursery/2mb aging:      7318     5255
4mb nursery/2mb aging:      3939     2628
8mb nursery/2mb aging:      2314     1526
1mb nursery/8mb aging:      3455     1493
4mb nursery/4mb aging:      3024     1894
4mb nursery/4mb aging:      3157     1577
8mb nursery/8mb aging:      1533     1003

The above benchmarks are somewhat unrealistic; however, other benchmarks showed speedups too, some more drastic than others:


Benchmark     Before    After
SHA1           12198     7282
Random         18989    14459
Sort           15685    13476
Raytracer      27700    12966
Mandelbrot      4633     1847

Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only a month ago that GC time began to dominate them. However, I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput under constrained-memory conditions.

Looking ahead


The next big change I will be making to the garbage collector is switching to mark/sweep for the old generation. This will reduce memory consumption by almost 50%, since a copying collector has to reserve an empty second semi-space to copy surviving objects into, whereas mark/sweep collects in place.

The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.

Monday, May 05, 2008

I/O changes, and process pipeline support

I made some improvements to the I/O system today.

Default stream variables


The stdio variable has been replaced by input-stream and output-stream, and there are four new words:
  • with-input-stream
  • with-output-stream
  • with-input-stream*
  • with-output-stream*

The first two close the stream when the quotation returns; the starred variants do not. The with-stream and with-stream* words are still around; they expect a duplex stream, unpack it, and bind both variables.
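
For example, here is a sketch using the first variant: it reads one line from a file, and the underlying stream is closed when the quotation returns:
"foo.txt" utf8 <file-reader> [ readln ] with-input-stream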

I've changed many usages of with-stream to use one of the unidirectional variants instead. This means that you can now write code like the following:
"foo.txt" utf8 [
"blah.txt" utf8 [
... read from the first file, write to the second file,
using read and write ...
] with-file-writer
] with-file-reader

Before you had to use this really ugly "design pattern":
"foo.txt" utf8 <file-reader> [
"blah.txt" utf8 <file-writer> [
<duplex-stream> [
...
] with-stream
] with-disposal
] with-disposal

Speaking of duplex streams: because they're not used by anything in the core anymore, I have moved them to extra. They are still used by <process-stream> and <client>. I added <process-reader> and <process-writer> words for those cases where you only want a pipe in one direction; they express intent better.

Pipes


The <process-stream> word has been around for a while, and it used pipes internally, but pipes themselves were not exposed in a nice, cross-platform way until now.

The io.pipes vocabulary contains two words:
  • <pipe> creates a new pipe and wraps a pair of streams around it. The streams are packaged into a single duplex stream; any data written to the stream can be read back from the same stream (presumably, in a different thread). This is actually implemented with native pipes on Unix and Windows.
  • The run-pipeline word takes a sequence of quotations or launch descriptors, and runs them all in parallel with input wired up as if it were a Unix shell pipe. For example,
    { "cat foo.txt" "grep x" "sort" "uniq" } run-pipeline

    Corresponds to the following shell command:
    cat foo.txt | grep x | sort | uniq

    In addition, being able to place process objects and quotations in the pipeline gives you a lot of expressive power.
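
For instance, a full process descriptor can stand in for a command string when one stage needs more control. A sketch (note that the sequence is built at runtime with 3array, since words like <process> are not evaluated inside a literal { } array):
"cat foo.txt"
<process> "grep x" >>command
"sort"
3array run-pipeline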

Appending process output to files


The io.launcher vocabulary supports the full array of input and output redirection features, and now, pipelines. There was one missing component: redirecting process output to a file opened for appending. Now this is possible. The following Factor code:
<process>
    "do-stuff" >>command
    "log.txt" <appender> >>stderr
run-process

Corresponds to this shell command:
do-stuff 2>> log.txt

Of course, shell script is a DSL for launching processes, so it is more concise than Factor. However, Factor's io.launcher library now supports all of the features that the shell does, and it's pretty easy to build a shell command parser using Chris Double's PEG library which translates shell commands into sequences of Factor process descriptors in a pipeline.

And now, I present a concise illustration of the difference between the Unix philosophy and the Windows philosophy. Here we have two pieces of code, which do the exact same thing: create a new pipe, open both ends, return a pair of handles.

Unix:
USING: system alien.c-types kernel unix math sequences
qualified io.unix.backend io.nonblocking ;
IN: io.unix.pipes
QUALIFIED: io.pipes

M: unix io.pipes:(pipe) ( -- pair )
    2 "int" <c-array>
    dup pipe io-error
    2 c-int-array> first2
    [ [ init-handle ] bi@ ] [ io.pipes:pipe boa ] 2bi ;


Windows:
USING: alien alien.c-types arrays destructors io io.windows libc
windows.types math.bitfields windows.kernel32 windows namespaces
kernel sequences windows.errors assocs math.parser system random
combinators accessors io.pipes io.nonblocking ;
IN: io.windows.nt.pipes

: create-named-pipe ( name -- handle )
    { PIPE_ACCESS_INBOUND FILE_FLAG_OVERLAPPED } flags
    PIPE_TYPE_BYTE
    1
    4096
    4096
    0
    security-attributes-inherit
    CreateNamedPipe
    dup win32-error=0/f
    dup add-completion
    f <win32-file> ;

: open-other-end ( name -- handle )
    GENERIC_WRITE
    { FILE_SHARE_READ FILE_SHARE_WRITE } flags
    security-attributes-inherit
    OPEN_EXISTING
    FILE_FLAG_OVERLAPPED
    f
    CreateFile
    dup win32-error=0/f
    dup add-completion
    f <win32-file> ;

: unique-pipe-name ( -- string )
    [
        "\\\\.\\pipe\\factor-" %
        pipe counter #
        "-" %
        32 random-bits #
        "-" %
        millis #
    ] "" make ;

M: winnt (pipe) ( -- pipe )
    [
        unique-pipe-name
        [ create-named-pipe dup close-later ]
        [ open-other-end dup close-later ]
        bi pipe boa
    ] with-destructors ;

The Windows API makes things difficult for no reason. Anonymous pipes do not support overlapped I/O, so you have to open a named pipe with a randomly-generated name (I'm not making this up, many other frameworks do the same thing and Microsoft even recommends this approach).

On the flip side, the nice thing about supporting both Unix and Windows is that I get to come up with true high-level abstractions that make sense, instead of getting away with thin wrappers over Unix system calls, as so many other language implementations do. For example, OCaml's idea of "high-level I/O" is some basic POSIX bindings, together with incomplete emulation of Unix semantics on Windows. Look forward to writing a page of code if you want to map a file into memory, or run a process and read its output into a string. And of course Java doesn't support pipes, I/O redirection for launching processes, or file system change monitoring at all.