Thursday, January 31, 2008

Some things I've been working on

I've been busy lately. Here are my current projects:
  • I implemented bit-vectors, byte-vectors and float-vectors, the resizable analogues of bit-arrays, byte-arrays and float-arrays. This nicely rounds out Factor's sequences library; previously the only resizable types were strings and vectors; now every built-in array-like type has a resizable analogue. The literal syntax is simple:

    ?V{ t f t } ! bit vector
    BV{ 1 2 3 } ! byte vector
    FV{ 1.0 2.0 3.0 } ! float vector

    This is of course derived from V{ } for vectors, and ?{ } B{ } F{ } for bit, byte and float arrays, respectively.

    If you want to learn more, just look at the online help.
  • File system change notification: I have this working on Windows. When I implement it on Linux and Mac OS X I will blog about the API. This is really neat stuff, integrates with the I/O multiplexer loop so that waiting for notifications doesn't block all of Factor.
  • New representation for strings. I'm just starting this, but it shouldn't take too long. Right now, Factor strings are sequences of 16-bit characters, which is neither here nor there. It uses more memory than ASCII strings, but cannot represent every Unicode code point, since Unicode 5.0 requires 21 bits now. Totally brain-dead. The new representation is quite clever, and comes from Larceny Scheme. The idea is that strings are ASCII strings, but have an extra slot pointing to an 'auxiliary vector'. If no auxiliary vector is set, the nth character of the string is just the nth byte. If an auxiliary vector is set, then the nth character has the nth byte as the least significant 8 bits, and the most significant 13 bits come from the nth double-byte in the auxiliary vector. Storing a non-ASCII character into the string creates an auxiliary vector if necessary. This reduces space usage for ASCII strings, it can represent every Unicode code point, and for strings with high code points in them, it still uses less space than the other alternative, UTF-32.

3 comments:

Sohail Somani said...

I really like the auxiliary array idea. Is that an original idea?

Slava Pestov said...

Sohail, do you really expect me to invent anything new or original?

No way dude.

This string representation is called "record1" and it comes from Larceny Scheme: http://larceny.ccs.neu.edu/larceny-trac/wiki/StringRepresentations

They offer multiple string representations as a compile-time option. Cool stuff.

For my purposes having one representation is sufficient.

Sohail Somani said...

I just thought it was a clever idea and was curious about the thinking that led to it.

As you have proved, nothing happens in a vacuum anyway!