Thursday, May 31, 2007

Time to move to a new version control system

I used to be pretty happy with Darcs, but now that the repository contains some 3600 patches, Darcs just doesn't scale anymore. The situation is so bad that I'm not sure I'll be able to actually push the new module system patches -- so far I have 8 rather large patches which move a lot of files around, and I'm unable to push it to my x86 box on the LAN. Also Doug has been complaining about Darcs on Windows -- sometimes it takes 10 minutes to record a patch, stuff like that.

Now I don't know if Darcs is using lousy algorithms, or if writing fast code in Haskell is too hard, or if Darcs is simply not designed for larger projects, but I don't want to put up with it anymore. Darcs was a huge improvement over CVS, but now the Factor project has outgrown it.

Anybody know of a distributed version control system which can cope with larger repositories? If I'm unable to push out my current set of changes, I might have to switch to a new VCS really soon now.

Monday, May 28, 2007

Work begins on new module system

More than 5 months ago, Eduardo Cavazos and I designed a new module system. Finally, I have started implementing it. The code is not yet in darcs, and it is a work in progress; I haven't reorganized the compiler or UI sources for the new module system, only the core, tools and help system, so I'm working with a pretty minimal system right now. However updating sources is easy, especially with a powerful editor which can perform complex search and replace, and record macros, such as jEdit.

The following description is technical and won't make much sense unless you know your way around Factor; however if you do know your way around Factor, I guarantee you will appreciate the new module system and find that it will increase your productivity.

To start with, two formerly distinct concepts -- modules and vocabularies -- have been unified. Conceptually, a vocabulary is a collection of definitions together with documentation and tests. Concretely, a vocabulary is a directory with the following structure, where foobar is the vocabulary name:

Tests and documentation is optional; note that the .facts file extension has been superseded by a -docs.factor suffix.

Vocabularies can be nested; for example, a vocabulary named would be structured as follows:

In the above example, foobar/io/io.factor would need to contain an IN: statement before defining any words. There is now a one-to-one mapping between vocabularies and source files. USING: statements load vocabularies, if necessary. In the listener, the require word can be used to load a vocabulary without adding it to the search path.

Vocabulary sources are searched for relative to a "vocabulary root path"; this is a list of prefixes which the vocabulary loader attempts one after the other. The current set is:

The resource: prefix on some of these is expanded into the directory containing the image file; the last item in the list allows one to use the module system to manage sources in the current directory.

Because dependencies between vocabularies are made explicit by the one-to-one mapping from files to sources together with USING: forms, the load.factor files which were a cornerstone of the old module system are going away, and with them, certain classes of bugs:
  • Incorrect load ordering, and forgetting to update the load.factor file during development - this was always a minor but annoying inconvenience.
  • Vocabulary name clashes -- while this has not happened to date, it was theoretically possible that two different modules would nevertheless define two vocabularies with the same name.
  • Missing REQUIRES:; with the old system, it was possible for one source file to use words from a vocabulary defined in another module which was not listed in the REQUIRES: statement of the load file. This would work as long as the required module was loaded first with an explicit call to require. In practice, many developers simply forgot to add the necessary REQUIRES: statements because they always had certain modules loaded. In the new system, this is simply impossible; since you must list vocabularies in USING: before being able to call any words they define, dependencies are always loaded automatically.

Much of the core code has been re-arranged; many words have been moved between vocabularies, and various vocabularies have been split up or merged. This was all done to ensure a one-to-one mapping between sources and vocabularies; it also gave me an excuse to clean up some of the dusty corners of Factor's core library.

The bootstrap system has changed somewhat. Boot images are minimal now; no help, tools, compiler or UI is part of the boot image. Instead, all the extra stuff is loaded in the second stage. While this is slower -- a lot of sources are loaded before anything is compiled, it is more flexible. Whereas before, building an image without the UI required one to comment out sections of boot-stage1.factor, and building an image without the help system was a major pain, now it is a matter of passing the correct command line switches. I will write more about this in a future entry.

We still allow "monkey patching", where a source file (re)defines words in another vocabulary; also, circular dependencies are handled properly. However, both monkey patching and circularity are discouraged; if a clean solution can be found without resorting to either, it is highly preferred, and Factor's developer tools are not as effective with code which uses these tricks. Neither technique will be prohibited, though; Factor is a power tool, not plastic cutlery.

Until now, Factor treated source files are essentially saved listener interactions; loading a source file had the effect of adding new definitions and replacing existing ones, just as if you had entered them in the listener. In effect, the operation of loading a source file was a function of the current image state together with the contents of the file. This simplistic approach is used by every interactive language that I'm aware of, including Lisp, Ruby, Python, etc. However, it has a number of serious flaws which only manifest themselves when one reloads code on the fly, and makes heavy changes to the system in the process. Some of the issues are:
  • Suppose you load a source file which defines a set of words, then you edit the file and remove one of the words. If you reload the source file again, the removed word is still in the image.
  • More seriously, if you move a word from one vocabulary to another and reload both into the image, the old definition is still part of the old vocabulary; other source files might pick it up, depending on search order, and this can lead to very surprising and unexpected behavior.
  • Similarly, if a method is removed and the source file reloaded, then this method will still be part of the image.

These can all be fixed by judicious use of the forget word; after reloading a file, any definitions which were removed and should no longer be part of the image can be forgotten manually. This was tedious and error-prone, but at least it was possible. However there were two other potential problems which did not have a good workaround:
  • If you load a source file which looks like this:
    : A ... ;
    : B ... A ... ;
    : C ... B ... ;

    then in the process of refactoring, accidentally move C before B:
    : A ... ;
    : C ... B ... ;
    : B ... A ... ;

    the source file will still load into the image just fine. However, when you test this file in a fresh image, you will get a parse error, because B is been defined when the definition of C is being parsed.

  • If you load a source file into the image, then remove a word definition which other words depend on, then reload the source file, there is no error. While usually I check usages before renaming or removing stuff, sometimes I'd forget; then some time later, I'd load the code into a fresh image, only to be greeted with a "word not found" error.

  • The final issue is perhaps the most subtle. If you accidentally define two words with the same name in one source file, Factor would not complain; after all, it could not distinguish between an erroneous duplicate definition and a legitimate redefinition filed in by the developer in order to fix a bug in running code.

Now, all of these problems have been completely resolved. Here is how it works. To each source file, the parser associates a list of definitions -- this includes words, methods, help articles, and so on. When a source file is being reloaded, the previous list of definitions is saved first. When the parser encounters a word which is in the previous definition list but not yet in the current definition list for the source file being loaded, it throws an error, signaling that you have an erroneous forward reference; the parser can detect when a word refers to another word which is defined later in the file, even if both words are already in the image from a previous load of the file. This error has a single "continue" restart; if you know what you're doing, you can keep loading the file. Dually, if a word is being defined which is already in the current definition list for the source file, you have a duplicate definition and this is definitely an error; there is no legitimate reason you'd define two words with the same name in one vocabulary, because the former word could never be called.

When parsing is over, the parser compares the new set of definitions against the previous one. It builds a list of definitions from the previous set which are not in the new set; these are obsolete definitions which were at one point present in the source file, but not anymore. The parser checks the cross-referencing database to see if any words use these obsolete definitions; if any words do, a warning is printed listing the obsolete definitions together with their callers. Then, the parser calls forget on each obsolete definition, removing them from the image.

Of course, this could all be simpler if the parser would just forget all definitions from a source file before reloading the source file, but this will not work; if other source files refer to these definitions, then forgetting and redefining them will not affect existing usages, which will continue to refer to the (now inaccessible to new parser instances) forgotten definitions.

Module entry points, used by the run word, are still supported, and are now associated with vocabularies. Module master help articles have no equivalent yet, but will be implemented shortly, and again will be associated with vocabularies. This means each vocabulary will have a help page now.

One last thing. In the past, we had a convention where unsafe words, words which could leave objects in an inconsistent state, and deep implementation detail were placed in vocabularies whose names had been suffixed with -internals; for example, we had a sequences-internals containing sequence access primitives which bypass bounds checks, and a math-internals with type-specific math words, such as float+, from which the generic math operations in the math vocabulary were built. This scheme clearly marked unsafe words as such, to avoid confusing beginners or allowing them to crash the VM; but still allowed experts to access them with a simple USING: statement.

With the new module system, we have a very similar convention, except instead of -internals the suffix is .private. Additionally, instead of writing
IN: foo

: A ... ;

IN: foo.private

: B ... ;

IN: foo

: C ... ;

we have a bit of syntax sugar:
IN: foo

: A ... ;


: B ... ;


: C ... ;

A fair bit of work still remains but I hope to finish the new module system, and document it, by the end of the week.

Finally, I'd like to say that I'm really, really pleased with the new parser feature for catching load order issues and definition removal. This is something I'd expect Lisp implementations to do, but I don't think any attempt it; as far as I know, there is no Lisp implementation out there which does the right thing when one reloads a source file after removing a function or method.

Saturday, May 26, 2007

Problem in Factor 0.89 binary packages for Mac OS X

A couple of people have reported that the Factor 0.89 binaries do not work on their computer. The issue was that while the bundles a copy of the FreeType library, the linker flags at build-time were incorrect, and Factor was searching for this library in /usr/X11R6/lib/. So if you do not have Apple's installed, Factor 0.89 will crash on startup. The workaround is to install

I applied a fix to darcs and a tester confirmed that Factor worked on their X11-less machine. So this won't be a problem anymore in 0.90. I don't think I'll issue new 0.89 binary packages; this is not a fatal bug, just an annoying glitch.

Tuesday, May 22, 2007

Monday, May 14, 2007

Introducing the profiler

Factor now has a simple call-counting profiler. Usage is simple. You must first enable it, which recompiles all words:
USE: profiler

Then you can use a combinator to profile quotations:
( scratchpad ) [ "hello " "world" append ] profile

Then you can view results:
( scratchpad ) profile.
profiling 1
append 1
call 1
> 2
< 2
+ 2
check-copy 2
length 4
(copy) 13

Here we see that very few actual words were called; append is compiled, and so much inlining and open-coding occurs that pretty much the only indication of any work being done is the (copy) word being invoked once per character.

Another example is reversing a string:
( scratchpad ) [ "hello world" reverse ] profile
( scratchpad ) profile.
profiling 1
> 1
< 1
+ 1
like 1
reverse 1
call 1
check-copy 1
frange? 3
column? 3
groups? 3
- 11
circular? 11
<= 12
(copy) 12
gb? 14
reversed? 14
length 18
1- 22
nth-unsafe 22
>fixnum 33

Here we see more work is done. Type checks, generic arithmetic, and so on. This is because of how reverse is implemented; it creates a virtual reversal sequence then converts this into a sequence of the same type as the input.

Of course, the profiler is a lot more useful when applied to non-trivial computations. Here the simple list produced by profile. gets rather large, so various other words, such as vocab-profile., usage-profile. and vocabs-profile. come in handy. All these words are documented in the help system, along with a general article describing the profiler itself. Grab Factor from darcs and enter "profiling" help in the UI listener.

The profiler has a long history. I believe even before CFactor got a native compiler, it had a call counting profiler for a little while. When the compiler came along, I removed it because it didn't work for compiled code. In Factor 0.89, I re-implemented it, intending to add support for compiled profiling before releasing 0.89, however I felt this would delay the release too much, so I left it undocumented and not fully functional.

Last week, I was under the impression that accurate counting was too much overhead anyway, especially with compiled code, so I scrapped the profiler and re-implemented it as a sampling profiler which took measurements every 10ms. Unfortunately, sampling was not as accurate as I would like, and sampling compiled code was complex; it would require writing code for every CPU/OS combination that Factor runs on to get the program counter from a suspended context. Not to mention this didn't work right on Mac OS X. So I went back to accurate call counting, and implemented support for profiling compiled code by compiling a simple prolog which increments a word's counter at the beginning of every word. This is why you must call enable-profiling which recompiles all words first. Except for pathological cases such as Fibonacci, the overhead is not that bad, and the implementation is simple.

So the journey is complete and Factor has a profiler now, which supports both interpreted and compiled code. Even though the profiler is pretty simple (you basically get a flat call count, not a call graph) I've already used it to optimize a few things. I will use it some more to optimize the compiler; now that enable-profiler recompiles all words, I have even more incentive to speed up compile times.

Saturday, May 12, 2007

Disabling the generational collector

Something that fell out naturally from the recent GC refactoring is supporting -generations=1. This disables the generational GC entirely, reverting to the old-fashioned Cheney collector. This has exactly one valid use case: severely memory constrained environments. In such an environment, you'd build a minimal image (something that can be done now by editing core/bootstrap/boot-stage1.factor, but there are no nice tools for this ye), and the overhead of a non-generational collector wouldn't be too bad, since your heap will be small anyway.

Garbage collector improvements

Factor uses a copying generational garbage collector. When a generation fills up, live objects (determined by scanning roots, and the remembered set; more about this later) are copied to a new semi-space, and the old semi-space is cleared. New objects are allocated in the youngest generation, known as the nursery -- a garbage collection of the nursery only is known as a "minor collection".

Until now, objects would always be promoted to an older generation; the only exception was the oldest generation. Clearly when the oldest generation fills up, there is no older generation to promote live objects to, so instead they stay in this generation.

There was a problem with this strategy. Suppose you have 3 generations. During a long-running calculation which allocates many objects, there will be many minor collections. During a minor collection, most objects in the nursery will probably not be live, and of the few live objects that remain, many will likely survive for a long time. But a certain percentage of the live objects are also short-lived, however they survive because they happen to referenced from the data stack, retain stack, or similar when the garbage collector is invoked. References to these objects are quickly lost by the running program, but they are now in the middle generation. Over time, the middle generation fills up, and the middle generation is then collected. Again, some objects in the nursery appear to be live but they will die very quickly; so now we have some short-lived objects in the oldest generation, the one that takes a long time to garbage collect!

The problem here is that objects were always promoted. Promoting objects from the nursery to the middle generation (which I call aging space) is fine, however what you really want is to keep objects in aging space not until aging space fills up, but until no further garbage collection can free up enough room in aging space for objects incoming from the nursery.

So now objects slowly accumulate in aging space, only making their way into tenured space if there is nowhere else they can go. This greatly reduced the number of full garbage collections. This means many extremely-long lived objects -- word definitions, online help documentation, usage cross-referencing, assorted system structures -- will rarely, if ever have to be scanned by the garbage collector during normal operation. Only a world-changing event, such as reloading core libraries, should unsettle things enough to cause a full garbage collection.

A full garbage collection takes 100 ms; a minor collection can finish in less than a millisecond. Yet, the change to promotion behavior described above actually reduced performance significantly. It took me a couple of days to fully figure out why.

Factor's garbage collector uses card marking to rememeber old to new pointers. When the nursery is being collected, any objects in tenured space which refer to objects from the nursery also have to be scanned. So any time an object is mutated, a "card" is marked; cards have a granularity of 64 bytes. When performing a minor garbage collection, all objects in all marked cards are scanned as roots. With the old garbage collector, full garbage collections (which clear all cards, since all objects end up in tenured space after) were frequent enough that card marking did not create any overhead at all. However if you rarely or never perform a full garbage collection, while running a mutation-heavy operation, more and more cards will get marked. I observed some minor garbage collections would end up scanning as much as 600kb of tenured space due to marked cards; that's a lot of objects to scan!

The problem was that card marking was not granular enough. It would remember the fact that a given card contained an object whose slot referred to an object in aging space or nusery space, but it didn't know which of the two it was. Consider the following scenario.

Object X lives in tenured space. A new object Y is allocated in the nursery, and X is mutated so that one of its slots now holds a reference to Y. When the nursery fills up, a minor collection is initiated, and all marked cards are scanned. Noticing that object X refers to Y, the garbage collector knows that Y is live, and promotes Y to aging space. The card remains marked; a future collection of aging space will again need to scan it. However, the object is no longer in the nursery, and nursery collections can from now ignore this card. But there was no way for the garbage collector to be aware of this fact, so future minor collections would scan this card again, and again, and again, and again, ... because full collections were so rare now, many minor collections were actually scanning thousands of cards.

The fix was to refine the card marking somewhat. A card stores two flags; one indicates that objects in the card refer to objects in the nursery, another flag indicates that objects in the card refer to aging space. When scanning a card during a minor collection, the garbage collector ignores cards which only refer to aging space, and a card which refers to the nursery has this flag cleared after objects in the card are scanned. So back to our example, after object Y makes it to aging space, assuming that X is not mutated again, then further minor collections will never touch object X. A collection of aging space will have to scan X, since it refers to Y; the card is still marked. But it is only marked as holding a reference to aging space, not the nursery.

The result is that now, minor collections only ever scan a handful of cards in tenured space -- or none at all. "Card unmarking" saves the day.

Before this change, the default number of generations was 2. The old promotion behavior never seemed to yield a speedup if more generations were used, so that was always the default. However, any number of generations could be specified on the command line using the -generations switch. Now, things are different. You can only ever elect to have 2 or 3 generations. 2 generations is still the default on the ARM architecture; Factor guesses that ARM machines are memory-constrained, so we trade space for speed there. On all other architectures, 3 generations is the default. When 3 generations are used, the middle generation exhibits the new aging behavior.

There is still one last performance regression I need to track down. It is a constant-factor slowdown which appears even when 2 generations are used. I rewrote some of the GC code in the process of implementing these improvements, and as usual I went for clarity over performance. Since the new GC strategy does not involve an increase in the amount of actual logic which must execute for every scanned object, I'm confident I can get rid of this slowdown.

Even though the new code isn't fully optimized yet, GC time is down on all benchmarks which I tried with the exception of a bootstrap together with a load of all contributed modules. Here the constant factor slowdown I described above bites. When it is fixed, I expect GC time will decrease further, and I will publish some benchmarks.

Wednesday, May 09, 2007

Mac OS X data corruption bug

Another day, another bug in the infrastructure underlying Factor.

I submitted this one to Apple. This test case demonstrates the problem; if you're doing memcpy() when there is a probability that an interval timer set by setitimer() might be fired, then you're screwed. This problem is reproducible on PowerMac G5's, but not PowerMac G4's or Intel Macs. So it might be that the signal handling code is not saving/restoring the G5's AltiVec registers correctly (memcpy() uses AltiVec).

How did I come across this bug? I'm working on a (statistical sampling) profiler. It can profile both interpreted and compiled code. Except on Mac OS X/PPC, it kills Factor with a GC assertion...

I love Mac OS X, I really do. But I wish the basic stuff like signal handlers, threads, and so on, was implemented properly.

Friday, May 04, 2007

Interval inference

Interval inference in the compiler is now complete, as of a couple of days ago. The interval code I posted earlier has ballooned into something more complex, because I realized I needed something that supports intervals which exclude or include the endpoints.

On the YUV to RGB conversion benchmark, runtime decreased from 180ms to 90ms. This is still not good enough for real time video playback; open-coded inline intrinsics for alien accessor words should go a long way toward rectifying that; this is something I plan on working on in 0.90. Other benchmarks such as appending sequences show better performance as well, because interval inference is able to remove overflow checks in many situations.

UDP support

About six months ago, Ivan Tikhonov contributed some code for UDP sockets. Unfortunately he lost interest in Factor and never finished his code, so it has been sitting in Lisppaste since then. I decided to dust it off, fix the bitrot, clean it up a bit, and add it to the core. This code implements UDP for all Unices except Solaris, where a definition of the SOCK_DGRAM constant must be provided in core/io/unix/syscalls-solaris.factor -- perhaps somebody can submit a patch for this one? Doug Coleman will deal with the Windows side, as usual.

It is worth noting that with this change, I have moved all networking words to a network vocabulary. I updated all contributed libraries and made sure they load.

The UDP words are very simple to use.

<datagram> ( port -- datagram ) creates a datagram socket bound to a specific port. Passing 0 binds to the first available port.

send ( packet host addr datagram -- ) sends a UDP packet. The packet must be given as a byte array.

receive ( datagram -- packet host addr ) waits to receive a UDP packet.

That is all. I've only tested them lightly, however Ivan's original code was known to work so I don't expect any major problems here.

Another contributor is currently working on support for Unix domain sockets and IPv6. This will round out Factor's networking capabilities quite nicely and put them on par with other languages.

Tuesday, May 01, 2007

Cat 0.12.0 released

Christopher Diggins has released Cat 0.12.0. Notably, he writes:
If all goes well I may be able to declare a version 1.0 release at the end of the month of May with full static type-checking.

I'm looking forward to 1.0.