Monday, May 28, 2007

Work begins on new module system

More than 5 months ago, Eduardo Cavazos and I designed a new module system. Finally, I have started implementing it. The code is not yet in darcs, and it is a work in progress; I haven't reorganized the compiler or UI sources for the new module system, only the core, tools and help system, so I'm working with a pretty minimal system right now. However, updating sources is easy, especially with a powerful editor such as jEdit, which can perform complex search and replace and record macros.

The following description is technical and won't make much sense unless you know your way around Factor; however if you do know your way around Factor, I guarantee you will appreciate the new module system and find that it will increase your productivity.

To start with, two formerly distinct concepts -- modules and vocabularies -- have been unified. Conceptually, a vocabulary is a collection of definitions together with documentation and tests. Concretely, a vocabulary is a directory with the following structure, where foobar is the vocabulary name:
foobar/foobar.factor
foobar/foobar-tests.factor
foobar/foobar-docs.factor

Tests and documentation are optional; note that the .facts file extension has been superseded by a -docs.factor suffix.

Vocabularies can be nested; for example, a vocabulary named foobar.io would be structured as follows:
foobar/io/io.factor
foobar/io/io-tests.factor
foobar/io/io-docs.factor

In the above example, foobar/io/io.factor would need to contain an IN: foobar.io statement before defining any words. There is now a one-to-one mapping between vocabularies and source files. USING: statements load vocabularies if necessary. In the listener, the require word can be used to load a vocabulary without adding it to the search path.

Vocabulary sources are searched for relative to a "vocabulary root path"; this is a list of prefixes which the vocabulary loader attempts one after the other. The current set is:
{
"resource:work"
"resource:core"
"resource:core/collections"
"resource:core/compiler"
"resource:libs"
"resource:libs/collections"
"resource:apps"
"resource:unmaintained"
"."
}

The resource: prefix on some of these is expanded into the directory containing the image file; the last item in the list allows one to use the module system to manage sources in the current directory.
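
The vocabulary-name-to-path mapping described above is simple enough to sketch. Here is a small Python model of it (the function name and code are illustrative only, not Factor's actual loader):

```python
# A sketch (in Python) of the vocabulary-name-to-path mapping; the
# function name is illustrative, not part of Factor's actual loader.
ROOTS = ["resource:work", "resource:core", "resource:core/collections",
         "resource:core/compiler", "resource:libs",
         "resource:libs/collections", "resource:apps",
         "resource:unmaintained", "."]

def vocab_source_path(vocab, root):
    # foobar.io under root R -> R/foobar/io/io.factor
    parts = vocab.split(".")
    return "/".join([root] + parts + [parts[-1] + ".factor"])

# The loader tries each root in order until a source file is found:
print(vocab_source_path("foobar.io", "resource:libs"))
# -> resource:libs/foobar/io/io.factor
```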

Because dependencies between vocabularies are made explicit by the one-to-one mapping from files to sources together with USING: forms, the load.factor files which were a cornerstone of the old module system are going away, and with them, certain classes of bugs:
  • Incorrect load ordering, and forgetting to update the load.factor file during development -- this was always a minor but annoying inconvenience.
  • Vocabulary name clashes -- while this has not happened to date, it was theoretically possible that two different modules would nevertheless define two vocabularies with the same name.
  • Missing REQUIRES:; with the old system, it was possible for one source file to use words from a vocabulary defined in another module which was not listed in the REQUIRES: statement of the load file. This would work as long as the required module was loaded first with an explicit call to require. In practice, many developers simply forgot to add the necessary REQUIRES: statements because they always had certain modules loaded. In the new system, this is simply impossible; since you must list vocabularies in USING: before being able to call any words they define, dependencies are always loaded automatically.


Much of the core code has been re-arranged; many words have been moved between vocabularies, and various vocabularies have been split up or merged. This was all done to ensure a one-to-one mapping between sources and vocabularies; it also gave me an excuse to clean up some of the dusty corners of Factor's core library.

The bootstrap system has changed somewhat. Boot images are minimal now; no help, tools, compiler or UI is part of the boot image. Instead, all the extra stuff is loaded in the second stage. While this is slower -- a lot of sources are loaded before anything is compiled -- it is more flexible. Whereas before, building an image without the UI required one to comment out sections of boot-stage1.factor, and building an image without the help system was a major pain, now it is a matter of passing the correct command line switches. I will write more about this in a future entry.

We still allow "monkey patching", where a source file (re)defines words in another vocabulary; also, circular dependencies are handled properly. However, both monkey patching and circularity are discouraged; if a clean solution can be found without resorting to either, it is highly preferred, and Factor's developer tools are not as effective with code which uses these tricks. Neither technique will be prohibited, though; Factor is a power tool, not plastic cutlery.

Until now, Factor treated source files as essentially saved listener interactions; loading a source file had the effect of adding new definitions and replacing existing ones, just as if you had entered them in the listener. In effect, the operation of loading a source file was a function of the current image state together with the contents of the file. This simplistic approach is used by every interactive language that I'm aware of, including Lisp, Ruby, Python, etc. However, it has a number of serious flaws which only manifest themselves when one reloads code on the fly and makes heavy changes to the system in the process. Some of the issues are:
  • Suppose you load a source file which defines a set of words, then you edit the file and remove one of the words. If you reload the source file again, the removed word is still in the image.
  • More seriously, if you move a word from one vocabulary to another and reload both into the image, the old definition is still part of the old vocabulary; other source files might pick it up, depending on search order, and this can lead to very surprising and unexpected behavior.
  • Similarly, if a method is removed and the source file reloaded, then this method will still be part of the image.
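
Since Python shares these reload semantics, the first flaw is easy to demonstrate there. In this model, the "source files" are just strings executed into a shared namespace; this is a sketch of the general problem, not Factor code:

```python
# Re-executing a changed "source file" into the same namespace never
# removes anything: definitions deleted from the source survive the
# reload, just as described above.
version_1 = """
def a(): return 1
def b(): return a() + 1
"""
version_2 = """
def a(): return 1
"""  # b has been removed from the source

namespace = {}
exec(version_1, namespace)
exec(version_2, namespace)   # "reload" the edited file...
print("b" in namespace)      # ...but the removed b is still present
# -> True
```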

These can all be fixed by judicious use of the forget word; after reloading a file, any definitions which were removed and should no longer be part of the image can be forgotten manually. This was tedious and error-prone, but at least it was possible. However, there were a few other potential problems which did not have a good workaround:
  • If you load a source file which looks like this:
    : A ... ;
    : B ... A ... ;
    : C ... B ... ;

    then in the process of refactoring, accidentally move C before B:
    : A ... ;
    : C ... B ... ;
    : B ... A ... ;

    the source file will still load into the image just fine. However, when you test this file in a fresh image, you will get a parse error, because B has not yet been defined when the definition of C is parsed.

  • If you load a source file into the image, then remove a word definition which other words depend on, then reload the source file, there is no error. While usually I check usages before renaming or removing stuff, sometimes I'd forget; then some time later, I'd load the code into a fresh image, only to be greeted with a "word not found" error.

  • The final issue is perhaps the most subtle. If you accidentally define two words with the same name in one source file, Factor would not complain; after all, it could not distinguish between an erroneous duplicate definition and a legitimate redefinition entered by the developer in order to fix a bug in running code.

Now, all of these problems have been completely resolved. Here is how it works. To each source file, the parser associates a list of definitions -- this includes words, methods, help articles, and so on. When a source file is being reloaded, the previous list of definitions is saved first. When the parser encounters a word which is in the previous definition list but not yet in the current definition list for the source file being loaded, it throws an error, signaling that you have an erroneous forward reference; the parser can detect when a word refers to another word which is defined later in the file, even if both words are already in the image from a previous load of the file. This error has a single "continue" restart; if you know what you're doing, you can keep loading the file. Dually, if a word is being defined which is already in the current definition list for the source file, you have a duplicate definition and this is definitely an error; there is no legitimate reason you'd define two words with the same name in one vocabulary, because the former word could never be called.

When parsing is over, the parser compares the new set of definitions against the previous one. It builds a list of definitions from the previous set which are not in the new set; these are obsolete definitions which were at one point present in the source file, but not anymore. The parser checks the cross-referencing database to see if any words use these obsolete definitions; if any words do, a warning is printed listing the obsolete definitions together with their callers. Then, the parser calls forget on each obsolete definition, removing them from the image.
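
The bookkeeping can be sketched as follows; this Python model uses illustrative names and data structures, not Factor's actual internals, and raises plain exceptions where the real parser offers a "continue" restart:

```python
# Model of the parser's per-file definition tracking: forward
# references and duplicates are errors; definitions missing from the
# new load are reported as obsolete and then forgotten.
class ForwardReferenceError(Exception): pass
class DuplicateDefinitionError(Exception): pass

def reparse(previous_defs, parsed_file, usages):
    """previous_defs: definitions recorded at this file's last load.
    parsed_file: (name, referenced-names) pairs in file order.
    usages: maps a definition to the definitions which call it."""
    current = []
    for name, refs in parsed_file:
        for ref in refs:
            # The reference resolves only via stale state from a
            # previous load: an erroneous forward reference.
            if ref in previous_defs and ref not in current:
                raise ForwardReferenceError(ref)
        if name in current:
            # Two definitions of one name in one file: the first
            # could never be called, so this is always an error.
            raise DuplicateDefinitionError(name)
        current.append(name)
    obsolete = [d for d in previous_defs if d not in current]
    for d in obsolete:
        callers = usages.get(d, [])
        if callers:
            print(f"warning: {d} removed but still used by {callers}")
        # here the real parser calls forget on d
    return current
```

Feeding the reordered A/C/B example from earlier through this sketch raises the forward-reference error at C, while a correctly ordered file loads and reports any dropped words as obsolete.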

Of course, this could all be simpler if the parser would just forget all definitions from a source file before reloading the source file, but this will not work; if other source files refer to these definitions, then forgetting and redefining them will not affect existing usages, which will continue to refer to the (now inaccessible to new parser instances) forgotten definitions.
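
Python's name rebinding has exactly this stale-reference property, so the problem is easy to demonstrate there (again a model, not Factor code):

```python
# Why "forget everything, then reload" would break existing callers:
# code which captured the old definition keeps calling it, and a
# fresh definition under the same name never reaches it.
namespace = {}
exec("def greeting(): return 'hello'", namespace)

# Another "source file" captured the definition when it loaded:
caller = namespace["greeting"]

# Forget and redefine, instead of redefining in place:
del namespace["greeting"]
exec("def greeting(): return 'goodbye'", namespace)

print(caller())                  # -> hello (the stale reference)
print(namespace["greeting"]())   # -> goodbye
```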

Module entry points, used by the run word, are still supported, and are now associated with vocabularies. Module master help articles have no equivalent yet, but will be implemented shortly, and again will be associated with vocabularies. This means each vocabulary will have a help page now.

One last thing. In the past, we had a convention where unsafe words, words which could leave objects in an inconsistent state, and deep implementation details were placed in vocabularies whose names were suffixed with -internals; for example, we had a sequences-internals containing sequence access primitives which bypass bounds checks, and a math-internals with type-specific math words, such as float+, from which the generic math operations in the math vocabulary were built. This scheme clearly marked unsafe words as such, to avoid confusing beginners or allowing them to crash the VM, but still allowed experts to access them with a simple USING: statement.

With the new module system, we have a very similar convention, except instead of -internals the suffix is .private. Additionally, instead of writing
IN: foo

: A ... ;

IN: foo.private

: B ... ;

IN: foo

: C ... ;

we have a bit of syntax sugar:
IN: foo

: A ... ;

<PRIVATE

: B ... ;

PRIVATE>

: C ... ;


A fair bit of work still remains but I hope to finish the new module system, and document it, by the end of the week.

Finally, I'd like to say that I'm really, really pleased with the new parser feature for catching load order issues and definition removal. This is something I'd expect Lisp implementations to do, but as far as I know, no Lisp implementation out there does the right thing when one reloads a source file after removing a function or method.

1 comment:

Anonymous said...

Unrelated comment. Perhaps take a look at the Refal R-expression data type

http://www.refal.net/english/xmlref_1.htm

This idea might not add anything to Factor, but other concepts in the language might be of interest.

A more modern implementation:

http://wiki.botik.ru/Refaldevel/RefalPlusEn?TWIKISID=a15baaa51b7be6f8930f34bd0062035b