Sunday, January 10, 2010

Factor's bootstrap process explained

Separation of concerns between Factor VM and library code

The Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, whose contents become the data and code heaps. It then begins executing code in the image, by calling a special startup quotation.

When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code. The new data and code heaps can then be saved into another image file, for faster loading in the future.

Factor's core data structures, object system, and source parser are implemented in Factor and live in the image, so the Factor VM does not have enough machinery to start with an empty data and code heap and parse Factor source files by itself. Instead, the VM needs to start from a data and code heap that already contains enough Factor code to parse source files. This poses a chicken-and-egg problem: how do you build a Factor system from source code? The VM can be compiled with a C++ compiler, but the result is not sufficient by itself.

Some image-based language systems cannot generate new images from scratch at all, and the only way to create a new image is to snapshot an existing session. This is the simplest approach but it has serious downsides -- it lacks determinism and reproducibility, and it is difficult to make big changes to the system.

While Factor can snapshot the current execution state into an image, it also has a tool to generate a "fresh" image from source. Although this tool is written in Factor and runs inside an existing Factor system, the resulting images depend as little as possible on the particular state of the system that generated them.

Stage 1 bootstrap: creating a boot image

The initial data heap comes from a boot image, which is built from an existing Factor system, known as the host. The result is a new boot image which can run in the target system. Boot images are created using the bootstrap.image tool, whose main entry point is the make-image word. This word can be invoked from the listener in a running Factor instance:
"x86.32" make-image

The make-image word parses source files using the host's parser, and the code in those source files forms the target image. This tool can be thought of as a form of cross-compiler, except boot images only contain a data heap, and not a code heap. The code heap is generated on the target by the VM, and later by the target's optimizing compiler in Factor.

The make-image word runs the source file core/bootstrap/stage1.factor, which kicks off the bootstrap process.

Building the embryo for the new image

The global namespace is the most important object stored in an image file. The global namespace contains various global variables that are used by the parser, among them the dictionary. The dictionary is a hashtable mapping vocabulary names to vocabulary objects. Vocabulary objects contain various bits of state, among them a hashtable mapping word names to word objects. Word objects store their definition as a quotation. The dictionary is how code is represented in memory in Factor; it is built and modified by loading source files from disk.
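The nesting described above can be pictured with a toy model. This is a sketch in Python, not Factor's actual implementation: plain dicts and dataclasses stand in for Factor's hashtables, vocabulary objects, and word objects, and the helper names `define` and `lookup` are hypothetical.

```python
# Toy model only: Python dicts and classes stand in for Factor's
# hashtables, vocabulary objects and word objects.
from dataclasses import dataclass, field


@dataclass
class Word:
    name: str
    definition: list  # stands in for a quotation


@dataclass
class Vocabulary:
    name: str
    words: dict = field(default_factory=dict)  # word name -> Word


def define(dictionary, vocab_name, word_name, definition):
    """Create the vocabulary on demand and store a word in it (hypothetical helper)."""
    vocab = dictionary.setdefault(vocab_name, Vocabulary(vocab_name))
    vocab.words[word_name] = Word(word_name, definition)
    return vocab.words[word_name]


def lookup(dictionary, vocab_name, word_name):
    """Two hashtable lookups: vocabulary name first, then word name."""
    return dictionary[vocab_name].words[word_name]


# The global namespace holds the dictionary among its other variables.
global_namespace = {"dictionary": {}}
define(global_namespace["dictionary"], "math", "sq", ["dup", "*"])
```

Looking a word up is then two hashtable lookups, and loading a source file amounts to a series of `define` calls that mutate this structure.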

One of the tasks performed by stage1.factor is to read the source file core/bootstrap/primitives.factor. This source file creates a minimal global namespace and dictionary that target code can be loaded into. This initial dictionary consists of primitive words corresponding to all primitives implemented in the VM, along with some initial state for the object system, consisting of built-in classes such as array. The code in this file runs in the host, but it constructs objects that will ultimately end up in the boot image of the target.

A second piece of code that runs in order to prepare the environment for the target is the CPU-specific backend for the VM's non-optimizing compiler. Again, these are source files which run on the host.

The non-optimizing compiler does little more than glue chunks of machine code together, so the backends are relatively simple and consist of several dozen short machine code definitions. These machine code chunks are stored as byte arrays, constructed by Factor's x86 and PowerPC assemblers.
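To make "gluing chunks of machine code together" concrete, here is a toy sketch in Python. The template names and byte values are placeholders invented for illustration; they are not real machine code and not the VM's actual template set.

```python
# Toy model only: the byte strings are placeholders, not real machine code,
# and the template names are hypothetical.
TEMPLATES = {
    "prolog": b"\x01",
    "push-literal": b"\x02",
    "call-word": b"\x03",
    "epilog": b"\x04",
}


def compile_quotation(quotation):
    """Glue one template per element between a prolog and an epilog."""
    chunks = [TEMPLATES["prolog"]]
    for element in quotation:
        # Strings model words to call; anything else is a literal to push.
        kind = "call-word" if isinstance(element, str) else "push-literal"
        chunks.append(TEMPLATES[kind])
    chunks.append(TEMPLATES["epilog"])
    return b"".join(chunks)


code = compile_quotation([2, 3, "+"])
```

The point of the sketch is that no analysis or optimization happens: each element of the quotation selects a prebuilt chunk, and the chunks are concatenated in order.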

Loading core source files

Once the initial global environment consisting of primitives and built-in classes has been prepared, source files comprising the core library are loaded in. From this point on, code read from disk does not run in the host, only in the target. The host's parser is still being used, though.

Factor's vocabulary system loads dependencies automatically, so stage1.factor simply calls require on a few essential vocabularies which end up pulling in everything in the core vocabulary root.

During normal operation, any source code at the top level of a source file, not in any definition, is run when the source file is loaded. During stage1 bootstrap, top-level forms from source files in core are not run on the host. Instead, they need to be run on the target, when the VM is launched with the new boot image.

After loading all source files from core, the startup quotation is constructed. The startup quotation begins by calling top-level forms in core source files in the order in which they were loaded, and then runs basis/bootstrap/stage2.factor.
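The deferral of top-level forms can be sketched as follows. This is a toy model in Python, with callables standing in for top-level forms and quotations; the function names are hypothetical and the file names are only labels.

```python
# Toy model only: Python callables stand in for top-level forms and quotations.
log = []
deferred_forms = []  # (file name, top-level form), in load order


def load_core_file(name, top_level_form):
    """Stage 1: record the file's top-level form instead of running it."""
    deferred_forms.append((name, top_level_form))


def make_startup_quotation(forms, run_stage2):
    """Build a callable that runs every deferred form, then stage 2."""
    forms = list(forms)  # snapshot the load order

    def startup():
        for _name, form in forms:
            form()
        run_stage2()

    return startup


load_core_file("kernel", lambda: log.append("init kernel"))
load_core_file("sequences", lambda: log.append("init sequences"))
startup = make_startup_quotation(deferred_forms, lambda: log.append("stage2"))

ran_on_host = list(log)  # still empty: nothing has run on the "host"
startup()                # later, on the "target"
```

Nothing is executed while the files are being loaded; the forms run only when the startup quotation is called, in load order, with stage 2 last.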

Serializing the result

At this point, stage1 bootstrap has constructed a new global namespace consisting of a dictionary, object system meta-data, and other objects, together with a startup quotation which can kick off the next stage of bootstrap.

Data heap objects that the VM needs to know about, such as the global namespace, startup quotation, and non-optimizing compiler definitions, are stored in an array of "special objects". Entries are defined in vm/objects.hpp, and in the image file they are stored in the image header.

This object graph, rooted at the special objects array, is now serialized to disk into an image file. The bootstrap image generator serializes objects in the same format in which they are stored in the VM's heap, but it does this without dumping the VM's memory directly. This allows object layouts to be changed relatively easily, by first updating the bootstrap image tool, generating an image with the new layouts, then updating the VM and running the new VM with the new image.

The bootstrap image generator also takes care to write the resulting data with the correct cell size and endianness. Along with containing CPU-specific machine code templates for the non-optimizing compiler, this is what makes boot images platform-specific.
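As a small illustration of why cell size and endianness make the output platform-specific, the sketch below encodes the same value three ways using Python's `struct` module. The function name `write_cell` and the variable names are hypothetical, and the real image format involves much more than raw cells.

```python
import struct


def write_cell(value, cell_size, big_endian):
    """Encode one unsigned cell with an explicit size and byte order."""
    order = ">" if big_endian else "<"
    width = {4: "I", 8: "Q"}[cell_size]  # 32-bit or 64-bit unsigned
    return struct.pack(order + width, value)


# The same value, written for three hypothetical targets.
little_32 = write_cell(0x1234, cell_size=4, big_endian=False)
little_64 = write_cell(0x1234, cell_size=8, big_endian=False)
big_32 = write_cell(0x1234, cell_size=4, big_endian=True)
```

The three results differ in length or byte order, so an image written for one combination cannot be read correctly by a VM built for another.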

Stage 2 bootstrap: fleshing out a development image

At this point, the host system has written a boot image file to disk, and the next stage of bootstrap can begin. This stage runs on the target, and is initiated by starting the Factor VM with the new boot image:
./factor -i=boot.x86.32.image
The VM reads the new image into an empty data heap. At this point, it also notices that the boot image does not have a code heap, so it cannot start executing the startup quotation just yet.

Early initialization

Boot images have a special flag set in them which kicks off the "early init" process in the VM. This only takes a few seconds, and entails compiling all words in the image with the non-optimizing compiler. Once this is done, the VM can call the startup quotation. Quotations are also compiled by the non-optimizing compiler the first time they are called.
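The difference between eager compilation of words and lazy compilation of quotations can be sketched in a few lines of Python. This is a toy model: "compiling" is simulated by appending to a log, and the class and function names are hypothetical.

```python
# Toy model only: "compiling" is simulated by appending to a log.
compiled_log = []


class Quotation:
    def __init__(self, name):
        self.name = name
        self.compiled = False

    def call(self):
        # Quotations are compiled lazily, on their first call only.
        if not self.compiled:
            compiled_log.append("quotation " + self.name)
            self.compiled = True


def early_init(words, startup_quotation):
    """Compile every word up front, then call the startup quotation."""
    for word in words:
        compiled_log.append("word " + word)
    startup_quotation.call()


startup = Quotation("startup")
early_init(["dup", "swap"], startup)
startup.call()  # second call: already compiled, nothing more is logged
```

Words are all compiled before any code runs, while a quotation pays its compilation cost once, at the moment it is first called.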

This startup quotation was constructed during stage1 bootstrap. It runs top-level forms in core source files, then runs basis/bootstrap/stage2.factor.

Loading major subsystems

Whereas the goal of stage1 bootstrap is to generate a minimal image that contains barely enough code to be able to load additional source files, stage2 creates a usable development image containing the optimizing compiler, documentation, UI tools, and everything else that makes it into a recognizable Factor system.

The major vocabularies loaded by stage2 bootstrap include:

Finishing up

The last step taken by stage2 bootstrap is to install a new startup quotation. This startup quotation does the usual command-line processing; if no switches are specified, it starts the UI listener, otherwise it runs a source file or vocabulary given on the command line.

Once the new startup quotation has been installed, the current session is saved to a new image file using the save-image-and-exit word.
