Wednesday, August 09, 2006

Reworking the compiler, again

This time I'm refactoring the lowest level of the compiler: the interface with the runtime system and code heap management. Previously, the assembler and compiler wrote compiled code directly to the code heap. This originally made the implementation simpler, but there were many problems with this approach:
  • The assembler was impossible to unit test
  • There was duplication between the code to fix up forward references in the compiler, and the code to relocate compiled blocks when loading the image in the runtime
  • Because the compiler could leave the code heap in an inconsistent state, great care had to be taken to deal with literals referenced from compiled code. In particular, they were centralized in one table which could fill up. And code GC was out of the question.

In the new design, the compiler calls a runtime primitive, passing it a bunch of vectors holding machine code and relocation information. This has simplified the Factor side of things and removed the duplication, but made the runtime somewhat more complex. However, now that the compiler no longer accesses the code heap using pointer arithmetic primitives, these primitives can be moved out of the runtime and become compiled intrinsics. Open-coding memory access, boxing, and unboxing in this way should improve the performance of C library calls, among other things. Looking further into the future, the improved code heap structure will allow me to implement call traces for errors thrown from compiled code, a garbage collector for compiled code, and compiled continuations.
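The division of labour described above can be illustrated with a minimal Python sketch. This is not Factor's runtime code: the names `CodeHeap` and `add_compiled_block`, the base address, and the relocation format (a word index paired with a target name) are all hypothetical, since the post does not describe the primitive's actual arguments. The sketch only handles references to blocks that are already installed; forward references are left out.

```python
class CodeHeap:
    """Toy code heap: a flat list of 32-bit words plus a symbol table."""

    def __init__(self, base=0x1000):
        self.base = base      # hypothetical address of the first word
        self.words = []       # heap contents, one int per 32-bit word
        self.symbols = {}     # block name -> absolute address

    def add_compiled_block(self, name, code, relocations):
        """Install `code` (a list of ints) and patch it using `relocations`.

        Each relocation is (index, target_name): the absolute address of
        `target_name` is added to the word at `index`. This layout is an
        assumption made purely for illustration.
        """
        address = self.base + 4 * len(self.words)
        self.symbols[name] = address  # register first so self-references work
        patched = list(code)          # never mutate the caller's vector
        for index, target in relocations:
            patched[index] = (patched[index] + self.symbols[target]) & 0xFFFFFFFF
        self.words.extend(patched)
        return address


heap = CodeHeap()
first = heap.add_compiled_block("first", [0x11111111, 0x22222222], [])
second = heap.add_compiled_block("second", [0, 0x33333333], [(0, "first")])
```

The point of the design is visible even in the toy version: the caller only builds plain lists, and all knowledge of addresses and patching lives in one place.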

This refactoring would also make cross-compilation possible, even though I have no plans to implement such a feature at this stage. Since we're not writing to a code heap, but simply constructing vectors of numbers, the compiled code could easily be dumped to a file.
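As a rough illustration of why a vector of numbers is easy to dump, here is a short Python sketch. The function names `words_to_bytes` and `dump_words` are hypothetical, and the choice of big-endian 32-bit words is an assumption based on the PowerPC example in this post; nothing here reflects an actual Factor file format.

```python
import struct


def words_to_bytes(words):
    """Pack a sequence of 32-bit unsigned words as big-endian bytes."""
    return struct.pack(">%dI" % len(words), *words)


def dump_words(words, path):
    """Write the packed words to `path` (hypothetical raw dump, no header)."""
    with open(path, "wb") as f:
        f.write(words_to_bytes(words))
```

A real cross-compiler would also need to write out the relocation information so the target's runtime could fix up addresses on load.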

Here is a fun example: we can now look at the machine code that an assembler sequence produces. A PowerPC snippet follows:
  [ 0 1 72 LWZ 24 3 LI ] { } make .
{ 2147549256 945815576 }

In particular, the assembler now "assembles" machine code by calling the , word, just as in traditional Forth.
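The two numbers in the output can be reproduced by hand. The following Python sketch encodes both instructions using the standard PowerPC D-form layout (6-bit opcode, two 5-bit register fields, 16-bit immediate), with `emit` standing in for the , word. The `Assembler` class and its method names are hypothetical, and the operand order (destination, base, displacement for LWZ; immediate, destination for LI) is a reading of the snippet above rather than documented Factor behaviour.

```python
def d_form(opcode, rd, ra, imm):
    """Encode a PowerPC D-form instruction: opcode(6) rD(5) rA(5) imm(16)."""
    return (opcode << 26) | (rd << 21) | (ra << 16) | (imm & 0xFFFF)


class Assembler:
    """Toy assembler that appends encoded words to a plain list."""

    def __init__(self):
        self.words = []

    def emit(self, word):
        # Plays the role of the , word: append one cell to the output.
        self.words.append(word)

    def lwz(self, rd, ra, disp):
        # lwz rD, disp(rA) has primary opcode 32.
        self.emit(d_form(32, rd, ra, disp))

    def li(self, imm, rd):
        # li rD, imm is shorthand for addi rD, 0, imm (primary opcode 14).
        self.emit(d_form(14, rd, 0, imm))


asm = Assembler()
asm.lwz(0, 1, 72)
asm.li(24, 3)
assert asm.words == [2147549256, 945815576]
```

Because the output is an ordinary list, the final assertion is exactly the kind of unit test that was impossible while the assembler wrote straight into the code heap.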
