Factor: a practical stack language: April 2010

Friday, April 16, 2010

Factor 0.93 now available

After two months of development, Factor 0.93 is now available for download from the Factor website. A big thanks to all the contributors, testers and users. This release would not be possible without your help. A summary of the most important changes follows:

Incompatible changes:

Factor no longer supports NetBSD, due to limitations in that operating system. (Slava Pestov)
sets, hash-sets, bit-sets: these vocabularies have been redesigned around a new generic protocol for sets (Daniel Ehrenberg)
Strings can no longer be used to refer to C types in FFI calls; you must use C type words instead. Also, ABI specifiers are now symbols rather than strings. So for example, the following code will no longer work:
```
: my-callback ( -- callback ) "int" { "int" "int" } "cdecl" [ + ] alien-callback ;
```
you must now write:
```
: my-callback ( -- callback ) int { int int } cdecl [ + ] alien-callback ;
```
(Joe Groff)
The behavior of string types has changed. The char* C type is now just a bare pointer; to get the automatic conversion to and from Factor strings, use the c-string type. See the documentation for details. (Joe Groff)
FFI function return values which are pointers to structs are now boxed in a struct class, instead of returning a bare alien. This means that many memory>struct calls can be removed. If the return value is actually a pointer to an array of structs, use specialized arrays as before. (Joe Groff)
C-ENUM: now takes an additional parameter which is either the name of a new C type to define, or f. This type is aliased to int. (Erik Charlebois)
The stack checker now supports row polymorphic stack effects, for improved error checking. See the documentation for details. (Joe Groff)

New features:

Co-operative threads now use efficient context-switching primitives instead of copying stacks with continuations (Slava Pestov)
Added final class declarations, to prohibit classes from being subclassed. This enables a few compiler optimizations (Slava Pestov)
Added platform support metadata for vocabularies. Vocabulary directories can now contain a platforms.txt file listing operating system names which they can be loaded under. (Slava Pestov)
The deploy tool can now deploy native libraries and resource files, and supports custom icons. (Joe Groff)
Joint behavior of vocabularies - the new require-when word can be used to express that a vocabulary should be loaded if two existing vocabularies have been loaded (Daniel Ehrenberg)
Callstack overflow is now caught and reported on all platforms except for Windows and OpenBSD. (Slava Pestov)
The prettyprinter now has some limits set by default. This prevents out of memory errors when printing large structures, such as the global namespace. Use the without-limits combinator to disable limits. (Slava Pestov)
Added fastcall and thiscall ABIs on x86-32 (Joe Groff)
The build farm's version release process is now more automated (Slava Pestov)

Improved libraries:

delegate: add BROADCAST: syntax, which delegates a generic word with no outputs to an array of multiple delegates. (Joe Groff)
game.input: X11 support. (William Schlieper)
gpu: geometry shader support. (Joe Groff)
opengl: OpenGL 4.0 support. (Joe Groff)

New libraries:

bit.ly: Factor interface to the bit.ly URL shortening service. (Slava Pestov)
chipmunk: binding for Chipmunk 2D physics library. (Erik Charlebois)
cuda: binding to NVidia's CUDA API for GPU computing. (Doug Coleman, Joe Groff)
cursors: experimental library for iterating over collections, inspired by STL iterators. (Joe Groff)
elf: parsing ELF binaries. The elf.nm vocabulary is a demo which prints all symbols in an ELF binary. (Erik Charlebois)
images.pbm, images.pgm, images.ppm: libraries for loading PBM, PGM and PPM images (Erik Charlebois)
macho: parsing Mach-O binaries. (Erik Charlebois)
opencl: binding for the OpenCL standard interface for GPU computing. (Erik Charlebois)
path-finding: implementation of A* and BFS path finding algorithms. (Samuel Tardieu)
windows.ddk: binding for Windows hardware abstraction layer. (Erik Charlebois)

Thursday, April 08, 2010

Frame-based structured exception handling on Windows x86-64

Factor used to use vectored exception handlers, registered with AddVectoredExceptionHandler, however vectored handlers are somewhat problematic. A vectored handler is always called prior to any frame-based handlers, so Factor could end up reporting bogus exceptions if the FFI is used to call a library that uses SEH internally. This prompted me to switch to frame-based exception handlers. Unfortunately, these are considerably more complex to use, and the implementation differs between 32-bit and 64-bit Windows.

I briefly discussed frame-based SEH on 32-bit Windows in my previous post. During the switch to 64 bits, Microsoft got rid of the old frame-based SEH implementation and introduced a new, lower-overhead approach. Instead of pushing and popping exception handlers onto a linked list at runtime, the system maintains a set of function tables, where each function table stores exception handling and stack frame unwinding information.

Normally, the 64-bit Windows function tables are written into the executable by the native compiler. However, language implementations which generate code at runtime need to be able to define new function tables dynamically. This is done with the RtlAddFunctionTable() function.

It took me a while to figure out the correct way to call this function. I found the os_windows_x86.cpp source file from Sun's HotSpot Java implementation was very helpful, and I based my code on the register_code_area() function from this file.

Factor and HotSpot only use function tables in a very simple manner, to set up an exception handler. Function tables can also be used to define stack unwinding behavior; this allows debuggers to generate backtraces, and so on. Doing that is more complicated and I don't understand how it works, so I won't attempt to discuss it here.

The RtlAddFunctionTable() function takes an array of RUNTIME_FUNCTION structures and a base address. For some unknown reason, all pointers in the structures passed to this function are 32-bit integers relative to the base address.

For a runtime compiler that does not need to perform unwinding, it suffices to map the entire code heap to one RUNTIME_FUNCTION. A RUNTIME_FUNCTION has three fields:

BeginAddress - the start of the function
EndAddress - the end of the function
UnwindData - a pointer to unwind data

All pointers are relative to the base address passed into RtlAddFunctionTable(). The unwind data can take various forms. For the simple case of no unwind codes and an exception handler, the following structure is used:

struct UNWIND_INFO {
    UBYTE Version:3;
    UBYTE Flags:5;
    UBYTE SizeOfProlog;
    UBYTE CountOfCodes;
    UBYTE FrameRegister:4;
    UBYTE FrameOffset:4;
    ULONG ExceptionHandler;
    ULONG ExceptionData[1];
};

The Version and Flags fields should be set to 1, the ExceptionHandler field set to a function pointer, and the rest of the fields set to 0. The exception handler pointer must be within relative to the base address, and it must also be within the memory range specified by the BeginAddress and EndAddress fields of the RUNTIME_FUNCTION structure. The exception handler has the same function signature as in the 32-bit SEH case:

LONG exception_handler(PEXCEPTION_RECORD e, void *frame, PCONTEXT c, void *dispatch)

In both Factor and HotSpot, the exception handler is a C function, however the RtlAddFunctionTable() API requires that it be within the bounds of the runtime code heap. To get around the restriction, both VMs allocate a small trampoline in the code heap which simply jumps to the exception handler, and use a pointer to the trampoline instead. Similarly, because the "pointers" in these structures are actually 32-bit integers, it helps to allocate the RUNTIME_FUNCTION and UNWIND_INFO in the code heap as well, to ensure that everything is within the same 4Gb memory range.

The above explanation probably didn't make much sense, so I invite you to check out the source code instead: os-windows-nt-x86.64.cpp.

Sunday, April 04, 2010

Switching call stacks on different platforms

User-space implementations of coroutines and green threads need to be able to switch the CPU's call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I'll limit my discussion to CPUs and platforms that Factor supports.

System APIs for switching call stacks

Windows has an API for creating and switching between contexts, which it calls "fibers". The main functions to look at are CreateFiber() and SwitchToFiber(). On Windows, by far the easiest way to switch call stacks is to just use fibers.
Most Unix systems have a set of functions operating on ucontexts, such as makecontext(), swapcontext() and so on. On some systems, these APIs are poorly implemented and documented.
The C standard library functions setjmp() and longjmp() can be (ab)used to switch contexts. The jmpbuf structure stored to by setjmp() and read from by longjmp() contains a snapshot of all registers, including the call stack pointer. Once you've allocated your own call stack, you can capture a jmp_buf with setjmp(), change the call stack pointer, and then resume execution with longjmp(). As far as I know, this is completely undocumented and unsupported with every major C compiler.
The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the ESP register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.

High-level libraries for switching call stacks

A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.

libcoroutine uses fibers on Windows, and ucontext and longjmp on Unix.
libcoro uses a handful of Unix-specific APIs if they're available, or falls back onto some hand-coded assembly routines. The latter works on Windows.

NetBSD/x86 limitation

There is no realiable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of libpthread. Even if your program doesn't use pthreads, it might link with a library that does, such as Glib. The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as malloc(), almost nothing will work as a result.

The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread's unique ID can just be the result of masking the call stack pointer by the maximum call stack size.

Windows-specific concerns

So far, I've only worked out the details of switching call stacks on 32-bit Windows. I'll talk about 64-bit Windows in another post.

Windows has a mechanism known as Structured Exception Handling, which is a sort of scoped exception handling mechanism with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.

On Windows, simply changing the ESP register to point at another memory region is insufficient because structured exception handling must know the current call stack's beginning and end, as well as a pointer to the innermost exception handler record.

These three pieces of information are stored in a per-thread information block, known as the TIB. The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.

The TIB

The TIB is defined by the NT_TIB struct in winnt.h:

typedef struct _NT_TIB {
    struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;
    PVOID StackBase;
    PVOID StackLimit;
    PVOID SubSystemTib;
    _ANONYMOUS_UNION union {
        PVOID FiberData;
        DWORD Version;
    } DUMMYUNIONNAME;
    PVOID ArbitraryUserPointer;
    struct _NT_TIB *Self;
} NT_TIB,*PNT_TIB;

The relevant fields that must be updated when switching call stacks are ExceptionList, StackBase and StackLimit.

Note that since the x86 call stack grows down, StackLimit is the start of the call stack's memory region and StackBase is the end.

Structured exception handlers

Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the ExceptionList field of the TIB:

struct _EXCEPTION_REGISTRATION_RECORD
{
     PEXCEPTION_REGISTRATION_RECORD Next;
     PEXCEPTION_DISPOSITION Handler;
} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;

The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (ExceptionList set to NULL).

If the ExceptionList field of the TIB or the Next field of an exception record point outside the call stack, then the handler in question will not be called at all.

Accessing the TIB

On x86-32, the current thread's TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The Self field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:

fetch_tib:
    mov eax,[fs:24]
    ret

This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.

Sample code for updating Windows-specific structures

The basis/cpu/x86/32/winnt/bootstrap.factor file defines assembly subroutines for updating the TIB when switching call stacks. These routines are invoked by the x86-32 non-optimizing compiler backend:

Factor: a practical stack language