Monday, May 10, 2010

A collection of small compiler improvements

One of the big optimizations I'm planning to implement in Factor is support for computations on machine-word-sized integers. The motivation is as follows. While code that operates on small integers (fixnums) does not currently allocate memory for intermediate values, and as a result can compile very efficiently if type checks are eliminated, sometimes fixnum precision is not quite enough. Using bignums in algorithms such as SHA1 that require 32-bit or 64-bit arithmetic incurs a big performance hit compared to writing the code in a systems language. My plan is to have the compiler support machine integers as an intermediate step between fixnums and bignums. Machine integers would be boxed and unboxed at call boundaries, but tight loops operating on them would run in registers.

While I haven't yet implemented the above optimization, I've laid the groundwork by making it possible to represent operations on untagged integers in the compiler IR. While unboxed floats do not cause any complications for the GC, unboxed integers do, because they share a register file with tagged pointers. To make Factor's precise GC work, the compiler can now distinguish registers containing tagged pointers from those containing arbitrary integer data, by breaking down the register class of integer registers into two "representations", tagged and untagged.

Since the above reworking touched many different parts of the compiler, I also took the time to implement some minor optimizations and clean up some older code. These improvements will be the topic of this post.

To understand this post better, you should probably skim the list of low-level IR instructions defined in compiler.cfg.instructions first. Also, feel free to post comments here or e-mail me questions about the design of the Factor compiler.

Improvements to value numbering

In the x86 architecture, almost any instruction can take a memory operand, and memory operands take the form (base + displacement * scale + offset), where base and displacement are registers, offset is a 32-bit signed integer, and scale is 1, 2, 4 or 8. While using a complex addressing mode is not any faster than writing out the arithmetic by hand with a temporary register used to store the final address, it can reduce register pressure and code size.
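As a sketch of what the hardware computes, the effective address of a complex memory operand can be modeled directly (the concrete register values and the 7-byte offset below are hypothetical, chosen only for illustration):

```python
def effective_address(base, displacement, scale, offset):
    # Models an x86 memory operand [base + displacement*scale + offset].
    # base and displacement stand in for register contents; offset is a
    # signed 32-bit immediate; scale must be 1, 2, 4 or 8.
    assert scale in (1, 2, 4, 8)
    return base + displacement * scale + offset

# e.g. element 5 of a float (4-byte) specialized array whose data begins
# 7 bytes past the tagged array pointer (a hypothetical layout):
addr = effective_address(0x1000, 5, 4, 7)
```

The point of the fused addressing mode is that this whole computation happens inside a single load or store instruction, freeing the temporary register that would otherwise hold the computed address.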

The compiler now makes better use of complex addressing modes. Prior to optimization, the low-level IR builder lowers specialized array access to the ##load-memory-imm and ##store-memory-imm instructions. For example, consider the following code:

USING: alien.c-types specialized-arrays ;
SPECIALIZED-ARRAY: float

: make-data ( -- seq ) 10 iota [ 3.5 + sqrt ] float-array{ } map-as ;

The inner loop consists of the following low-level IR:
##integer>float 23 14
##sqrt 24 23
##load-integer 25 2
##slot-imm 26 15 2 7
##load-integer 27 4
##mul 28 14 27
##tagged>integer 29 26
##add-imm 30 29 7
##add 31 30 28
##store-memory-imm 24 31 0 float-rep f
##load-integer 32 1
##add 33 14 32

The ##mul, ##add-imm and ##add instructions compute the address input to ##store-memory-imm. After value numbering, these three instructions have been fused with ##store-memory-imm to form ##store-memory:

##integer>float 62 57
##sqrt 63 62
##slot-imm 65 48 2 7
##tagged>integer 68 65
##store-memory 63 68 57 2 7 float-rep f
##add-imm 72 57 1

Optimistic copy propagation

While the low-level optimizer's value numbering and alias analysis passes only operate on individual basic blocks (local optimizations), the copy propagation pass is global, meaning it takes the entire procedure being compiled into account.

Eventually I will merge the local alias analysis and value numbering passes with the copy propagation pass to get an optimization known as "global value numbering with redundant load elimination". One step along this road is a rewrite that makes the copy propagation pass optimistic, in the following sense.

Copy propagation concerns itself with eliminating ##copy instructions, which correspond to assignments from one SSA value to another (new) value:

y = x

When such an instruction is processed, all usages of the value y can be replaced with the value x, in every basic block, and the definition of y deleted.
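A minimal sketch of this pass, using a renaming map in place of Factor's actual data structures (the instruction encoding here is invented for illustration):

```python
def propagate_copies(instructions):
    # instructions: ("copy", dst, src) or ("op", name, dst, srcs).
    # Each copy records dst -> src; later uses of dst are rewritten to
    # the canonical source and the copy instruction itself is dropped.
    canonical = {}

    def resolve(v):
        # Follow chains of copies to the original defining value.
        while v in canonical:
            v = canonical[v]
        return v

    out = []
    for insn in instructions:
        if insn[0] == "copy":
            _, dst, src = insn
            canonical[dst] = resolve(src)
        else:
            kind, name, dst, srcs = insn
            out.append((kind, name, dst, [resolve(s) for s in srcs]))
    return out
```

For example, propagating over `y = x; z = y + y` rewrites the addition to use `x` directly and deletes the copy.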

A simple extension of the above treats a ##phi instruction whose inputs are all equal as a copy. This optimizes cases such as:

a = ...
if(...)
{
b = a
}
c = phi(b,a)

The case of ##phi instructions where one of the inputs is carried by a back edge in the control flow graph (ie, a loop) is where the optimistic assumption comes in. Consider the following code, where the definition of x'' is carried by a back edge:

x' = 0

while(...)
{
x = phi(x',x'')
x'' = ...
}

Optimistic copy propagation is able to eliminate the phi node entirely. Switching from pessimistic to optimistic copy propagation eliminated about 10% of copy instructions. While this is not a great amount, it did reduce spilling in a few subroutines I looked at, and there is no increase in code complexity from this new approach.
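The optimistic flavor can be sketched as a fixpoint iteration: every phi is assumed eliminable until its inputs prove otherwise. This is an illustrative model, not Factor's implementation; it covers the case where the loop body's x'' turns out to be a copy of x, so the loop-carried phi collapses:

```python
def optimistic_phi_elimination(phis):
    # phis: dict of phi destination -> list of input values, where a
    # one-input phi stands in for a plain copy.  Optimistically assume
    # every phi can be eliminated, and iterate until nothing changes.
    copy_of = {}

    def resolve(v):
        while v in copy_of:
            v = copy_of[v]
        return v

    changed = True
    while changed:
        changed = False
        for dst, inputs in phis.items():
            if dst in copy_of:
                continue
            # Inputs that resolve back to dst itself (a loop-carried
            # self-reference) do not disqualify elimination.
            sources = {resolve(v) for v in inputs} - {dst}
            if len(sources) == 1:
                copy_of[dst] = sources.pop()
                changed = True
    return copy_of

# The loop from the text, with x'' a copy of x carried by the back edge:
#   x   = phi(x', x'')
#   x'' = x
phis = {"x": ["x0", "x2"], "x2": ["x"]}
result = optimistic_phi_elimination(phis)
```

A pessimistic pass gives up on the phi because x'' is not yet known when the phi is first seen; the optimistic fixpoint discovers that both resolve to x', so `result` maps both x and (transitively) x'' back to x0.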

Representation selection improvements

The low-level IR now includes three new instructions, ##load-double, ##load-float and ##load-vector. These are intended for loading unboxed floating point values and SIMD vectors into vector registers without using an integer register to temporarily hold a pointer to the boxed value.

Uses of these instructions are inserted by the representation selection pass. If the destination of a ##load-reference is used in an unboxed representation, then instead of inserting a memory load following the ##load-reference, the instruction is replaced with one of the new variants.

Loads from constant addresses directly into floating point and vector registers are only possible on x86, and not PowerPC, so the optimization is not performed on the latter architecture. On x86-32, an immediate displacement can encode any 32-bit address. On x86-64, loading from an absolute 64-bit address requires an integer register, however instruction pointer-relative addressing is supported with a 32-bit displacement. To make use of this capability, the compiler now supports a new "binary literal area" for unboxed data that compiled code can reference. The unboxed data is placed immediately following a word's machine code, allowing RIP-relative addressing to be used on x86-64.

This optimization, together with value numbering improvements, helps reduce pressure on integer registers in floating point loops.

Register allocator improvements

Once representation selection has run, each SSA value has an associated representation and register class. Each representation always belongs to one specific register class; a register class is a finite set of registers from which the register allocator is allowed to choose a register for this type of value.

Before register allocation, the SSA destruction pass transforms ##phi instructions into ##copys, and then subsequently performs live range coalescing to eliminate as many of the copies as possible.

Coalescing two SSA values from different register classes does not make sense; the compiler will not be able to generate valid machine code if a single SSA value is used as both a float and an integer, since on most CPUs the integer and floating point register files are distinct.

However, coalescing values with different representations, but the same register class, is okay. Consider the case where a double-precision float is computed, and then converted into a single precision float, and subsequently stored into memory. If the double-precision value is not used after the conversion, then the same register that held the input value can be reused to store the result of the single-precision conversion.

Previously this pass only coalesced values with identical representations; now I've generalized it to coalescing values with the same register class but possibly different representations. This reduces register pressure and spilling.

The tricky part about making it work is that the register allocator needs to know the value's representation, not just its register class, for generating correct spills and reloads. For example, a spilled value with register class float-regs can use anywhere from 4 to 16 bytes in the stack frame, depending on the specific representation: single precision float, double precision float, or SIMD vector. Clearly, if coalescing simply lost this fine-grained representation information and only retained register classes, the register allocator would not have enough information to generate valid code.

The solution is to have live range coalescing compute the equivalence classes of SSA values without actually renaming any usages to point to the canonical representative. The renaming map is made available to the register allocator. Most places in the register allocator where instruction operands are examined make use of the renaming map.
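The division of labor might be sketched like this (all names here are invented for illustration; Factor's allocator is considerably more involved):

```python
# leader: the renaming map computed by coalescing (value -> representative)
# reg_of: machine register assigned per equivalence-class leader
# rep_of: representation recorded per ORIGINAL SSA value

def operand_register(value, leader, reg_of):
    # Operand lookups go through the renaming map...
    return reg_of[leader.get(value, value)]

def spill_slot_size(value, rep_of):
    # ...but spill/reload sizes consult the original value's
    # representation: float-regs values occupy 4, 8 or 16 bytes.
    return {"float-rep": 4, "double-rep": 8, "vector-rep": 16}[rep_of[value]]
```

So a single-precision value coalesced with a double-precision value shares the double's register, yet spills as 4 bytes rather than 8, because the lookup keys off the original SSA value.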

Until the splitting phase, these equivalence classes are in one-to-one correspondence with live intervals, and each live interval has a single machine register and spill slot. However, when spill and reload decisions are made, the register allocator uses the original SSA values to look up representations.

If Factor's compiler did not use SSA form at all, there would still be a copy coalescing pass, and the register allocator could also support coalescing values with different representations, by first performing a dataflow analysis known as reaching definitions. This would propagate representation information to use sites.

Retaining SSA structure all the way until register allocation gives you the results of this reaching definitions analysis "for free". In both the SSA and non-SSA case, you still don't want definitions with multiple different representations reaching the same use site. The SSA equivalent of this would be a phi instruction whose inputs had different representations. The compiler ensures this doesn't happen by first splitting up the set of SSA values into strongly-connected components (SCCs) whose edges are phi nodes. The same representation is then assigned to each member of every SCC; if an agreeable representation is not found then it falls back on boxing and tagging all members.
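A sketch of that grouping step, modeling the phi-connected components with a union-find (simplified: the real pass works over the IR, and its agreement rules are richer than "unanimous or tagged"):

```python
def assign_representations(values, phi_edges, preferred):
    # Group values connected by phi edges, then force one representation
    # on each group, falling back to tagged pointers on disagreement.
    parent = {v: v for v in values}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in phi_edges:
        parent[find(a)] = find(b)

    groups = {}
    for v in values:
        groups.setdefault(find(v), []).append(v)

    rep_of = {}
    for members in groups.values():
        reps = {preferred[v] for v in members}
        rep = reps.pop() if len(reps) == 1 else "tagged-rep"
        for v in members:
            rep_of[v] = rep
    return rep_of
```

For `c = phi(a, b)` the edges (c,a) and (c,b) place all three values in one group: if all prefer an unboxed representation they keep it, otherwise the whole group is boxed.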

Notice that while both representation selection and SSA destruction group values into equivalence classes, the groupings do not correspond to each other and neither one is a refinement of the other. Not all copies resulting from phi nodes get coalesced away, so a single SCC may intersect multiple coalescing classes. This can happen if the first input to a phi node is live at the definition point of the second:

b = ...

if(...)
{
a = ...
foo(b)
}
/* 'c' can be coalesced with 'a' or 'b' -- but not both.
All three values belong to the same SCC and must have the same
representation */
c = phi(a,b)

On the other hand, two values from different SCCs might also get coalesced together:

a = some-double-float
/* if 'a' is not live after this instruction, then 'a' and 'b'
can share a register. They belong to different SCCs and have
different representations */
b = double-to-single-precision-float(a)

Compiler code cleanups

In addition to just adding new optimizations I've also simplified and refactored the internals of some of the passes I worked on.

One big change is that the "machine representation" used at the very end of the compilation process is gone. Previously, after register allocation had run, the control flow graph would be flattened into a list of instructions with CFG edges replaced by labels and jumps. The code generator would then iterate over this flat representation and emit machine code for each virtual instruction. However since no optimization passes ran on the flat representation, it was generated and then immediately consumed. Combining the linearization pass with the code generation pass let me eliminate about a dozen IR instructions that were specific to the flat representation (anything dealing with labels and jumps to labels). It also sped up compilation time slightly.

Value numbering's common subexpression elimination now uses a simpler and more efficient scheme for hashed lookup of expressions. New algebraic simplifications are also easier to add now, because the def-use graph is easier to traverse.

Some instructions which could be expressed in terms of others were removed, namely the ##string-nth and ##set-string-nth-fast instructions. Reading and writing characters from strings can be done by combining simpler memory operations already present in the IR.

A number of instructions were combined. All of the instructions for reading from memory in different formats, ##alien-unsigned-1, ##alien-signed-4, ##alien-double, have been merged into a single ##load-memory instruction, which has constant operands storing the C type and representation of the data being loaded. Similarly, ##set-alien-signed-8, etc have all been merged into ##store-memory. This restructuring allows optimization passes to more easily treat all memory accesses in a uniform way.
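As a toy model of the merged form (the field names are my own, loosely following the naming in compiler.cfg.instructions), a single record parameterized by constant operands can describe what previously took a dozen distinct instruction types:

```python
from dataclasses import dataclass

@dataclass
class LoadMemory:
    # One instruction covers every load the old ##alien-* family did:
    # the data format is carried as operands rather than in the opcode.
    dst: str
    base: str
    offset: int
    rep: str      # e.g. "int-rep", "float-rep", "double-rep"
    c_type: str   # e.g. "uchar", "int", "double"

# Rough correspondence with the removed instructions:
#   ##alien-unsigned-1  ->  LoadMemory(..., rep="int-rep",    c_type="uchar")
#   ##alien-signed-4    ->  LoadMemory(..., rep="int-rep",    c_type="int")
#   ##alien-double      ->  LoadMemory(..., rep="double-rep", c_type="double")
ld = LoadMemory("v63", "v68", 0, "double-rep", "double")
```

An alias analysis or value numbering pass can now match on one instruction type instead of enumerating every format-specific variant.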

A bug fix for alias analysis

While working on the above I found an interesting bug in alias analysis. It would incorrectly eliminate loads and stores in certain cases. The optimizer assumed that a pointer to a freshly-allocated object could never alias a pointer read from the heap. However the problem of course is that such a pointer could be stored in another object, and then read back in.

Here is a Factor example that demonstrates the problem. First, a couple of tuple definitions; the type declarations and initial value are important further on.

USING: typed locals ;

TUPLE: inner { value initial: 5 } ;
TUPLE: outer { value inner } ;

Now, a very contrived word which takes two instances of outer, mutates one, and reads a value out of the other:

TYPED: testcase ( x: outer y: outer -- )
inner new :> z ! Create a new instance of 'inner'
z x value<< ! Store 'z' in 'x'
y value>> :> new-z ! Load 'new-z' from 'y'
10 new-z value<< ! Store a value inside 'new-z'
z value>> ! Read a value out of 'z'
;

Note that the initial value of the value slot in the inner tuple is 5, so inner new creates a new instance holding the value 5. In the case where x and y point at the same object, new-z and z will point to the same object, and the newly-allocated object's value is set to 10. However, the compiler did not realize this, and erroneously constant-folded z value>> down to 5!

This bug never came up in actual code; the facepalm moment came while I was doing some code refactoring and noticed that the above case was not being handled.

Instruction scheduling to reduce register pressure

I'm not the only one who's been busy improving the Factor compiler. One of the co-inventors of Factor has cooked up an instruction scheduler and improved method inlining. The scheduler has been merged, and it looks like the method inlining improvements should be ready in a few days.

Some benchmarks

Here is a comparison of the April 17 build of Factor with the latest code from the Git repository. I ran a few of the language shootout benchmarks on a 2.4 GHz MacBook Pro. Factor was built in 32-bit mode, because the new optimizations are most effective on the register-starved x86.

Benchmark       Before (ms)   After (ms)
nbody-simd          415          349
fasta              2667         2485
knucleotide         182          177
regex-dna           109           86
spectral-norm      1641         1347

Friday, April 16, 2010

Factor 0.93 now available

After two months of development, Factor 0.93 is now available for download from the Factor website. A big thanks to all the contributors, testers and users. This release would not be possible without your help. A summary of the most important changes follows:

Incompatible changes:

  • Factor no longer supports NetBSD, due to limitations in that operating system. (Slava Pestov)
  • sets, hash-sets, bit-sets: these vocabularies have been redesigned around a new generic protocol for sets (Daniel Ehrenberg)
  • Strings can no longer be used to refer to C types in FFI calls; you must use C type words instead. Also, ABI specifiers are now symbols rather than strings. So for example, the following code will no longer work:
    : my-callback ( -- callback ) "int" { "int" "int" } "cdecl" [ + ] alien-callback ;
    you must now write:
    : my-callback ( -- callback ) int { int int } cdecl [ + ] alien-callback ;
    (Joe Groff)
  • The behavior of string types has changed. The char* C type is now just a bare pointer; to get the automatic conversion to and from Factor strings, use the c-string type. See the documentation for details. (Joe Groff)
  • FFI function return values which are pointers to structs are now boxed in a struct class, instead of returning a bare alien. This means that many memory>struct calls can be removed. If the return value is actually a pointer to an array of structs, use specialized arrays as before. (Joe Groff)
  • C-ENUM: now takes an additional parameter which is either the name of a new C type to define, or f. This type is aliased to int. (Erik Charlebois)
  • The stack checker now supports row polymorphic stack effects, for improved error checking. See the documentation for details. (Joe Groff)

New features:

  • Co-operative threads now use efficient context-switching primitives instead of copying stacks with continuations (Slava Pestov)
  • Added final class declarations, to prohibit classes from being subclassed. This enables a few compiler optimizations (Slava Pestov)
  • Added platform support metadata for vocabularies. Vocabulary directories can now contain a platforms.txt file listing operating system names which they can be loaded under. (Slava Pestov)
  • The deploy tool can now deploy native libraries and resource files, and supports custom icons. (Joe Groff)
  • Joint behavior of vocabularies - the new require-when word can be used to express that a vocabulary should be loaded if two existing vocabularies have been loaded (Daniel Ehrenberg)
  • Callstack overflow is now caught and reported on all platforms except for Windows and OpenBSD. (Slava Pestov)
  • The prettyprinter now has some limits set by default. This prevents out of memory errors when printing large structures, such as the global namespace. Use the without-limits combinator to disable limits. (Slava Pestov)
  • Added fastcall and thiscall ABIs on x86-32 (Joe Groff)
  • The build farm's version release process is now more automated (Slava Pestov)
Improved libraries:
  • delegate: add BROADCAST: syntax, which delegates a generic word with no outputs to an array of multiple delegates. (Joe Groff)
  • game.input: X11 support. (William Schlieper)
  • gpu: geometry shader support. (Joe Groff)
  • opengl: OpenGL 4.0 support. (Joe Groff)
New libraries:
  • bit.ly: Factor interface to the bit.ly URL shortening service. (Slava Pestov)
  • chipmunk: binding for Chipmunk 2D physics library. (Erik Charlebois)
  • cuda: binding to NVidia's CUDA API for GPU computing. (Doug Coleman, Joe Groff)
  • cursors: experimental library for iterating over collections, inspired by STL iterators. (Joe Groff)
  • elf: parsing ELF binaries. The elf.nm vocabulary is a demo which prints all symbols in an ELF binary. (Erik Charlebois)
  • images.pbm, images.pgm, images.ppm: libraries for loading PBM, PGM and PPM images (Erik Charlebois)
  • macho: parsing Mach-O binaries. (Erik Charlebois)
  • opencl: binding for the OpenCL standard interface for GPU computing. (Erik Charlebois)
  • path-finding: implementation of A* and BFS path finding algorithms. (Samuel Tardieu)
  • windows.ddk: binding for Windows hardware abstraction layer. (Erik Charlebois)

Thursday, April 08, 2010

Frame-based structured exception handling on Windows x86-64

Factor used to use vectored exception handlers, registered with AddVectoredExceptionHandler, however vectored handlers are somewhat problematic. A vectored handler is always called prior to any frame-based handlers, so Factor could end up reporting bogus exceptions if the FFI is used to call a library that uses SEH internally. This prompted me to switch to frame-based exception handlers. Unfortunately, these are considerably more complex to use, and the implementation differs between 32-bit and 64-bit Windows.

I briefly discussed frame-based SEH on 32-bit Windows in my previous post. During the switch to 64 bits, Microsoft got rid of the old frame-based SEH implementation and introduced a new, lower-overhead approach. Instead of pushing and popping exception handlers onto a linked list at runtime, the system maintains a set of function tables, where each function table stores exception handling and stack frame unwinding information.

Normally, the 64-bit Windows function tables are written into the executable by the native compiler. However, language implementations which generate code at runtime need to be able to define new function tables dynamically. This is done with the RtlAddFunctionTable() function.

It took me a while to figure out the correct way to call this function. I found that the os_windows_x86.cpp source file from Sun's HotSpot Java implementation was very helpful, and I based my code on the register_code_area() function from that file.

Factor and HotSpot only use function tables in a very simple manner, to set up an exception handler. Function tables can also be used to define stack unwinding behavior; this allows debuggers to generate backtraces, and so on. Doing that is more complicated and I don't understand how it works, so I won't attempt to discuss it here.

The RtlAddFunctionTable() function takes an array of RUNTIME_FUNCTION structures and a base address. For some unknown reason, all pointers in the structures passed to this function are 32-bit integers relative to the base address.

For a runtime compiler that does not need to perform unwinding, it suffices to map the entire code heap to one RUNTIME_FUNCTION. A RUNTIME_FUNCTION has three fields:

  • BeginAddress - the start of the function
  • EndAddress - the end of the function
  • UnwindData - a pointer to unwind data

All pointers are relative to the base address passed into RtlAddFunctionTable(). The unwind data can take various forms. For the simple case of no unwind codes and an exception handler, the following structure is used:

struct UNWIND_INFO {
    UBYTE Version:3;
    UBYTE Flags:5;
    UBYTE SizeOfProlog;
    UBYTE CountOfCodes;
    UBYTE FrameRegister:4;
    UBYTE FrameOffset:4;
    ULONG ExceptionHandler;
    ULONG ExceptionData[1];
};

The Version and Flags fields should be set to 1, the ExceptionHandler field set to a function pointer, and the rest of the fields set to 0. The exception handler pointer is, like everything else here, a 32-bit offset relative to the base address, and it must also lie within the memory range specified by the BeginAddress and EndAddress fields of the RUNTIME_FUNCTION structure. The exception handler has the same function signature as in the 32-bit SEH case:

LONG exception_handler(PEXCEPTION_RECORD e, void *frame, PCONTEXT c, void *dispatch)

In both Factor and HotSpot, the exception handler is a C function, however the RtlAddFunctionTable() API requires that it be within the bounds of the runtime code heap. To get around the restriction, both VMs allocate a small trampoline in the code heap which simply jumps to the exception handler, and use a pointer to the trampoline instead. Similarly, because the "pointers" in these structures are actually 32-bit integers, it helps to allocate the RUNTIME_FUNCTION and UNWIND_INFO in the code heap as well, to ensure that everything is within the same 4 GB memory range.

The above explanation probably didn't make much sense, so I invite you to check out the source code instead: os-windows-nt-x86.64.cpp.

Sunday, April 04, 2010

Switching call stacks on different platforms

User-space implementations of coroutines and green threads need to be able to switch the CPU's call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I'll limit my discussion to CPUs and platforms that Factor supports.

System APIs for switching call stacks

  • Windows has an API for creating and switching between contexts, which it calls "fibers". The main functions to look at are CreateFiber() and SwitchToFiber(). On Windows, by far the easiest way to switch call stacks is to just use fibers.
  • Most Unix systems have a set of functions operating on ucontexts, such as makecontext(), swapcontext() and so on. On some systems, these APIs are poorly implemented and documented.
  • The C standard library functions setjmp() and longjmp() can be (ab)used to switch contexts. The jmp_buf structure stored to by setjmp() and read from by longjmp() contains a snapshot of all registers, including the call stack pointer. Once you've allocated your own call stack, you can capture a jmp_buf with setjmp(), change the call stack pointer, and then resume execution with longjmp(). As far as I know, this is completely undocumented and unsupported with every major C compiler.
  • The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the ESP register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.

High-level libraries for switching call stacks

A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.

  • libcoroutine uses fibers on Windows, and ucontext and longjmp on Unix.
  • libcoro uses a handful of Unix-specific APIs if they're available, or falls back onto some hand-coded assembly routines. The latter works on Windows.

NetBSD/x86 limitation

There is no reliable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of libpthread. Even if your program doesn't use pthreads, it might link with a library that does, such as Glib. The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as malloc(), almost nothing will work as a result.

The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread's unique ID can just be the result of masking the call stack pointer by the maximum call stack size.
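The masking trick can be sketched in a few lines (the stack size constant is illustrative, not NetBSD's actual value):

```python
MAX_STACK_SIZE = 1 << 22   # stacks allocated on 4 MB boundaries (illustrative)

def netbsd_thread_id(stack_pointer):
    # libpthread's assumption: masking any address within a thread's
    # call stack yields the stack's base address, which doubles as a
    # unique per-thread identifier.
    return stack_pointer & ~(MAX_STACK_SIZE - 1)

# Any two addresses within the same pthread-allocated stack agree:
base = 3 * MAX_STACK_SIZE
```

The scheme breaks as soon as a program switches to a call stack it allocated itself: the masked stack pointer then names the wrong "thread", so every libpthread-wrapped call, malloc() included, misbehaves.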

Windows-specific concerns

So far, I've only worked out the details of switching call stacks on 32-bit Windows. I'll talk about 64-bit Windows in another post.

Windows has a mechanism known as Structured Exception Handling, which is a sort of scoped exception handling mechanism with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.

On Windows, simply changing the ESP register to point at another memory region is insufficient because structured exception handling must know the current call stack's beginning and end, as well as a pointer to the innermost exception handler record.

These three pieces of information are stored in a per-thread information block, known as the TIB. The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.

The TIB

The TIB is defined by the NT_TIB struct in winnt.h:


typedef struct _NT_TIB {
    struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;
    PVOID StackBase;
    PVOID StackLimit;
    PVOID SubSystemTib;
    _ANONYMOUS_UNION union {
        PVOID FiberData;
        DWORD Version;
    } DUMMYUNIONNAME;
    PVOID ArbitraryUserPointer;
    struct _NT_TIB *Self;
} NT_TIB, *PNT_TIB;

The relevant fields that must be updated when switching call stacks are ExceptionList, StackBase and StackLimit.

Note that since the x86 call stack grows down, StackLimit is the start of the call stack's memory region and StackBase is the end.

Structured exception handlers

Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the ExceptionList field of the TIB:

typedef struct _EXCEPTION_REGISTRATION_RECORD
{
    PEXCEPTION_REGISTRATION_RECORD Next;
    PEXCEPTION_DISPOSITION Handler;
} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;

The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (ExceptionList set to NULL).

If the ExceptionList field of the TIB or the Next field of an exception record point outside the call stack, then the handler in question will not be called at all.

Accessing the TIB

On x86-32, the current thread's TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The Self field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:

fetch_tib:
    mov eax,[fs:24]    ; offset 24 (0x18) is the Self field
    ret

This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.

Sample code for updating Windows-specific structures

The basis/cpu/x86/32/winnt/bootstrap.factor file defines assembly subroutines for updating the TIB when switching call stacks. These routines are invoked by the x86-32 non-optimizing compiler backend.

Tuesday, March 23, 2010

Incompatible change in Mach exception handler behavior on Mac OS X 10.6

On Mac OS X, Factor uses Mach exceptions rather than Unix signals to receive notifications of illegal memory accesses and arithmetic exceptions from the operating system. This is used to catch datastack underflows, division by zero, and the like. The code implementing this can be found in vm/mach_signal.c in the Factor repository. This file is based on code from the GNU libsigsegv project, with special permission to use it under a BSD license instead of the restrictive GPL.

It seems that as of Mac OS X 10.6, exceptions raised by child processes are now reported to the parent if the parent has an exception handler thread. This caused a problem in Factor if a child process crashed with an access violation; Factor would think the access violation occurred in Factor code, and die with an assertion failure when attempting to map the thread ID back into a Factor VM object. It seems the simplest fix is to add the following clause to the catch_exception_raise() function:

if(task != mach_task_self()) return KERN_FAILURE;

This fix should be applicable to the original libsigsegv code as well, along with other projects, such as CLisp, that use this library.

Thursday, March 11, 2010

Adding a recaptcha to the Factor pastebin

Lately the Factor pastebin has been flooded with penis enlargement spam. The pastebin had its own very weak captcha that worked well enough for a while: there was a form field that was required to be left blank, and if it was not blank validation would fail. However, spammers have started working around it so we needed something stronger. I decided to solve the problem once and for all by integrating Doug Coleman's furnace.recaptcha vocabulary into the pastebin. Doing this was surprisingly easy.

First, I changed the USING: form in webapps.pastebin to reference the furnace.recaptcha vocabulary:

USING: namespaces assocs sorting sequences kernel accessors
hashtables db.types db.tuples db combinators
calendar calendar.format math.parser math.order syndication urls
xml.writer xmode.catalog validators
html.forms
html.components
html.templates.chloe
http.server
http.server.dispatchers
http.server.redirection
http.server.responses
furnace
furnace.actions
furnace.redirection
furnace.auth
furnace.auth.login
furnace.boilerplate
furnace.recaptcha
furnace.syndication
furnace.conversations ;

Next, I changed the validate-entity word. This word is used to validate a form submission when adding a new paste or a new annotation. Instead of validating a captcha using the old simple scheme, it now calls validate-recaptcha:

: validate-entity ( -- )
    {
        { "summary" [ v-one-line ] }
        { "author" [ v-one-line ] }
        { "mode" [ v-mode ] }
        { "contents" [ v-required ] }
    } validate-params
    validate-recaptcha ;

Finally, I edited the new-paste.xml and new-annotation.xml templates to add a recaptcha component inside the t:form tag:

<tr><td colspan="2"><t:recaptcha /></td></tr>

The implementation of the furnace.recaptcha vocabulary is very straightforward and takes advantage of several features of Factor's web framework. It uses http.client to communicate with the recaptcha server, and furnace.conversations to pass the validation error between requests. Finally, it uses the CHLOE: parsing word to define the t:recaptcha tag for use in templates:

CHLOE: recaptcha drop [ render-recaptcha ] [xml-code] ;

Wednesday, February 24, 2010

Final classes and platform-specific vocabularies

Final classes

I added final classes, in the Java sense. Attempting to inherit from a final class raises an error. To declare a class as final, suffix the definition with "final", in the same manner that words can be declared "inline" or "foldable":

TUPLE: point x y z ; final
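For instance, an attempt to subclass the final class above is now rejected (a hypothetical example; the exact error report may differ):

```factor
TUPLE: point x y z ; final

! This definition raises an error, since point is declared final:
TUPLE: sphere < point radius ;
```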

The motivation for final classes may not be obvious, so I will explain the three main reasons I added this feature.

Unboxed tuple arrays used to have the caveat that if you store an instance of a subclass into a tuple array of a superclass, then the slots of the subclass would be "sliced off":

TUPLE: point-2d x y ;

TUPLE-ARRAY: point-2d

TUPLE: point-3d < point-2d z ;

SYMBOL: my-array

1 <point-2d-array> my-array set

1 2 3 point-3d boa my-array get set-first

my-array get first .
=> T{ point-2d { x 1 } { y 2 } }

This warranted a paragraph in the documentation, and vigilance on the part of the programmer. Now, tuple arrays simply enforce that the element class is final, and if it is not, an error is raised. This removes a potential source of confusion; it is always nice when written warnings in the documentation can be replaced by language features.

Joe Groff's typed library has a similar problem. This library has a feature where input and output parameters which are read-only tuples are passed by value to improve performance. This could cause the same "slicing problem" as above. Now, typed only passes final read-only tuples by value.

Finally, there was a previous mechanism for prohibiting subclassing, but it wasn't exposed as part of the syntax. It was used by the implementation of struct classes to prevent subclassing of structs. The struct class implementation now simply declares struct classes as final.

Typed words can now be declared foldable and flushable

Factor has a pretty unique feature: the user can declare words foldable (which makes them eligible for constant folding at compile time if the inputs are all literal) or flushable (which makes them eligible for dead code elimination if the output results are not used). These declarations now work with typed words.
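As a hypothetical example (the word and its types are made up for illustration), a typed word can now carry a foldable declaration just like an ordinary word, so a call with literal inputs is evaluated at compile time:

```factor
USING: typed math ;

TYPED: area ( w: fixnum h: fixnum -- a: fixnum ) * ; foldable
```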

Platform-specific vocabularies

I added a facility to declare which operating systems a vocabulary runs on. Loading a vocabulary on an unsupported platform raises an error, with a restart that lets you load it anyway if you know what you're doing. The load-all word skips vocabularies which are not supported by the current platform.

If a platforms.txt file exists in the vocabulary's directory, this file is interpreted as a newline-separated list of operating system names from the system vocabulary. This complements the existing vocabulary metadata for authors, tags, and summary.

This feature helps the build farm avoid loading code for the wrong platform. It used to be that vocabularies with the "unportable" tag set would be skipped by load-all, however this was too coarse-grained. For example, both the curses and DirectX bindings were tagged as unportable, and so the build farm was not loading or testing them on any platform. However, curses is available on Unix systems, and DirectX is available on Windows systems. With the new approach, there is an extra/curses/platforms.txt file listing unix as a supported platform, and the various DirectX vocabulary directories have platforms.txt files listing windows.
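
For example, the entire contents of extra/curses/platforms.txt is a single operating system name:

```
unix
```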