-no-compile first yielded an image which worked fine with a 2gb data heap. And indeed, compiling something as simple as the + word would cause a crash. Now + is called a lot, so if it is miscompiled, Factor doesn't survive long enough to read another line of input in the listener. So instead I used a trick: compile + but put the compiled definition in another word. Testing didn't reveal any problems, though:
0 0 blah .
1 3 blah .
... etc, with various data types, everything worked
However, as soon as I swapped in the definition of +, Factor would instantly crash. Some further investigation revealed this:
0 0 blah 0 = .
0 0 blah 0 > .
0 0 blah class .
So this is why my earlier testing didn't reveal the problem. The sum of 0 and 0 became a corrupted bignum, which is apparently larger than 0. But 0 plus 0 should be a fixnum and not overflow to a bignum at all, much less a corrupted one.
I looked at the disassembly for + specialized to fixnum arguments; my first suspicion was some kind of signed/unsigned integer issue in the VM, but the assembly generated was identical with a 2gb heap or a typical 64mb one:
0x0516d020: lwz r3,-4(r14)
0x0516d024: lwz r4,0(r14)
0x0516d028: mtxer r0
0x0516d02c: addo. r5,r4,r3
0x0516d030: bns- 0x516d094
... code to handle overflow, if there is one ...
0x0516d094: stw r5,-4(r14)
0x0516d098: addi r14,r14,-4
Single-stepping through this code in gdb revealed that with a 2gb heap, the bns- branch was never taken, so the overflow handling code always ran, even if there was no overflow. The exact same code behaved correctly with the default heap size.
Then it hit me like a ton of bricks. The intention of this instruction is to store a zero in the XER register:
0x0516d028: mtxer r0
But it actually stores the value of the register r0! I'm sure I knew this when I was writing the code, but for some reason I didn't notice I had made a mistake. Well, some two years (!) later, the bug manifests itself.
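The effect of that one instruction can be sketched with a small model of the mtxer / addo. / bns- sequence. This is a rough illustration in Python, not Factor or VM code: the function names are made up for the example, fixnum tagging is ignored, and 0x8516d020 is a hypothetical return address chosen only because its top bit is set. It assumes the 32-bit PowerPC layout in which the summary overflow (SO) bit is the most significant bit of XER and OV is the next one.

```python
MASK32 = 0xFFFFFFFF
XER_SO = 0x80000000  # summary overflow: sticky, only cleared by writing XER
XER_OV = 0x40000000  # overflow flag, rewritten by each "o" form instruction


def to_signed(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add_with_overflow_check(a, b, r0):
    """Model of: mtxer r0 ; addo. r5,r4,r3 ; bns- no_overflow (hypothetical helper)."""
    xer = r0 & MASK32  # mtxer r0 copies the contents of r0, not a literal zero
    total = to_signed(a) + to_signed(b)
    overflowed = not (-(1 << 31) <= total < (1 << 31))
    if overflowed:
        xer |= XER_OV | XER_SO      # OV set, and SO becomes (and stays) set
    else:
        xer &= ~XER_OV & MASK32     # OV cleared, SO left exactly as it was
    cr0_so = bool(xer & XER_SO)     # the "." form copies XER[SO] into CR0
    branch_taken = not cr0_so       # bns branches only when SO is clear
    path = "no-overflow path" if branch_taken else "overflow path"
    return total & MASK32, path


print(add_with_overflow_check(0, 0, 0))           # r0 really is zero
print(add_with_overflow_check(0, 0, 0x0516D020))  # low return address in r0
print(add_with_overflow_check(0, 0, 0x8516D020))  # hypothetical high return address
```

In this model, a stale r0 is harmless as long as its top bit is clear, and sends every addition down the overflow path once it is set, whether or not the addition overflowed.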
Why didn't this bug appear earlier? The reason is simple. The Factor code generator doesn't use r0 as a general purpose scratch register, because it's not really general purpose; some instructions assume an operand of r0 means a literal zero (perhaps I thought mtxer behaves like this). So the code generator only ever uses r0 to store return addresses in the subroutine prologue/epilogue sequences. So when + was called with a pair of fixnums, some random return address was being stuffed into XER, not zero as intended! But amazingly, everything worked, until I thought to test Factor with a larger data heap. The reason it worked is also simple: the code heap is mapped in directly after the data heap, so with the default heap size, return addresses were small enough that the high bits of XER, where the overflow flags live, stayed clear. Once the data heap was large enough, the code heap was mapped high enough that storing a return address into XER had the effect of setting the summary overflow bit, so the following bns instruction never took the branch and the overflow code always ran.
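To make the address arithmetic concrete, here is a small Python sketch of how the data heap size decides which XER flags a code heap address would set. The base address 0x01000000 and the helper names are assumptions made up for the illustration; the real mapping address is not given here. Only the rule that the code heap follows the data heap comes from the description above, and the flag positions assume the 32-bit PowerPC XER layout (SO, OV, CA in the top three bits).

```python
MB = 1 << 20

# Top three bits of the 32-bit XER register.
XER_FLAGS = {"SO": 0x80000000, "OV": 0x40000000, "CA": 0x20000000}


def xer_flags(value):
    """Names of the XER flag bits that are set in a 32-bit value."""
    return [name for name, bit in XER_FLAGS.items() if value & bit]


def code_heap_start(data_heap_base, data_heap_size):
    """The code heap is assumed to begin right where the data heap ends."""
    return data_heap_base + data_heap_size


BASE = 0x01000000  # hypothetical data heap base address

small = code_heap_start(BASE, 64 * MB)    # typical 64mb data heap
large = code_heap_start(BASE, 2048 * MB)  # 2gb data heap

print(hex(small), xer_flags(small))
print(hex(large), xer_flags(large))
```

Under these assumptions the 64mb layout puts code addresses well below the flag bits, while the 2gb layout pushes every code address, and therefore every return address, past 0x80000000.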
What an embarrassing bug! I hope the compiler implementation Gods can forgive me for forgetting to initialize a register.