Wednesday, December 07, 2005

AMD64 compiler backend in progress

For the last week or so I've been working on an AMD64 compiler backend. Since the AMD64 architecture is so close to x86, my approach has been to refactor the existing x86 backend into a state where it can support both 32- and 64-bit code generation. So far, this has been a huge success -- if you look at the AMD64 directory in CVS, you will notice it contains a trivial amount of code: namely, an architecture description file and an assembler extension file. This is in sharp contrast to SBCL, GCC, and others, which implemented AMD64 support by making a copy of the x86 source and changing bits and pieces as required, resulting in a massive amount of code duplication.

The first thing I did was add support for the 64-bit instruction set extensions to Factor's x86 assembler. Traditional x86 has 8 directly-addressable registers: EAX, ECX, EDX, EBX, ESI, EDI, EBP, and ESP. The instruction encoding has a byte known as the "MOD-R/M" byte. It reserves 3 bits each for the source and destination registers, with the remaining 2 bits indicating the addressing mode of the destination -- this includes indirect addressing (MOV [EAX],ECX stores the contents of ECX at the address stored in EAX), indirect addressing with a displacement (MOV [ESI+4],EAX), and other more esoteric variations (the so-called scale/index/displacement mode). If the source, rather than the destination, uses a complicated addressing mode, then the direction bit in the opcode byte is set, effectively reversing the roles of the fields of the MOD-R/M byte. This explains why you cannot do memory-to-memory moves on x86; MOV [EAX],[EDX] is not valid since only one of the two operands can have a complex addressing mode.
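
To make the encoding concrete, here is a minimal sketch in Python (not the Factor assembler itself -- the names are mine) of how the register-to-memory and memory-to-register MOV forms above are built from an opcode byte plus a MOD-R/M byte. It ignores the special cases where the r/m field selects ESP or EBP, which pull in a SIB byte or a displacement.

# Standard x86 register numbers used in the MOD-R/M byte.
REGS = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3,
        "ESP": 4, "EBP": 5, "ESI": 6, "EDI": 7}

def mod_rm(mod, reg, rm):
    # mod (2 bits) selects the addressing mode of the r/m operand;
    # reg and rm (3 bits each) select the two registers.
    return (mod << 6) | (reg << 3) | rm

def mov_store(dst, src):
    # MOV [dst],src -- opcode 89, mod=00 means the r/m operand is indirect.
    return bytes([0x89, mod_rm(0b00, REGS[src], REGS[dst])])

def mov_load(dst, src):
    # MOV dst,[src] -- opcode 8b; the direction bit is flipped, so the
    # roles of the reg and r/m fields are reversed.
    return bytes([0x8B, mod_rm(0b00, REGS[dst], REGS[src])])

print(mov_store("EAX", "ECX").hex())  # 8908 -- MOV [EAX],ECX
print(mov_load("ECX", "EAX").hex())   # 8b08 -- MOV ECX,[EAX]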

The AMD64 architecture extends the 8 traditional x86 registers to 64 bits; the 64-bit registers have an R prefix rather than E (x86 old-timers will recall the E prefix was introduced during the move from 16 to 32 bits; in 16-bit mode, AX, CX, DX, etc. are your 8 16-bit registers). AMD64 also introduces 8 new registers, numbered R8 to R15. In typical x86 fashion, these new registers are shoehorned into the instruction set in a pretty bizarre way. A special prefix is used to indicate that an instruction uses 64-bit operands. Since the MOD-R/M byte can only encode 8 distinct registers but AMD64 has 16, there are two bits in the prefix to indicate which of the two operands are extended registers.

This sounds rather convoluted; however, the changes required to the assembler were minor. Basically there is a set of 3 words that compute the relevant prefix (or no prefix at all, if the operands are 32-bit registers, or if the instruction is one of the small set that does not require a prefix despite having 64-bit operands). Of course, the assembler works exactly as before if no 64-bit registers or instructions are used.
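
For illustration, here is roughly what such a prefix computation amounts to, written as a Python sketch rather than as the actual Factor words (the function and flag names are mine). The prefix byte, known as REX, has a fixed high nibble and four flag bits.

def rex_prefix(operand_64, reg_ext, rm_ext, index_ext=False):
    # 0100WRXB: W selects 64-bit operand size, R extends the reg field of
    # the MOD-R/M byte, X extends the SIB index field, and B extends the
    # r/m field, giving access to R8-R15.
    rex = 0x40
    if operand_64: rex |= 0b1000   # REX.W
    if reg_ext:    rex |= 0b0100   # REX.R
    if index_ext:  rex |= 0b0010   # REX.X
    if rm_ext:     rex |= 0b0001   # REX.B
    # With no flags set, the prefix can simply be omitted.
    return b"" if rex == 0x40 else bytes([rex])

# MOV RCX,R9 -- 64-bit operands, and the source (reg field) is an
# extended register:
print(rex_prefix(True, True, False).hex())   # 4c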

One aspect of the AMD64 architecture which is somewhat annoying is that most instructions still take 32-bit immediates. In fact, the only instruction that can take a 64-bit immediate is the simple form of MOV, where the destination is a register. So, in effect:
MOV RAX,1234123412341234    ;; b8 variant permits a 64-bit constant -- valid
MOV [RAX],1234123412341234  ;; c7 variant only permits a 32-bit constant -- INVALID

This means, among other things, that a JMP to an address further than 2^31 bytes away from the instruction pointer has to be encoded as two instructions: one to load the address into a register, and one to jump indirectly through it (recall that on x86, jump and branch instructions take a relative address, hence "further away").
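
As a rough sketch of that decision (in Python, with the scratch register RAX picked arbitrarily -- this is not Factor's actual code generator):

import struct

def jmp_to(target, next_instruction_address):
    # The direct JMP (e9) takes a signed 32-bit displacement relative to
    # the next instruction, so it only reaches +/- 2^31 bytes.
    displacement = target - next_instruction_address
    if -2**31 <= displacement < 2**31:
        return bytes([0xE9]) + struct.pack("<i", displacement)
    # Otherwise, load the absolute address and jump through a register.
    return (bytes([0x48, 0xB8]) + struct.pack("<Q", target)   # MOV RAX,target
            + bytes([0xFF, 0xE0]))                            # JMP RAX

A real assembler also has to cope with the fact that the two forms have different lengths, which changes where the next instruction starts.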

Once I had the assembler working, and various issues such as the above ironed out, I proceeded to refactor the generators for the various linear IR nodes (VOPs). The old x86 backend had a lot of hard-coded registers. There were two reasons for this:
  • x86 weirdness. Some instructions, such as those for multiplication and division, use implicit operands. For example, IMUL takes only one operand, and multiplies it by the value in EAX (RAX in 64-bit mode).
  • Scratch registers. My compiler framework had no provision for scratch registers in VOPs; I would just pull an unused register out of thin air and use it. Of course, this had a huge problem: the VOP's implementation only worked if the VOP's inputs and outputs were tied to certain registers.

I solved the first problem by defining words that push an assembler register symbol on the stack; the implementations of the various VOPs call these words instead of referencing the registers directly. Each such word is then defined differently in the x86 and AMD64 architecture description files.
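
A rough Python analogue of this first fix (the real thing is a handful of Factor words defined per architecture; the class and name choices here are mine):

class X86:
    accumulator = "EAX"    # the implicit operand of one-operand IMUL

class AMD64:
    accumulator = "RAX"    # same role, but the 64-bit register

def multiply_vop(arch, a, b):
    # The generator asks the architecture description which register IMUL
    # implicitly uses, instead of hard-coding EAX.
    acc = arch.accumulator
    # Low half of the product lands in acc (the high half in EDX/RDX).
    return ["MOV %s,%s" % (acc, a), "IMUL %s" % b]

print(multiply_vop(X86, "EBX", "ECX"))    # ['MOV EAX,EBX', 'IMUL ECX']
print(multiply_vop(AMD64, "RBX", "RCX"))  # ['MOV RAX,RBX', 'IMUL RCX']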

The second problem was solved by a new scratch register framework. Calling it a framework is a bit of a stretch, since it only involves 3 new words. Basically, when a VOP is being generated, the compiler builds a sequence of all registers the VOP takes as input and output. It then takes the list of all scratch registers provided by the architecture and removes the ones referenced by the VOP. The result is a set of scratch registers the implementation of the VOP can use for whatever purpose.
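
The selection itself is just a set difference; a minimal Python sketch of the idea (the register lists here are made up for illustration):

def scratch_registers(arch_scratch_regs, vop_inputs, vop_outputs):
    # Whatever the architecture offers as scratch space, minus anything the
    # VOP already uses, is free for the VOP's implementation to clobber.
    used = set(vop_inputs) | set(vop_outputs)
    return [r for r in arch_scratch_regs if r not in used]

print(scratch_registers(["EAX", "ECX", "EDX", "EBX"], ["EAX"], ["EDX"]))
# ['ECX', 'EBX']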

There is a minor difficulty with this type of random register clobbering. Some VOPs are "basic block" VOPs; basically, any VOP that cannot result in a change in the flow of control. Consecutive basic block VOPs are merged into a basic block, and more aggressive optimization is performed on it. For example, basic block VOPs are assumed not to touch any registers other than their inputs and outputs, and the basic block optimizer can use this assumption to eliminate dead loads and stores. Of course, basic block VOPs thus cannot use the new scratch register framework. This has not been a problem yet, since basic block VOPs tend to be simple, and the only VOPs that use scratch registers are the more complicated ones (for example, overflow-checking fixnum arithmetic), which are not basic block VOPs.

Finally, I had to fix some code in the x86 backend that assumed the word size was 32 bits. Surprisingly, there was very little code that made this assumption; in fact, it only came up in a handful of places, where a tagged fixnum was being converted into an object slot offset. Tagged fixnums are cells with the low 3 bits set to zero; the value of the fixnum is the cell shifted to the right by 3 bits. Object slot offsets are multiples of the word size. On x86, converting from a tagged fixnum to a slot offset is in effect division by two, or a right shift by 1. On AMD64, it is a no-op. Again, I simply abstracted out the assembler sequence that shifts a register to the right by 1 in the relevant places, replacing it with a word call. This word was defined in the old way in the x86 architecture description, and as a no-op in the AMD64 description.
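
The arithmetic, written out as a Python sketch (the 3 tag bits and the cell sizes are as described above; the function name is mine):

TAG_BITS = 3   # low 3 bits of a cell are the type tag; all zero for fixnums

def fixnum_to_slot_offset(tagged, cell_size):
    # tagged == value << 3, and the slot offset is value * cell_size, so
    # with 4-byte cells this is one right shift, and with 8-byte cells the
    # tagged value already is the offset.
    return (tagged >> TAG_BITS) * cell_size

print(fixnum_to_slot_offset(5 << 3, 4), (5 << 3) >> 1)   # 20 20  (x86)
print(fixnum_to_slot_offset(5 << 3, 8), 5 << 3)          # 40 40  (AMD64)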

The port is not finished yet. While basic code compiles and runs, more complicated words will raise sig11 errors or hang. There are some bugs to work out yet, but it should be pretty easy. Also, I have not yet implemented C library calls, fixnum overflow checks, or code relocation. Here is the status of each of these three items:
  • C library calls will actually require new code to be written; the existing x86 code cannot be adapted to this task. The reason is that the AMD64 calling convention is rather different from x86: function parameters are passed in registers, not on the stack, and stack frames have an elaborate structure that must be set up correctly, whereas on x86 pushing a return address suffices (see the first sketch after this list).
  • Fixnum overflow checks actually do work; however, the action they must take on overflow, upgrading fixnums to bignums, does not. The reason is, of course, that upgrading a fixnum to a bignum involves calling into the Factor runtime, and the Factor runtime is written in C, so the C library interface must be implemented first.
  • The final item, code relocation, concerns saving compiled code to disk. When the image is saved, compiled code blocks require associated relocation information, so that if the image is later loaded at a different base address, any absolute addresses and references to symbols in the Factor runtime or C libraries can be fixed up. The runtime currently only supports word-size relocation; so far this has been sufficient, but for AMD64 the ability to fix up both 32-bit and 64-bit addresses is required. This will not involve a lot of code, but might be a pain to debug; code relocation is always tricky because it is easy to screw up, and it is not always obvious how to fix things when they go wrong -- the image just crashes on startup because some random instruction got clobbered (the second sketch after this list shows what a single fixup amounts to).
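
Regarding the first item: on the System V AMD64 ABI (the one Linux uses), the first six integer or pointer arguments travel in registers, in a fixed order. A minimal sketch of that assignment -- this is general ABI knowledge, not Factor's actual alien-call code:

# System V AMD64 ABI integer/pointer argument registers, in order.
INT_ARG_REGS = ["RDI", "RSI", "RDX", "RCX", "R8", "R9"]

def assign_integer_args(args):
    # The first six integer/pointer arguments go in registers;
    # anything beyond that spills to the stack.
    in_regs = list(zip(INT_ARG_REGS, args))
    on_stack = args[len(INT_ARG_REGS):]
    return in_regs, on_stack

print(assign_integer_args(["a", "b", "c"]))
# ([('RDI', 'a'), ('RSI', 'b'), ('RDX', 'c')], [])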
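
And for the relocation item, a single fixup is conceptually tiny: patch a 4- or 8-byte field in the compiled code at a recorded offset. A hedged Python sketch (the names and the shape of the relocation record are mine, not Factor's image format):

import struct

def apply_fixup(code, offset, width, address):
    # Rewrite an absolute address embedded in compiled code after the image
    # has been loaded at a new base; AMD64 needs both 32- and 64-bit widths.
    struct.pack_into({4: "<I", 8: "<Q"}[width], code, offset, address)

code = bytearray(16)
apply_fixup(code, 2, 8, 0x00007F0000001234)   # patch a 64-bit address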


I hope to finish the port within the next week or so; after this is done, Factor will compile native code on three CPU architectures, and I will have mastered the assembly language for all three. Compiler-wise, my next project is rewriting the translation from dataflow IR to linear IR (VOPs); right now a lot of information in the dataflow IR is lost, resulting in sub-optimal register usage and VOP selection. Once this is done, I will implement floating point and complex number VOPs using the SSE2 vector math extensions for x86 and AMD64; this should result in a substantial performance gain for numerical code.

However, first, I would like to finish the new help system that I started, but put aside. I put it aside for two reasons: I wanted to finish the AMD64 port first, so that I can develop Factor on my Athlon instead of the slow-as-fuck Mac Mini, and because I am stuck on how best to design an extended formatted stream protocol. This concerns a generic way to perform word wrap, indentation, and tabular output. Right now the prettyprinter and inspector have some ad-hoc code to do this, but I don't want to duplicate it yet again for the help system's styled text engine; I'd rather consolidate it in one place, with a generic protocol allowing textual, HTML, and GUI output.
