Monday, October 09, 2006

Three cheers for C syntax

I found this lovely snippet in a C application which shall remain nameless:
void interpreter( void ) { for(;;) (*(prim *)(intptr_t)*(cell *)(intptr_t)(*++lc))(); }

So, which pairs of brackets correspond to casts, grouped expressions, and function invocations? What is the operator precedence between casts, dereferencing, and pre-increment? I really don't know.

Here is the PowerPC disassembly:
0x00002488 :     mflr    r0
0x0000248c : bcl- 20,4*cr7+so,0x2490 interpreter+8
0x00002490 : stmw r30,-8(r1)
0x00002494 : mflr r31
0x00002498 : stw r0,8(r1)
0x0000249c : stwu r1,-80(r1)
0x000024a0 : addis r2,r31,0
0x000024a4 : lwz r30,23748(r2)
0x000024a8 : lwz r2,0(r30)
0x000024ac : addi r2,r2,8
0x000024b0 : stw r2,0(r30)
0x000024b4 : lwz r9,4(r2)
0x000024b8 : lwz r0,4(r9)
0x000024bc : mr r12,r0
0x000024c0 : mtctr r0
0x000024c4 : bctrl
0x000024c8 : b 0x24a8 interpreter+32

Now if I wanted to figure things out, I could spend 5 minutes going through every instruction and keep track of registers with a paper and pen, but at least its possible. Not so with the C code.


Anonymous said...

That's a little harsh. I'm sure you can do just as bad in Factor; I know you can do much worse in APL.

Slava Pestov said...

In Factor and APL you could write code with pretty opaque semantics, but the syntax is clear. This C snippet doesn't even parse in my head for me.

Anonymous said...

I agree that C syntax can be messy but with a little bit of knowledge you can parse this code without much difficulty. I've noticed you express an aversion to C pointers before Slava so perhaps that's why you don't like this code!

You've kind of asked a trick question coz believe it or not, there's actually no precedence or associativity rules applying in this code.

void interpreter( void ) { for(;;) (*(prim *)(intptr_t)*(cell *)(intptr_t)(*++lc))(); }

We don't get to see what prim and cell are but the only thing that prim* (and cell*) can be is a type-name and hence (prim *) is a type cast. Although prim* is not a simple identifier it is still a type-name as defined by the C standard. Presumably intptr_t is pointer to int so (intptr_t) is a cast to int*. The intptr_t casts in this code are actually completely redundant.

For the *++lc part, the dereference operator (*) requires an expression of type "pointer" as its operand, so following the *, the compiler looks for an expression. ++ is not an expression but ++lc is an expression and hence ++lc is the operand of the de-reference operator i.e. the ++ is first applied to lc and the de-reference operator is applied to the result. There is no precedence or associativity rule applying here. (If the code were ++*lc then the asterisk would be applied first even though ++ has higher precedence.)

We can rewrite the original expression as

(* (typename1)(typename2) * (typename3)(typename4)(*++lc)) ()

You can see intuitively that first *++lc is evaluated, then typename4 cast is applied, then typename3 cast, then the de-reference asterisk following typename2, then typename2 cast, then typename1 cast, then de-reference, and finally a function call.

The leftmost asterisk operator requires an expression (with pointer type) as its operand. The compiler parses all the way to the parentheses following lc before it finds a valid expression and this is what gets de-referenced.

An example of when precedence applies is with *lc++. The asterisk operator requires an expression as its operand - lc by itself is a valid expression and so is lc++, so the compiler uses a precedence rule to decide that the ++ is applied before the asterisk.
You can see the precedence table for C/C++ here

Most or all precedence tables you'll see show unary operators as having right to left associativity but this is kind of misleading because for adjacent unary operators, it's either the lexical order or the precedence that determines which operator is applied first.
In the case of C++, the standard doesn't directly specify associativity and precedence in a table, but rather as a set of production rules e.g. the production rules for addition and multiplication are


additiveexpression + multiplicativeexpression

additiveexpression - multiplicativeexpression


multiplicative-expression * pmexpression

multiplicative-expression / pmexpression

multiplicative-expression % pmexpression

and if you think hard, you can see these rules mean multiply has higher precedence than addition and that in a+b+c, or a*b*c, the operators are left associative.

C, it's easy really :)


Anonymous said...

oops, I was bending the facts to fit my argument when I said that prefix ++ has higher precedence than de-refereence (*). Unary operators and typecast all have the same precedence and you can think of them as associating right to left if you want, as precedence tables usually show.

Precedence tables often have slight mistakes in them - on this web site
you can see there's no distinction between post and pre increment and they're both shown as associating right to left whereas on wikipedia you can see post inc/dec shown with higher precedence than pre-inc/dec and associating left to right (wikipedia is correct and it points out this is the same as Java). Post inc/dec are regarded as "postfix operators" and not as unary operators.

In C, it actually makes no difference whether postfix inc/dec are given higher or the same precedence as unary operators and right to left associativity because if you write
with right to left associativity, the ++ gets done first (and the result is an rvalue) and it's actually illegal to write
because ++ and -- require an lvalue as an argument and postfix ++ -- produce an rvalue.
This is unlike C++ where these operators can be overloaded and the precedence and associativity do matter.