Tuesday, September 02, 2008

Clearing out the dead wood

Old accessors removed

Back in the day, when you defined
TUPLE: person name age ;

You'd get a set of words person-name, person-age, set-person-name, set-person-age. The "new accessors" language change instead defined methods on the words name>>, age>>, >>name, >>age. More concise code, and more opportunities for polymorphism (you can write a word which calls name>> and pass it anything with that slot, not just a person) and less stack shuffling, since the new setter effect seems better in practice.

While most new code used new accessors, we still had some older code which used the old accessors. With very little help from me, Doug and Ed updated all of the remaining code in the repository to use new accessors and even removed them from the core.

Removing old accessors reduced the image size from 43Mb to 36Mb on Mac OS X x86, and correspondingly on other platforms. The reduced size of 36Mb seems large, however this is the default development image and it includes a lot of stuff, such as all the documentation, tools, and the UI. Nevertheless, it is still on the large side and I'm sure I will find tricks in the future to reduce its size. It is just that for the most part my focus has been on performance, features and bug fixes. Optimizing Factor for memory-constrained environments is another whole sub-project, something I'll tackle eventually.

In addition to image size, long bootstrap time is another contributing factor (pun intended) the impression some people have of Factor being bloated. Bootstrap time has improved recently too; the new optimizer is a lot faster than the old one, and removing old accessors reduced the number of words that need to be compiled which brought it down another notch. A week ago bootstrap took 8 minutes on my MacBook, and now it is down to around 4 minutes 20 seconds. Another 2x improvement would be nice.

New equality semantics on numbers

A change I've been contemplating for a while is changing the semantics of equality on numbers to only consider two numbers equal if they have the same value and the same type. I finally applied this change today. This means that 3 3.0 = would evaluate to true before, but now it is false. If you want the old semantics, the number= word still exists; 3 3.0 number= evaluates to true.

Performance improvements

The main reason I made the above change is to improve performance. First of all, there is less dispatching when numbers are compared for equality. Second of all, it helps with type inference of complex number arithmetic. This change implies that constructing a complex number with an imaginary part equal to floating point zero outputs a new complex number. Because of this, arithmetic operations on complex number inputs with floating point components always output complex numbers with floating point components, with no possibility of outputting a real number. This eliminates unnecessary dispatch.

Consider this word from the mandelbrot benchmark:
: iter ( c z nb-iter -- x )
dup 0 <= [ 2nip ] [
over absq 4.0 >= [ 2nip ] [
>r sq dupd + r> 1- iter
] if
] if ; inline recursive

The word was called from one place:
c C{ 0.0 0.0 } 60 iter

Before I changed equality semantics on numbers, there was very little that the compiler could do to optimize this, because all of the arithmetic operations would either output real or complex numbers. But now, since the inputs are known to be complex numbers with floating point components, so are the outputs, so all of the generic arithmetic dispatch can be eliminated. Furthermore, the new escape analysis pass unboxes all intermediate complex numbers. After inlining, dispatch elimination and unboxing, the above word turns into this monstrosity:
( scratchpad ) [ { float float } declare  C{ 0.0 0.0 } nb-iter iter ] optimized.
0.0 0.0 40 \ ( gensym ) [
dup 0 fixnum<=
[ ( 1262 1263 1264 1265 10 -- 10 ) ] [
( 1264 1265 10 -- 1264 1265 10 1264 1265 )
>r r>
>r dup float* r>
dup float* float+ 4.0 float>=
[ ( 1262 1263 1266 1267 18 -- 18 ) ] [
( 1278 1279 1280 1281 -- 1278 1279 1280 1281 1278 1279 1280 1281 )
( 1286 1287 1288 1289 -- 1286 1288 1289 1287 )
( 1292 1293 1295 -- 1292 1295 1293 )
>r r>
>r r>
>r float* r>
( 1282 1283 1284 1285 769 -- 769 1282 1283 1284 1285 )
( 1322 1323 1324 1325 -- 1322 1324 1325 1323 )
( 1328 1329 1331 -- 1328 1331 1329 )
>r r>
>r r>
>r float* swap r>
float* float+
( 1262 1263 1362 1363 -- 1262 1263 1262 1263 1362 1363 )
( 1366 1367 1368 1369 -- 1366 1368 1369 1367 )
( 1372 1373 1375 -- 1372 1375 1373 )
>r r>
>r r>
>r float+ r>
1 fixnum-fast ( gensym )
] if
] if
] label

Other than the stack shuffles, we see that the only words which remain are primitives such as float+ and float* so in fact, unboxed float arithmetic is taking place here.

The ( ... -- ... ) syntax is used by the prettyprinter for compiler IR to express stack shuffles for which no corresponding word exists. These come up as a result of other optimizations. In fact the overwhelming majority of the stack shuffles above are redundant, and they are eliminated by the code generator pass which runs after the optimizer. Compare with copy propagation and coalescing in compilers for applicative languages; optimizers are written to freely introduce redundant copies of values, with the knowledge that a subsequent pass eliminates them, because it simplifies the implementation.

So indeed, the above output is close to optimal as far as the optimizer is concerned. The code generator could still do register allocation more efficiently and generate better assembly sequences for float arithmetic, but this is a project for later.

The mandelbrot benchmark now runs in about 350 milliseconds, compared to 1.8 seconds before. This is a huge improvement, and it was all thanks to the new optimizer and the minor change to numeric equality semantics. I remember when the mandelbrot benchmark would take 20 seconds to run; it is one of the oldest benchmarks for Factor and eliminating all of the memory allocation and dispatch from the inner loop above has literally taken 4 years of work.

1 comment:

Max said...

About accessors. As I see, there is way to reduce accessors size overhead even move. What if we make only one accessor per slot that simply gets index of that slot in object array:

.slot (object -- object index)

And then use allready defined array setters and getters. Like this:

object .slot data set
object .slot get

I think, one accessor per slot is better than two.