<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-17087850</id><updated>2012-01-03T07:30:39.950-05:00</updated><title type='text'>Factor: a practical stack language</title><subtitle type='html'>&lt;a href="http://factorcode.org/slava"&gt;Slava Pestov&lt;/a&gt;'s weblog, primarily about &lt;a href="http://factorcode.org"&gt;Factor&lt;/a&gt;.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default?start-index=101&amp;max-results=100'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>521</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-17087850.post-2482709033725182854</id><published>2010-09-18T23:40:00.003-04:00</published><updated>2010-09-18T23:54:35.251-04:00</updated><title type='text'>Factor 0.94 now available</title><content 
type='html'>&lt;p&gt;Factor 0.94 is now available from the &lt;a href="http://factorcode.org"&gt;Factor website&lt;/a&gt;, five months after the previous release, Factor 0.93. Binaries are provided for 10 platforms.&lt;/p&gt; &lt;p&gt;As usual, contributors did most of the work. Thanks to Daniel Ehrenberg, Dmitry Shubin, Doug Coleman, Erik Charlebois, Joe Groff, John Benediktsson, Jose A. Ortega Ruiz, Niklas Waern, Samuel Tardieu, Sascha Matzke and everyone else who helped out this time around!&lt;/p&gt; &lt;h3&gt;Incompatible changes:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;The PowerPC architecture is no longer supported. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The &lt;a href="http://docs.factorcode.org/content/word-require-when%2Cvocabs.loader.html"&gt;require-when&lt;/a&gt; word now supports dependencies on multiple vocabularies. (Daniel Ehrenberg)&lt;/li&gt; &lt;li&gt;The &lt;code&gt;C-ENUM:&lt;/code&gt; word in the C library interface has been replaced with &lt;a href="http://docs.factorcode.org/content/word-ENUM__colon__%2Calien.syntax.html"&gt;ENUM:&lt;/a&gt;, a much improved word for defining type-safe enumerations. (Erik Charlebois, Joe Groff)&lt;/li&gt; &lt;li&gt;Tuple slot setter words with stack effect &lt;code&gt;( value object -- )&lt;/code&gt; are now named &lt;code&gt;foo&amp;lt;&amp;lt;&lt;/code&gt; instead of &lt;code&gt;(&gt;&gt;foo)&lt;/code&gt;. Most code is unaffected since it uses the &lt;code&gt;&gt;&gt;foo&lt;/code&gt; form. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The older &lt;code&gt;system-micros&lt;/code&gt; word, which returned microseconds since the Unix epoch as an integer, has been removed. For a while, the recommended way to get the current time has been to call the &lt;code&gt;now&lt;/code&gt; word from the &lt;a href="http://docs.factorcode.org/content/vocab-calendar.html"&gt;calendar&lt;/a&gt; vocabulary, which returned a &lt;a href="http://docs.factorcode.org/content/word-timestamp%2Ccalendar.html"&gt;timestamp&lt;/a&gt; instance. 
(Doug Coleman)&lt;/li&gt; &lt;li&gt;A few sequence-related words were moved from the &lt;a href="http://docs.factorcode.org/content/vocab-generalizations.html"&gt;generalizations&lt;/a&gt; vocabulary to &lt;a href="http://docs.factorcode.org/content/vocab-sequences.generalizations.html"&gt;sequences.generalizations&lt;/a&gt;. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The &lt;code&gt;alarms&lt;/code&gt; vocabulary has been renamed to &lt;a href="http://docs.factorcode.org/content/vocab-timers.html"&gt;timers&lt;/a&gt; to better explain its true purpose, with improved timing accuracy and robustness. (Doug Coleman)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-cocoa.subclassing.html"&gt;cocoa.subclassing&lt;/a&gt;: the syntax for defining new Objective-C classes has been changed to improve readability. (Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.streams.limited.html"&gt;io.streams.limited&lt;/a&gt;: the ability to throw errors on EOF was extracted from limited streams, and limited streams simplified as a result. Throwing on EOF is now implemented by the &lt;a href="http://docs.factorcode.org/content/vocab-io.streams.throwing.html"&gt;io.streams.throwing&lt;/a&gt; vocabulary. 
(Doug Coleman)&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;New libraries:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-boyer-moore.html"&gt;boyer-moore&lt;/a&gt;: efficient text search algorithm (Dmitry Shubin)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-checksums.internet.html"&gt;checksums.internet&lt;/a&gt;: implementation of checksum algorithm used by ICMP for the &lt;a href="http://docs.factorcode.org/content/vocab-checksums.html"&gt;checksums&lt;/a&gt; framework (John Benediktsson)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-gdbm.html"&gt;gdbm&lt;/a&gt;: disk-based hash library binding (Dmitry Shubin)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.encodings.detect.html"&gt;io.encodings.detect&lt;/a&gt;: binary file/text encoding detection heuristics from jEdit (Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-javascriptcore.html"&gt;javascriptcore&lt;/a&gt;: FFI to the WebKit JavaScript engine (Doug Coleman)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-lua.html"&gt;lua&lt;/a&gt;: FFI to the Lua scripting language (Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-oauth.html"&gt;oauth&lt;/a&gt;: minimal implementation of client-side OAuth (Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-sequences.unrolled.html"&gt;sequences.unrolled&lt;/a&gt;: efficient unrolled loops with constant iteration count (Joe Groff)&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;Improved libraries:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-cuda.html"&gt;cuda&lt;/a&gt;: various improvements (Joe Groff, Doug Coleman)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-game.input.html"&gt;game.input&lt;/a&gt;: now uses XInput2 on X11 (Niklas Waern)&lt;/li&gt; &lt;li&gt;&lt;a 
href="http://docs.factorcode.org/content/vocab-gpu.render.html"&gt;gpu.render&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-gpu.buffers.html"&gt;gpu.buffers&lt;/a&gt;: various improvements (Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-math.combinatorics.html"&gt;math.combinatorics&lt;/a&gt;: various improvements (John Benediktsson)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-math.primes.html"&gt;math.primes&lt;/a&gt;: various improvements (Samuel Tardieu)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-math.vectors.simd.cords.html"&gt;math.vectors.simd.cords&lt;/a&gt;: compound "256-bit" SIMD types now support the full set of SIMD operations (Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-mongodb.html"&gt;mongodb&lt;/a&gt;: various improvements (Sascha Matzke)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-twitter.html"&gt;twitter&lt;/a&gt;: now uses OAuth as required by Twitter (Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-unix.users.html"&gt;unix.users&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-unix.groups.html"&gt;unix.groups&lt;/a&gt;: various improvements (Doug Coleman)&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;Compiler improvements:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;Improved instruction selection, copy propagation, representation selection, and register allocation; details in a &lt;a href="http://factor-language.blogspot.com/2010/05/collection-of-small-compiler.html"&gt;blog post&lt;/a&gt;. (Slava Pestov)&lt;/li&gt; &lt;li&gt;An instruction scheduling pass now runs prior to register allocation, intended to reduce register pressure by moving uses closer to definitions; details in a &lt;a href="http://useless-factor.blogspot.com/2010/02/instruction-scheduling-for-register.html"&gt;blog post&lt;/a&gt;. 
(Daniel Ehrenberg)&lt;/li&gt; &lt;li&gt;The code generation for the C library interface has been revamped; details in a &lt;a href="http://factor-language.blogspot.com/2010/07/overhauling-factors-c-library-interface.html"&gt;blog post&lt;/a&gt;. (Slava Pestov)&lt;/li&gt; &lt;li&gt;Factor now has something similar to what C++ and C# programmers refer to as "value types": binary data can be allocated on the call stack and passed to C functions like any other pointer. The &lt;a href="http://docs.factorcode.org/content/word-with-out-parameters%2Calien.data.html"&gt;with-out-parameters&lt;/a&gt; combinator replaces tricky code for allocating and juggling multiple temporary byte arrays used as out parameters for C function calls, making this idiom easier to read and more efficient. The &lt;a href="http://docs.factorcode.org/content/word-with-scoped-allocation,alien.data.html"&gt;with-scoped-allocation&lt;/a&gt; combinator presents a more general, lower-level interface. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The compiler can now use the x87 floating point unit on older CPUs where SSE2 is not available. However, this is not recommended, because the build farm does not test Factor (or build any binaries) in x87 mode, so this support can break at any time. To use x87 code generation, you must download the source code and bootstrap Factor yourself, on a CPU without SSE2. (Slava Pestov)&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;Miscellaneous improvements:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;&lt;code&gt;fuel&lt;/code&gt;: Factor's Ultimate Emacs Library has seen many improvements, and some keyboard shortcuts have changed; see the &lt;a href=""&gt;README&lt;/a&gt;. (Erik Charlebois, Dmitry Shubin, Jose A. Ortega Ruiz)&lt;/li&gt; &lt;li&gt;A new &lt;code&gt;factor.cmd&lt;/code&gt; script is now included in the &lt;code&gt;build-support&lt;/code&gt; directory, to automate the update/build/bootstrap cycle for those who build from source. 
Its functionality is a subset of the &lt;code&gt;factor.sh&lt;/code&gt; script for Unix. (Joe Groff)&lt;/li&gt; &lt;li&gt;The default set of icons shipped in &lt;code&gt;misc&lt;/code&gt; has been tweaked, with better contrast and improved appearance when scaled down. (Joe Groff)&lt;/li&gt; &lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2482709033725182854?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2482709033725182854/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2482709033725182854' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2482709033725182854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2482709033725182854'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/09/factor-094-now-available.html' title='Factor 0.94 now available'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3219921363760945818</id><published>2010-09-11T22:23:00.006-04:00</published><updated>2010-09-13T23:53:29.344-04:00</updated><title type='text'>An overview of Factor's I/O library</title><content type='html'>&lt;p&gt;Factor has grown a very powerful and high-level I/O library over the years; however, it is easy to get lost in the forest of reference documentation surrounding the &lt;a 
href="http://docs.factorcode.org/content/vocab-io.html"&gt;io&lt;/a&gt; vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for further reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.&lt;/p&gt; &lt;p&gt;Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular, &lt;a href="http://docs.factorcode.org/content/author-Doug%20Coleman.html"&gt;Doug Coleman&lt;/a&gt; did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.&lt;/p&gt; &lt;p&gt;Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX we try to come up with clear and coherent APIs that make sense on both Windows and Unix.&lt;/p&gt; &lt;h3&gt;First example: converting a text file from MacRoman to UTF8&lt;/h3&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-io.files.html"&gt;io.files&lt;/a&gt; vocabulary defines words for reading and writing files. 
It supports two modes of operation in a pretty standard fashion:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Stream-based: &lt;a href="http://docs.factorcode.org/content/word-with-file-reader,io.files.html"&gt;with-file-reader&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-with-file-writer,io.files.html"&gt;with-file-writer&lt;/a&gt;&lt;/li&gt; &lt;li&gt;Entire file: &lt;a href="http://docs.factorcode.org/content/word-file-contents,io.files.html"&gt;file-contents&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-set-file-contents,io.files.html"&gt;set-file-contents&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-file-lines,io.files.html"&gt;file-lines&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-set-file-lines,io.files.html"&gt;set-file-lines&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.&lt;/p&gt; &lt;p&gt;Let's convert &lt;code&gt;foo.txt&lt;/code&gt; from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:&lt;/p&gt; &lt;pre&gt;USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;&lt;br /&gt;&lt;br /&gt;"foo.txt" mac-roman file-contents&lt;br /&gt;"foo.txt" utf8 set-file-contents&lt;/pre&gt; &lt;p&gt;This is a very simple and concise implementation but it has the downside that the entire file is read into memory. 
For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:&lt;/p&gt; &lt;pre&gt;USING: io io.encodings.8-bit.mac-roman io.encodings.utf8&lt;br /&gt;io.files ;&lt;br /&gt;&lt;br /&gt;"out.txt" utf8 [&lt;br /&gt;    "in.txt" mac-roman [&lt;br /&gt;        [ print ] each-line&lt;br /&gt;    ] with-file-reader&lt;br /&gt;] with-file-writer&lt;/pre&gt; &lt;h3&gt;Converting a directory full of files from MacRoman to UTF8&lt;/h3&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-io.directories.html"&gt;io.directories&lt;/a&gt; vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:&lt;/p&gt; &lt;pre&gt;USING: io.directories io.encodings.8-bit.mac-roman&lt;br /&gt;io.encodings.utf8 io.files sequences ;&lt;br /&gt;&lt;br /&gt;: convert-directory ( path -- )&lt;br /&gt;    [&lt;br /&gt;        [&lt;br /&gt;            [ mac-roman file-contents ] keep&lt;br /&gt;            utf8 set-file-contents&lt;br /&gt;        ] each&lt;br /&gt;    ] with-directory-files ;&lt;/pre&gt; &lt;h3&gt;An aside: generalizing the "current working directory"&lt;/h3&gt; &lt;p&gt;If you run the following, you will see that &lt;code&gt;with-directory-files&lt;/code&gt; returns relative, and not absolute, file names:&lt;/p&gt; &lt;pre&gt;"/path/to/some/directory"&lt;br /&gt;[ [ print ] each ] with-directory-files&lt;/pre&gt; &lt;p&gt;So the question is, how did &lt;code&gt;file-contents&lt;/code&gt; above know what directory to look for files in? 
The answer is that in addition to calling the quotation with the directory's contents, the &lt;code&gt;with-directory-files&lt;/code&gt; word also rebinds the &lt;a href="http://docs.factorcode.org/content/word-current-directory,io.pathnames.html"&gt;current-directory&lt;/a&gt; dynamic variable.&lt;/p&gt; &lt;p&gt;This variable is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically-scoped; within the quotation passed to the &lt;code&gt;with-directory&lt;/code&gt; combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, and code after the quotation, are unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the &lt;code&gt;normalize-path&lt;/code&gt; word before being handed off to the operating system.&lt;/p&gt; &lt;p&gt;When calling a shell command with &lt;code&gt;io.launcher&lt;/code&gt;, the child process is run from the Factor &lt;code&gt;current-directory&lt;/code&gt; so relative pathnames passed on the command line will just work. 
However, when making C FFI calls which take pathnames, you must either pass in absolute paths only, or normalize the path with &lt;code&gt;normalize-path&lt;/code&gt; first; otherwise the C code will search for it in the wrong place.&lt;/p&gt; &lt;h3&gt;Checking free disk space&lt;/h3&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-io.files.info.html"&gt;io.files.info&lt;/a&gt; vocabulary defines two words which return tuples containing information about a file, and the file system containing the file, respectively:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/word-file-info,io.files.info.html"&gt;file-info&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/word-file-system-info,io.files.info.html"&gt;file-system-info&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:&lt;/p&gt; &lt;pre&gt;ERROR: buy-a-new-disk ;&lt;br /&gt;&lt;br /&gt;: gb ( m -- n ) 30 2^ * ;&lt;br /&gt;&lt;br /&gt;: check-space ( -- )&lt;br /&gt;    home file-system-info free-space&gt;&gt; 10 gb &amp;lt;&lt;br /&gt;    [ buy-a-new-disk ] when ;&lt;/pre&gt; &lt;p&gt;Now if there is less than 10 GB available, the &lt;code&gt;check-space&lt;/code&gt; word will throw a &lt;code&gt;buy-a-new-disk&lt;/code&gt; error.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;file-system-info&lt;/code&gt; word reports a bunch of other info. 
There is a Factor implementation of the Unix &lt;code&gt;df&lt;/code&gt; command in the &lt;a href="http://docs.factorcode.org/content/vocab-tools.files.html"&gt;tools.files&lt;/a&gt; vocabulary: &lt;pre&gt;( scratchpad ) file-systems.&lt;br /&gt;+device-name+ +available-space+ +free-space+ +used-space+ +total-space+ +percent-used+ +mount-point+&lt;br /&gt;/dev/disk0s2  15955816448       16217960448  183487713280 199705673728  91             /&lt;br /&gt;fdesc         0                 0            1024         1024          100            /dev&lt;br /&gt;fdesc         0                 0            1024         1024          100            /dev&lt;br /&gt;map -hosts    0                 0            0            0             0              /net&lt;br /&gt;map auto_home 0                 0            0            0             0              /home&lt;br /&gt;/dev/disk1s2  15922262016       15922262016  383489052672 399411314688  96             /Users/slava&lt;/pre&gt; &lt;p&gt;Doug has two blog posts about these features, &lt;a href="http://code-factor.blogspot.com/2009/01/files-and-file-systems-in-factor-part-1.html"&gt;part 1&lt;/a&gt; and &lt;a href="http://code-factor.blogspot.com/2009/01/files-and-file-systems-in-factor-part-2.html"&gt;part 2&lt;/a&gt;.&lt;/p&gt; &lt;h3&gt;Unix only: symbolic links&lt;/h3&gt; &lt;p&gt;Factor knows about symbolic links on Unix. The &lt;a href="http://docs.factorcode.org/content/vocab-io.files.links.html"&gt;io.files.links&lt;/a&gt; vocabulary defines a pair of words, &lt;a href="http://docs.factorcode.org/content/word-make-link,io.files.links.html"&gt;make-link&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/word-make-hard-link,io.files.links.html"&gt;make-hard-link&lt;/a&gt;. 
The &lt;a href="http://docs.factorcode.org/content/word-link-info,io.files.info.html"&gt;link-info&lt;/a&gt; word is like &lt;a href="http://docs.factorcode.org/content/word-file-info,io.files.info.html"&gt;file-info&lt;/a&gt; except it doesn't follow symbolic links. Finally, the &lt;a href="http://docs.factorcode.org/content/vocab-io.directories.hierarchy.html"&gt;directory hierarchy traversal words&lt;/a&gt; don't follow links, so a link cycle or bogus link to / somewhere won't break everything.&lt;/p&gt; &lt;h3&gt;File system monitoring&lt;/h3&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-io.monitors.html"&gt;io.monitors&lt;/a&gt; vocabulary implements real-time file and directory change monitoring. Unfortunately, at this point in time it is only supported on Windows, Linux and Mac. Neither FreeBSD nor OpenBSD exposes the necessary information to user space.&lt;/p&gt; &lt;p&gt;Here is an example of watching a directory for changes and logging them:&lt;/p&gt; &lt;pre&gt;USING: accessors io io.monitors kernel ;&lt;br /&gt;&lt;br /&gt;: watch-loop ( monitor -- )&lt;br /&gt;    dup next-change path&gt;&gt; print flush watch-loop ;&lt;br /&gt;&lt;br /&gt;: watch-directory ( path -- )&lt;br /&gt;    [ t [ watch-loop ] with-monitor ] with-monitors ;&lt;/pre&gt; &lt;p&gt;Try pasting the above code into a Factor listener window, and then run &lt;code&gt;home watch-directory&lt;/code&gt;. Every time a file in your home directory is modified, its full pathname will be printed in the listener.&lt;/p&gt; &lt;p&gt;Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.&lt;/p&gt; &lt;h3&gt;Memory mapped files&lt;/h3&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-io.mmap.html"&gt;io.mmap&lt;/a&gt; vocabulary defines support for working with memory-mapped files. 
The highest-level and easiest-to-use interface is the &lt;a href="http://docs.factorcode.org/content/word-with-mapped-array,io.mmap.html"&gt;with-mapped-array&lt;/a&gt; combinator. It takes a file name, a &lt;a href="http://docs.factorcode.org/content/article-c-types-specs.html"&gt;C type&lt;/a&gt;, and a quotation. The quotation can perform generic sequence operations on the mapped file.&lt;/p&gt; &lt;p&gt;Here is an example which reverses each group of 4 bytes:&lt;/p&gt; &lt;pre&gt;USING: alien.c-types grouping io.mmap kernel sequences&lt;br /&gt;specialized-arrays ;&lt;br /&gt;SPECIALIZED-ARRAY: char&lt;br /&gt;&lt;br /&gt;"mydata.dat" char [&lt;br /&gt;    4 &amp;lt;sliced-groups&gt;&lt;br /&gt;    [ reverse! drop ] each&lt;br /&gt;] with-mapped-array&lt;/pre&gt; &lt;p&gt;The &lt;a href="http://docs.factorcode.org/content/word-__lt__sliced-groups__gt__%2Cgrouping.html"&gt;&amp;lt;sliced-groups&gt;&lt;/a&gt; word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.&lt;/p&gt; &lt;p&gt;A more efficient implementation of the above is also possible, by mapping in the file as an &lt;code&gt;int&lt;/code&gt; array and then performing bitwise arithmetic on the elements.&lt;/p&gt; &lt;h3&gt;Launching processes&lt;/h3&gt; &lt;p&gt;Factor's &lt;a href="http://docs.factorcode.org/content/vocab-io.launcher.html"&gt;io.launcher&lt;/a&gt; vocabulary was originally developed for use by the &lt;a href="http://factor-language.blogspot.com/2010/09/making-factors-continuous-build-system.html"&gt;build farm&lt;/a&gt;. The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.&lt;/p&gt; &lt;p&gt;The central concept in the library is the &lt;code&gt;process&lt;/code&gt; tuple, constructed by calling &lt;code&gt;&amp;lt;process&gt;&lt;/code&gt;. 
Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. Then the process can be run in various ways: in the foreground, in the background, or with input and output attached to Factor streams.&lt;/p&gt; &lt;p&gt;The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read from or write to, a file name to append to, or even supply a pipe object, constructed from the &lt;a href="http://docs.factorcode.org/content/vocab-io.pipes.html"&gt;io.pipes&lt;/a&gt; vocabulary.&lt;/p&gt; &lt;pre&gt;&amp;lt;process&gt;&lt;br /&gt;    "rotate-logs" &gt;&gt;command&lt;br /&gt;    +closed+ &gt;&gt;stdin&lt;br /&gt;    "out.txt" &gt;&gt;stdout&lt;br /&gt;    "error.log" &amp;lt;appender&gt; &gt;&gt;stderr&lt;/pre&gt; &lt;p&gt;It is possible to specify a timeout when running a process:&lt;/p&gt; &lt;pre&gt;&amp;lt;process&gt;&lt;br /&gt;    { "ssh" "myhost" "-l" "jane" "do-calculation" } &gt;&gt;command&lt;br /&gt;    15 minutes &gt;&gt;timeout&lt;br /&gt;    "results.txt" &gt;&gt;stdout&lt;br /&gt;run-process&lt;/pre&gt; &lt;p&gt;The process will be killed if it runs for longer than the timeout period. Many other features are supported: setting environment variables, setting process priority, and so on. The &lt;a href="http://docs.factorcode.org/content/vocab-io.launcher.html"&gt;io.launcher&lt;/a&gt; vocabulary has all the details.&lt;/p&gt; &lt;p&gt;Support for timeouts is a cross-cutting concern that touches many parts of the I/O API. This support is consolidated in the &lt;a href="http://docs.factorcode.org/content/vocab-io.timeouts.html"&gt;io.timeouts&lt;/a&gt; vocabulary. 
The &lt;code&gt;set-timeout&lt;/code&gt; generic word is supported by all external resources which provide interruptible blocking operations.&lt;/p&gt; &lt;p&gt;Timeouts are implemented on top of our &lt;a href="http://code-factor.blogspot.com/2009/11/monotonic-timers.html"&gt;monotonic timer support&lt;/a&gt;; changing your system clock while Factor is running won't screw with active network connections.&lt;/p&gt; &lt;h3&gt;Unix only: file ownership and permissions&lt;/h3&gt; &lt;p&gt;The &lt;code&gt;io.files.info.unix&lt;/code&gt; vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. This is an essential component of any multi-language quine:&lt;/p&gt; &lt;pre&gt;USING: io.encodings.ascii io.files io.files.info.unix&lt;br /&gt;io.launcher ;&lt;br /&gt;&lt;br /&gt;"""&lt;br /&gt;#!/bin/sh&lt;br /&gt;echo "Hello, polyglot"&lt;br /&gt;""" "script.sh" ascii set-file-contents&lt;br /&gt;OCT: 755 "script.sh" set-file-permissions&lt;br /&gt;"./script.sh" run-process&lt;/pre&gt; &lt;p&gt;There are even more Unix-specific words in the &lt;a href="http://docs.factorcode.org/content/vocab-unix.users.html"&gt;unix.users&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/vocab-unix.groups.html"&gt;unix.groups&lt;/a&gt; vocabularies. 
Using these words enables listing all users on the system, converting user names to UIDs and back, and even &lt;code&gt;setuid&lt;/code&gt; and &lt;code&gt;setgid&lt;/code&gt;.&lt;/p&gt; &lt;h3&gt;Networking&lt;/h3&gt; &lt;p&gt;Factor's &lt;a href="http://docs.factorcode.org/content/vocab-io.sockets.html"&gt;io.sockets&lt;/a&gt; vocabulary supports stream and packet-based networking.&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Stream-based: &lt;a href="http://docs.factorcode.org/content/word-with-client,io.sockets.html"&gt;with-client&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-__lt__server__gt__,io.sockets.html"&gt;&amp;lt;server&gt;&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-accept,io.sockets.html"&gt;accept&lt;/a&gt;&lt;/li&gt; &lt;li&gt;Packet-based: &lt;a href="http://docs.factorcode.org/content/word-__lt__datagram__gt__,io.sockets.html"&gt;&amp;lt;datagram&gt;&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-send%2Cio.sockets.html"&gt;send&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/word-receive%2Cio.sockets.html"&gt;receive&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. 
When a network socket is constructed, that endpoint is bound to a given address specifier.&lt;/p&gt; &lt;p&gt;Here is an example of connecting to &lt;code&gt;http://www.apple.com&lt;/code&gt;, sending a GET request, and reading the result:&lt;/p&gt; &lt;pre&gt;USING: io io.encodings.utf8 io.sockets ;&lt;br /&gt;&lt;br /&gt;"www.apple.com" 80 &amp;lt;inet&gt; utf8 [&lt;br /&gt;    """GET / HTTP/1.1\r&lt;br /&gt;host: www.apple.com\r&lt;br /&gt;connection: close\r&lt;br /&gt;\r&lt;br /&gt;""" write flush&lt;br /&gt;    contents&lt;br /&gt;] with-client&lt;br /&gt;print&lt;/pre&gt; &lt;p&gt;SSL support is almost transparent; the only difference is that the address specifier is wrapped in &lt;code&gt;&amp;lt;secure&gt;&lt;/code&gt;:&lt;/p&gt; &lt;pre&gt;USING: io io.encodings.utf8 io.sockets&lt;br /&gt;io.sockets.secure ;&lt;br /&gt;&lt;br /&gt;"www.cia.gov" 443 &amp;lt;inet&gt; &amp;lt;secure&gt; utf8 [&lt;br /&gt;    """GET / HTTP/1.1\r&lt;br /&gt;host: www.cia.gov\r&lt;br /&gt;connection: close\r&lt;br /&gt;\r&lt;br /&gt;""" write flush&lt;br /&gt;    contents&lt;br /&gt;] with-client&lt;br /&gt;print&lt;/pre&gt; &lt;p&gt;For details, see the &lt;a href="http://docs.factorcode.org/content/vocab-io.sockets.secure.html"&gt;io.sockets.secure&lt;/a&gt; documentation, and my &lt;a href="http://factor-language.blogspot.com/2008/05/ssltls-support-added.html"&gt;blog post about SSL in Factor&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Of course, you'd never send HTTP requests directly using sockets; instead you'd use the &lt;a href="http://docs.factorcode.org/content/vocab-http.client.html"&gt;http.client&lt;/a&gt; vocabulary.&lt;/p&gt; &lt;h3&gt;Network servers&lt;/h3&gt; &lt;p&gt;Factor's &lt;a href="http://docs.factorcode.org/content/vocab-io.servers.connection.html"&gt;io.servers.connection&lt;/a&gt; vocabulary is so cool that a couple of years back I made a &lt;a href="http://factor.blip.tv/file/1316060/"&gt;screencast about it&lt;/a&gt;. 
Nowadays the sample application developed in that screencast is in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/time-server/time-server.factor;hb=HEAD"&gt;extra/time-server&lt;/a&gt;; the implementation is very concise and elegant.&lt;/p&gt; &lt;h3&gt;Under the hood&lt;/h3&gt; &lt;p&gt;All of this functionality is implemented in pure Factor code on top of our excellent &lt;a href="http://docs.factorcode.org/content/article-alien.html"&gt;C library interface&lt;/a&gt; and extensive bindings to POSIX and Win32 in the &lt;a href="http://docs.factorcode.org/content/vocab-unix.html"&gt;unix&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/vocab-windows.html"&gt;windows&lt;/a&gt; vocabulary hierarchies, respectively.&lt;/p&gt; &lt;p&gt;As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently &lt;a href="http://factor-language.blogspot.com/2010/04/switching-call-stacks-on-different.html"&gt;rewrote the coroutines implementation to use direct context switching rather than continuations&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Co-ordination and scheduling of coroutines is handled with a series of &lt;a href="http://factor-language.blogspot.com/2008/02/some-changes-to-threads.html"&gt;simple concurrency abstractions&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3219921363760945818?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3219921363760945818/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3219921363760945818' title='5 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3219921363760945818'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3219921363760945818'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/09/overview-of-factors-io-library.html' title='An overview of Factor&apos;s I/O library'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3585300047922066265</id><published>2010-09-05T20:50:00.006-04:00</published><updated>2010-09-06T02:11:35.333-04:00</updated><title type='text'>Making Factor's continuous build system more robust</title><content type='html'>&lt;p&gt;I've done some work on &lt;a href="http://concatenative.org/wiki/view/Factor/Build%20farm"&gt;Factor's continuous build system&lt;/a&gt; over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from &lt;a href="http://factorcode.org"&gt;factorcode.org&lt;/a&gt; was built, tested and benchmarked by mason.&lt;/p&gt; &lt;h3&gt;Checking for disk space&lt;/h3&gt; &lt;p&gt;Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. 
If less than 1 GB is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the &lt;a href="http://builds.factorcode.org/report?os=macosx&amp;cpu=x86.32"&gt;latest Mac OS X report&lt;/a&gt;. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.&lt;/p&gt; &lt;p&gt;I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform &lt;code&gt;file-system-info&lt;/code&gt; library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.&lt;/p&gt; &lt;h3&gt;Git repository corruption&lt;/h3&gt; &lt;p&gt;Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error:&lt;/p&gt; &lt;pre&gt;From git://factorcode.org/git/factor&lt;br /&gt; * branch            master     -&gt; FETCH_HEAD&lt;br /&gt;Updating d386ea7..eece1e3&lt;br /&gt;&lt;br /&gt;error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.&lt;/pre&gt; &lt;p&gt;Of course nobody actually edits files in the repository in question; it's a clone of the official repo that gets updated every 5 minutes. 
Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.&lt;/p&gt; &lt;p&gt;If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.&lt;/p&gt; &lt;h3&gt;Error e-mail throttling&lt;/h3&gt; &lt;p&gt;Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to &lt;a href="http://sourceforge.net/mailarchive/forum.php?forum_name=factor-builds"&gt;Factor-builds&lt;/a&gt;. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the &lt;a href="http://sourceforge.net/mailarchive/forum.php?forum_name=factor-builds"&gt;Factor-builds list&lt;/a&gt; would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.&lt;/p&gt; &lt;h3&gt;More robust new code check&lt;/h3&gt; &lt;p&gt;Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).&lt;/p&gt; &lt;h3&gt;Build system dashboard&lt;/h3&gt; &lt;p&gt;I've put together &lt;a href="http://builds.factorcode.org/dashboard"&gt;a simple dashboard page&lt;/a&gt; showing build system status. 
Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days later, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.&lt;/p&gt; &lt;p&gt;The dashboard nicely complements my earlier work on &lt;a href="http://factor-language.blogspot.com/2009/05/live-status-display-for-factors-build.html"&gt;live status display&lt;/a&gt; for the build farm.&lt;/p&gt; &lt;h3&gt;Conclusion&lt;/h3&gt; &lt;p&gt;I think mason is one of the most advanced continuous integration systems among open source language implementations, never mind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. 
Here is a selection of the functionality from Factor's standard library used by mason:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.files.info.html"&gt;io.files.info&lt;/a&gt; - checking disk usage&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.launcher.html"&gt;io.launcher&lt;/a&gt; - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.timeouts.html"&gt;io.timeouts&lt;/a&gt; - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-http.client.html"&gt;http.client&lt;/a&gt; - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-smtp.html"&gt;smtp&lt;/a&gt; - sending build report e-mails&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-twitter.html"&gt;twitter&lt;/a&gt; - Tweeting binary upload notifications&lt;/li&gt;&lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-oauth.html"&gt;oauth&lt;/a&gt; - yes, Factor has a library to support the feared OAuth. 
Everyone complains about how hard OAuth is, but if you have easy-to-use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-xml.syntax.html"&gt;xml.syntax&lt;/a&gt; - constructing HTML-formatted build reports using XML literal syntax&lt;/li&gt; &lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3585300047922066265?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3585300047922066265/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3585300047922066265' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3585300047922066265'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3585300047922066265'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/09/making-factors-continuous-build-system.html' title='Making Factor&apos;s continuous build system more robust'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5808076548128250184</id><published>2010-09-03T23:57:00.003-04:00</published><updated>2010-09-04T15:06:56.438-04:00</updated><title type='text'>Two things every Unix developer should know</title><content type='html'>&lt;p&gt;Unix programming can be tricky. There are many subtleties that developers are often unaware of. 
In this post, I will describe just two of them... my favorite Unix quirks, if you will.&lt;/p&gt; &lt;h3&gt;Interruptible system calls&lt;/h3&gt; &lt;p&gt;On Unix, any system call which blocks can potentially fail with an errno of &lt;code&gt;EINTR&lt;/code&gt;, which indicates that the caller must retry the system call. The &lt;code&gt;EINTR&lt;/code&gt; error can be raised at any time for any reason, so essentially &lt;i&gt;every&lt;/i&gt; I/O operation on a Unix system must be prepared to handle this error properly. Surprisingly to some, this includes the C standard library functions such as &lt;code&gt;fread()&lt;/code&gt;, &lt;code&gt;fwrite()&lt;/code&gt;, and so on.&lt;/p&gt; &lt;p&gt;For example, if you are writing a network server, then most of the time, you want to ignore the &lt;code&gt;SIGPIPE&lt;/code&gt; signal which is raised when the client closes its end of a socket. However, this ignored signal can cause some pending I/O in the server to return &lt;code&gt;EINTR&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;A commonly-held belief is that setting the &lt;code&gt;SA_RESTART&lt;/code&gt; flag with the &lt;code&gt;sigaction()&lt;/code&gt; system call means that if that signal is delivered, system calls are restarted for you and &lt;code&gt;EINTR&lt;/code&gt; doesn't need to be handled. Unfortunately this is not true. The reason is that certain signals are unmaskable. For instance, on Mac OS X, if your process is blocked reading from standard input, and the user suspends the program by sending it &lt;code&gt;SIGSTOP&lt;/code&gt; (or, more commonly, by pressing ^Z in the terminal, which sends the closely related &lt;code&gt;SIGTSTP&lt;/code&gt;), then upon resumption, your &lt;code&gt;read()&lt;/code&gt; call will immediately fail with &lt;code&gt;EINTR&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;Don't believe me? The Mac OS X &lt;code&gt;cat&lt;/code&gt; program is not actually interrupt-safe, and has this bug. 
Run &lt;code&gt;cat&lt;/code&gt; with no arguments in a terminal, press ^Z, then type &lt;code&gt;%1&lt;/code&gt;, and you'll get an error from cat!&lt;/p&gt; &lt;pre&gt;$ cat&lt;br /&gt;^Z&lt;br /&gt;[1]+  Stopped                 cat&lt;br /&gt;$ %1&lt;br /&gt;cat&lt;br /&gt;cat: stdin: Interrupted system call&lt;br /&gt;$&lt;/pre&gt; &lt;p&gt;As far as I'm aware, Factor properly handles interruptible system calls, and has for a while now, thanks to &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt; explaining the issue to me 4 years ago. Not having to deal with crap like this (not to mention being able to write cross-platform code that runs on both Unix and Windows) is one of the advantages of using a high-level language like Factor or Java over C.&lt;/p&gt; &lt;h3&gt;Subprocesses inherit semi-random things from the parent process&lt;/h3&gt; &lt;p&gt;When you &lt;code&gt;fork()&lt;/code&gt; your process, various things are copied from the parent to the child; environment variables, file descriptors, the ignored signal mask, and so on. Less obvious is the fact that &lt;code&gt;exec()&lt;/code&gt; doesn't reset everything. If shared file descriptors such as stdin and stdout were set to non-blocking in the parent, the child will start with these descriptors non-blocking also, which will most likely break most programs. &lt;a href="http://factor-language.blogspot.com/2008/07/dont-set-shared-file-descriptors-to-non.html"&gt;I've blogged about this problem before&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;A similar issue is that if you elect to ignore certain signals with the &lt;code&gt;SIG_IGN&lt;/code&gt; action using &lt;code&gt;sigaction()&lt;/code&gt;, then subprocesses will inherit this behavior. Again, this can break processes. 
Until yesterday, Factor would ignore &lt;code&gt;SIGPIPE&lt;/code&gt; using this mechanism, and child processes spawned with the &lt;code&gt;io.launcher&lt;/code&gt; vocabulary that expected to receive &lt;code&gt;SIGPIPE&lt;/code&gt; would not work properly. There are various workarounds; you can reset the signal mask before the &lt;code&gt;exec()&lt;/code&gt; call, or you can do what I did in Factor, and ignore the signal by giving it an empty signal handler instead of a &lt;code&gt;SIG_IGN&lt;/code&gt; action.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5808076548128250184?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5808076548128250184/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5808076548128250184' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5808076548128250184'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5808076548128250184'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/09/two-things-every-unix-developer-should.html' title='Two things every Unix developer should know'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3474927088354023566</id><published>2010-07-28T02:36:00.002-04:00</published><updated>2010-07-28T02:36:39.168-04:00</updated><title type='text'>Overhauling Factor's 
C library interface</title><content type='html'>&lt;p&gt;For the last while I've been working on an overhaul of Factor's &lt;a href="http://docs.factorcode.org/content/article-alien.html"&gt;C library interface&lt;/a&gt; and associated compiler infrastructure. The goals of the rewrite were three-fold:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Improving performance&lt;/li&gt; &lt;li&gt;Cleaning up some crusty old code that dates back to my earliest experiments with native code generation&lt;/li&gt; &lt;li&gt;Laying the groundwork for future compiler improvements&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;These changes were all behind-the-scenes; I did not introduce any new functionality or language changes, and I think Factor's FFI is already quite stable and feature-complete.&lt;/p&gt; &lt;h3&gt;The previous FFI implementation&lt;/h3&gt; &lt;p&gt;I started work on Factor's C library interface (FFI) almost immediately after I bootstrapped the native implementation off of JFactor. I began experimenting with an SDL-based UI early on and immediately decided I wanted to have a real FFI, instead of extending the VM with primitives which wrap C functions by hand.&lt;/p&gt; &lt;p&gt;Over time, both the FFI and compiler have evolved in parallel, but there was little integration between them, other than the fact that both used the same &lt;a href="http://factor-language.blogspot.com/2009/09/survey-of-domain-specific-languages-in.html"&gt;assembler DSL&lt;/a&gt; under the hood. The result is that FFI calls were generated in a somewhat inefficient way. Since the optimizing compiler had no knowledge of how they work, it would save all values to the data stack first. Then the FFI code generator would take over; it would pop the input parameters one by one, pass them to a per-type C function in the Factor VM which unboxed the value, then stored the value in the stack frame. 
Finally, the target C function would be invoked, then another C function in the VM would box the return value, and the return value would be pushed to the stack. The optimizing compiler would then take over, possibly generating code to pop the value from the stack.&lt;/p&gt; &lt;p&gt;The redundant stack traffic was wasteful. In some cases, the optimizing compiler would generate code to box a value and push it to the stack, only to have the FFI then generate code to pop it and unbox it immediately after. To make matters worse, over time the optimizing compiler gained the ability to box and unbox values with open-coded assembly sequences, but the FFI would still make calls into the VM to do it.&lt;/p&gt; &lt;p&gt;All in all, it was about time I rewrote the FFI, modernizing it and integrating it with the rest of the compiler in the process.&lt;/p&gt; &lt;h3&gt;Factoring FFI calls into simpler instructions&lt;/h3&gt; &lt;p&gt;Most &lt;a href="http://github.com/slavapestov/factor/blob/master/basis/compiler/cfg/instructions/instructions.factor"&gt;low-level IR instructions&lt;/a&gt; are very simple; FFI calls used to be the exception. Now I've split these up into smaller instructions. Parameters and return values are now read and written from the stack with the same &lt;code&gt;##peek&lt;/code&gt; and &lt;code&gt;##replace&lt;/code&gt; instructions that everything else uses, and boxing and unboxing parameters and return values is now done with the &lt;a href="http://factor-language.blogspot.com/2010/05/collection-of-small-compiler.html"&gt;representation selection pass&lt;/a&gt;. A couple of oddball C types, such as &lt;code&gt;long long&lt;/code&gt; on x86-32, still require VM calls to box and unbox, and I added new instructions for those.&lt;/p&gt; &lt;p&gt;One slightly tricky thing that came up was that I had to re-engineer low-level IR to support instructions with multiple output values. 
This is required for C calls which return structs and &lt;code&gt;long long&lt;/code&gt; types by value, since each word-size chunk of the return value is returned in its own register, and these chunks have to be re-assembled later. In the future, I will be able to use this support to add instructions such as the x86 division instruction, which computes &lt;code&gt;x / y&lt;/code&gt; and &lt;code&gt;x mod y&lt;/code&gt; simultaneously.&lt;/p&gt; &lt;p&gt;I also had to change low-level IR to distinguish between instructions with side effects and those without. Previously, optimization passes would assume any instruction with an output value did not have side effects, and could be deleted if the output value was not used. This is no longer true for C calls; a C function might both have a side effect and return a value.&lt;/p&gt; &lt;h3&gt;GC maps&lt;/h3&gt; &lt;p&gt;Now that FFI calls no longer force the optimizer to sync all live values to the data and retain stacks, it can happen that SSA values are live across an FFI call. These values get spilled to the call stack by the register allocator. Spilling to the call stack is cheaper than spilling to the data and retain stacks, because floating point values and integers do not need to be boxed, and since the spilling is done later in the optimization process, optimizations can proceed across the call site instead of being stumped by pushes and pops on either side.&lt;/p&gt; &lt;p&gt;However, since FFI calls can invoke callbacks, which in turn run Factor code, which can trigger a garbage collection, the garbage collector must be able to identify spill slots in call frames which contain tagged pointers.&lt;/p&gt; &lt;p&gt;The technique I'm using is to record "GC maps". 
This idea comes from a paper titled &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.71&amp;rep=rep1&amp;type=pdf"&gt;Compiler Support for Garbage Collection in a Statically Typed Language&lt;/a&gt; (Factor is dynamically typed, and the paper itself doesn't have anything specific to static typing in it, so I found the title a bit odd). The Java HotSpot VM uses the same technique. The basic idea is that for every call site, you record a bitmap indicating which spill slots contain tagged pointers. This information is then stored in a per-code-block map, where the keys are return addresses and the values are these bitmaps.&lt;/p&gt; &lt;p&gt;In the future, I intend to use GC maps at call sites of Factor words as well, instead of spilling temporary values to the retain stack; then I can eliminate the retain stack altogether, freeing up a register. After this is done the data stack will only be used to pass parameters between words, and not to store temporaries within a word. This will allow more values to be unboxed in more situations, and it will improve accuracy of compiler analyses.&lt;/p&gt; &lt;p&gt;In fact, getting GC maps worked out was my primary motivation for this FFI rewrite; the code cleanups and performance improvements were just gravy.&lt;/p&gt; &lt;h3&gt;Callback support improvements&lt;/h3&gt; &lt;p&gt;Another one of those things that only makes sense when you look at how Factor evolved is that the body of an FFI callback was compiled with the non-optimizing compiler, rather than the optimizing compiler. It used to be that only certain definitions could be optimized, because static stack effects were optional and there were many common idioms which did not have static stack effects. 
It has been more than a year since I undertook the engineering effort to make the compiler enforce &lt;a href="http://factor-language.blogspot.com/2009/03/better-static-safety-for-higher-order.html"&gt;static stack safety&lt;/a&gt;, and the implementation of callbacks was the last vestigial remnant from the bad old days.&lt;/p&gt; &lt;p&gt;This design made callbacks harder to debug than they should be; if you used up too many parameters, or forgot to push a return value, you'd be greeted with a runtime error instead of a compiler error. Now this has been fixed, and callbacks are fully compiled with the optimizing compiler.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3474927088354023566?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3474927088354023566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3474927088354023566' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3474927088354023566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3474927088354023566'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/07/overhauling-factors-c-library-interface.html' title='Overhauling Factor&apos;s C library interface'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-4942585095767393047</id><published>2010-07-02T00:53:00.002-04:00</published><updated>2010-07-02T00:57:54.288-04:00</updated><title type='text'>Factor talk in Boston, July 26th</title><content type='html'>I will be presenting Factor at the Boston Lisp Users' Group on July 26th, 2010. Details in &lt;a href="http://fare.livejournal.com/157280.html"&gt;François-René Rideau's announcement&lt;/a&gt;. I'm also going to be giving a quick talk at the &lt;a href="http://emerginglangs.com/"&gt;Emerging Languages Camp&lt;/a&gt; in Portland, July 22nd. Unfortunately registration for this camp is already full.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-4942585095767393047?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/4942585095767393047/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=4942585095767393047' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/4942585095767393047'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/4942585095767393047'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/07/factor-talk-in-boston-july-26th.html' title='Factor talk in Boston, July 26th'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-2177055053070299407</id><published>2010-05-29T04:46:00.019-04:00</published><updated>2010-05-31T15:06:35.879-04:00</updated><title type='text'>Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython</title><content type='html'>&lt;p&gt;Together with &lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg&lt;/a&gt; and &lt;a href="http://duriansoftware.com/joe"&gt;Joe Groff&lt;/a&gt;, I'm writing a paper about Factor for &lt;a href="http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=9566&amp;copyownerid=3264"&gt;DLS2010&lt;/a&gt;. We would appreciate feedback about the &lt;a href="http://useless-factor.blogspot.com/2010/05/paper-for-dls-2010.html"&gt;draft version of the paper&lt;/a&gt;. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and CPython. The performance comparison consists of some benchmarks from &lt;a href="http://shootout.alioth.debian.org/"&gt;The Computer Language Benchmarks Game&lt;/a&gt;. I'm posting the results here first, in case there's something really stupid here.&lt;/p&gt; &lt;h3&gt;Language implementations&lt;/h3&gt; &lt;p&gt;Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. 
All language implementations were built as 64-bit binaries and run on a 2.4 GHz Intel Core 2 Duo.&lt;/p&gt; &lt;h3&gt;Benchmark implementations&lt;/h3&gt; &lt;p&gt;Factor implementations of the benchmarks can be found in our source repository:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/binary-trees/binary-trees.factor;hb=HEAD"&gt;binary-trees&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/fasta/fasta.factor;hb=HEAD"&gt;fasta&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/knucleotide/knucleotide.factor;hb=HEAD"&gt;knucleotide&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/nbody-simd/nbody-simd.factor;hb=HEAD"&gt;nbody&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/regex-dna/regex-dna.factor;hb=HEAD"&gt;regex-dna&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/reverse-complement/reverse-complement.factor;hb=HEAD"&gt;reverse-complement&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/spectral-norm/spectral-norm.factor;hb=HEAD"&gt;spectral-norm&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Implementations for the other languages can be found at the language benchmark game &lt;a href="https://alioth.debian.org/scm/?group_id=30402"&gt;CVS repository&lt;/a&gt;:&lt;/p&gt;&lt;table border="3"&gt; &lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;LuaJIT&lt;/th&gt;&lt;th&gt;SBCL&lt;/th&gt;&lt;th&gt;V8&lt;/th&gt;&lt;th&gt;CPython&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;binary-trees          &lt;/td&gt;&lt;td&gt;&lt;a 
href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.lua-2.lua?view=markup&amp;root=shootout"&gt;binarytrees.lua-2.lua&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.sbcl?view=markup&amp;root=shootout"&gt;binarytrees.sbcl&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.javascript?view=markup&amp;root=shootout"&gt;binarytrees.javascript&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.python3-6.python3?view=markup&amp;root=shootout"&gt;binarytrees.python3-6.python3&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;fasta&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.lua?view=markup&amp;root=shootout"&gt;fasta.lua&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.sbcl?view=markup&amp;root=shootout"&gt;fasta.sbcl&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.javascript-2.javascript?view=markup&amp;root=shootout"&gt;fasta.javascript-2.javascript&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.python3-2.python3?view=markup&amp;root=shootout"&gt;fasta.python3-2.python3&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;knucleotide&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.lua-2.lua?view=markup&amp;root=shootout"&gt;knucleotide.lua-2.lua&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a 
href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.sbcl-3.sbcl?view=markup&amp;root=shootout"&gt;knucleotide.sbcl-3.sbcl&lt;/a&gt;   &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.javascript-3.javascript?view=markup&amp;root=shootout"&gt;knucleotide.javascript-3.javascript&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.python3-4.python3?view=markup&amp;root=shootout"&gt;knucleotide.python3-4.python3&lt;/a&gt;   &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;nbody                 &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.lua-2.lua?view=markup&amp;root=shootout"&gt;nbody.lua-2.lua&lt;/a&gt;                  &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.sbcl?view=markup&amp;root=shootout"&gt;nbody.sbcl&lt;/a&gt;                                   &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.javascript?view=markup&amp;root=shootout"&gt;nbody.javascript&lt;/a&gt;                                            &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.python3-4.python3 ?view=markup&amp;root=shootout"&gt;nbody.python3-4.python3&lt;/a&gt;                    &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;regex-dna             &lt;/td&gt;&lt;td&gt;                                                                                                                                                       &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.sbcl-3.sbcl?view=markup&amp;root=shootout"&gt;regexdna.sbcl-3.sbcl&lt;/a&gt;            &lt;/td&gt;&lt;td&gt;&lt;a 
href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.javascript?view=markup&amp;root=shootout"&gt;regexdna.javascript&lt;/a&gt;                                   &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.python3 ?view=markup&amp;root=shootout"&gt;regexdna.python3&lt;/a&gt;                               &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;reverse-complement    &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.lua?view=markup&amp;root=shootout"&gt;revcomp.lua&lt;/a&gt;                        &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.sbcl?view=markup&amp;root=shootout"&gt;revcomp.sbcl&lt;/a&gt;                             &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.javascript-2.javascript?view=markup&amp;root=shootout"&gt;revcomp.javascript-2.javascript&lt;/a&gt;            &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.python3-4.python3 ?view=markup&amp;root=shootout"&gt;revcomp.python3-4.python3&lt;/a&gt;              &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;spectral-norm         &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.lua?view=markup&amp;root=shootout"&gt;spectralnorm.lua&lt;/a&gt;         &lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.sbcl-3.sbcl?view=markup&amp;root=shootout"&gt;spectralnorm.sbcl-3.sbcl&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.javascript?view=markup&amp;root=shootout"&gt;spectralnorm.javascript&lt;/a&gt;                       &lt;/td&gt;&lt;td&gt;&lt;a 
href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.python3-5.python3?view=markup&amp;root=shootout"&gt;spectralnorm.python3-5.python3&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;/table&gt;&lt;p&gt;In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:&lt;/p&gt; &lt;pre&gt;--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4&lt;br /&gt;+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000&lt;br /&gt;@@ -26,8 +26,7 @@&lt;br /&gt; &lt;br /&gt; (defun main ()&lt;br /&gt;   (declare (optimize (speed 3) (safety 0)))&lt;br /&gt;-  (with-open-file (in "/dev/stdin" :element-type +ub+)&lt;br /&gt;-    (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)&lt;br /&gt;+  (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))&lt;br /&gt;       (let ((i-buf (make-array +buffer-size+ :element-type +ub+))&lt;br /&gt;             (o-buf (make-array +buffer-size+ :element-type +ub+))&lt;br /&gt;             (chunks nil))&lt;br /&gt;@@ -72,4 +71,4 @@&lt;br /&gt;                         (setf start 0)&lt;br /&gt;                         (go read-chunk))))&lt;br /&gt;            end-of-input&lt;br /&gt;-             (flush-chunks)))))))&lt;br /&gt;+             (flush-chunks))))))&lt;br /&gt;&lt;/pre&gt; &lt;h3&gt;Running the benchmarks&lt;/h3&gt; &lt;p&gt;I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:&lt;/P&gt; &lt;pre&gt;./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'&lt;br /&gt;time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd&lt;/pre&gt; &lt;p&gt;For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:&lt;/p&gt; &lt;pre&gt;time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000&lt;br /&gt;time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000&lt;/pre&gt; &lt;p&gt;For 
SBCL, I did what the shootout does, and compiled each file into a new core:&lt;/p&gt; &lt;pre&gt;ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .&lt;br /&gt;&lt;br /&gt;cat &gt; nbody.sbcl_compile &amp;lt;&amp;lt;EOF&lt;br /&gt;(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))&lt;br /&gt;(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))&lt;br /&gt;  (load (compile-file "nbody.sbcl" )))&lt;br /&gt;(save-lisp-and-die "nbody.core" :purify t)&lt;br /&gt;EOF&lt;br /&gt;&lt;br /&gt;sbcl --userinit /dev/null --load nbody.sbcl_compile&lt;br /&gt;&lt;br /&gt;cat &gt; nbody.sbcl_run &amp;lt;&amp;lt;EOF&lt;br /&gt;(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))&lt;br /&gt;(main) (quit)&lt;br /&gt;EOF&lt;br /&gt;&lt;br /&gt;time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000&lt;/pre&gt;&lt;p&gt;For CPython, I precompiled each script into bytecode first:&lt;/p&gt;&lt;pre&gt;python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"&lt;/pre&gt; &lt;h3&gt;Benchmark results&lt;/h3&gt;&lt;p&gt;All running times are wall clock time from the Unix &lt;code&gt;time&lt;/code&gt; command. 
I ran each benchmark 5 times and used the best result.&lt;/p&gt;&lt;table border='3'&gt; &lt;tr&gt;&lt;td&gt;                  &lt;/td&gt;&lt;th&gt;Factor&lt;/th&gt;&lt;th&gt;LuaJIT&lt;/th&gt;&lt;th&gt;SBCL  &lt;/th&gt;&lt;th&gt;V8     &lt;/th&gt;&lt;th&gt;CPython   &lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;fasta             &lt;/td&gt;&lt;td&gt;2.597s&lt;/td&gt;&lt;td&gt;1.689s&lt;/td&gt;&lt;td&gt;2.105s&lt;/td&gt;&lt;td&gt;3.948s &lt;/td&gt;&lt;td&gt;35.234s  &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;reverse-complement&lt;/td&gt;&lt;td&gt;2.377s&lt;/td&gt;&lt;td&gt;1.764s&lt;/td&gt;&lt;td&gt;2.955s&lt;/td&gt;&lt;td&gt;3.884s &lt;/td&gt;&lt;td&gt;1.669s   &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;nbody             &lt;/td&gt;&lt;td&gt;0.393s&lt;/td&gt;&lt;td&gt;0.604s&lt;/td&gt;&lt;td&gt;0.402s&lt;/td&gt;&lt;td&gt;4.569s &lt;/td&gt;&lt;td&gt;37.086s  &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;binary-trees      &lt;/td&gt;&lt;td&gt;1.764s&lt;/td&gt;&lt;td&gt;6.295s&lt;/td&gt;&lt;td&gt;1.349s&lt;/td&gt;&lt;td&gt;2.119s &lt;/td&gt;&lt;td&gt;19.886s  &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;spectral-norm     &lt;/td&gt;&lt;td&gt;1.377s&lt;/td&gt;&lt;td&gt;1.358s&lt;/td&gt;&lt;td&gt;2.229s&lt;/td&gt;&lt;td&gt;12.227s&lt;/td&gt;&lt;td&gt;1m44.675s&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;regex-dna         &lt;/td&gt;&lt;td&gt;0.990s&lt;/td&gt;&lt;td&gt;N/A   &lt;/td&gt;&lt;td&gt;0.973s&lt;/td&gt;&lt;td&gt;0.166s &lt;/td&gt;&lt;td&gt;0.874s   &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;knucleotide       &lt;/td&gt;&lt;td&gt;1.820s&lt;/td&gt;&lt;td&gt;0.573s&lt;/td&gt;&lt;td&gt;0.766s&lt;/td&gt;&lt;td&gt;1.876s &lt;/td&gt;&lt;td&gt;1.805s   &lt;/td&gt;&lt;/tr&gt; &lt;/table&gt;&lt;br /&gt;&lt;h3&gt;Benchmark analysis&lt;/h3&gt; &lt;p&gt;Some notes on the results:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;There is no Lua implementation of the regex-dna benchmark.&lt;/li&gt;&lt;li&gt;Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled 
with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.&lt;/li&gt;&lt;li&gt;Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.&lt;/li&gt; &lt;li&gt;The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.&lt;/li&gt; &lt;li&gt;The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.&lt;/li&gt; &lt;li&gt;Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.&lt;/li&gt; &lt;li&gt;For some reason SBCL is slower than Factor and LuaJIT on spectral-norm. It should be generating the same code.&lt;/li&gt; &lt;li&gt;The benchmarks exercise too few language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout consist mostly of scalar floating point code and text manipulation.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Conclusions&lt;/h3&gt;&lt;p&gt;Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. 
Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2177055053070299407?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2177055053070299407/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2177055053070299407' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2177055053070299407'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2177055053070299407'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/05/comparing-factors-performance-against.html' title='Comparing Factor&apos;s performance against V8, LuaJIT, SBCL, and CPython'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5043552812433752925</id><published>2010-05-10T18:23:00.003-04:00</published><updated>2010-05-10T18:27:55.947-04:00</updated><title type='text'>A collection of small compiler improvements</title><content type='html'>&lt;p&gt;One of the big optimizations I'm planning on implementing in Factor is support for computations on machine-word-sized integers. The motivation is as follows. 
While code that operates on small integers (fixnums) does not currently allocate memory for intermediate values, and as a result can compile very efficiently if type checks are eliminated, sometimes fixnum precision is not quite enough. Using bignums in algorithms such as SHA1 that require 32-bit or 64-bit arithmetic incurs a big performance hit over writing the code in a &lt;a href="http://bitbucket.org/kssreeram/clay"&gt;systems language&lt;/a&gt;. My plan is to have the compiler support machine integers as an intermediate step between fixnums and bignums. Machine integers would be boxed and unboxed at call boundaries, but tight loops operating on them will run in registers.&lt;/p&gt; &lt;p&gt;While I haven't yet implemented the above optimization, I've laid the groundwork and made it possible to represent operations on untagged integers at the level of compiler IR at least. While unboxed floats do not cause any complications for the GC, unboxed integers do, because they share a register file with tagged pointers. To make Factor's precise GC work, the compiler can now distinguish registers containing tagged pointers from those containing arbitrary integer data, by breaking down the register class of integer registers into two "representations", tagged and untagged.&lt;/p&gt; &lt;p&gt;Since the above reworking touched many different parts of the compiler, I also took the time to implement some minor optimizations and clean up some older code. These improvements will be the topic of this post.&lt;/p&gt; &lt;p&gt;To understand this post better, you should probably skim the list of low-level IR instructions defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD"&gt;compiler.cfg.instructions&lt;/a&gt; first. 
Also, feel free to post comments here or e-mail me questions about the design of the Factor compiler.&lt;/p&gt; &lt;h3&gt;Improvements to value numbering&lt;/h3&gt; &lt;p&gt;In the x86 architecture, almost any instruction can take a memory operand, and memory operands take the form &lt;code&gt;(base + displacement * scale + offset)&lt;/code&gt;, where &lt;code&gt;base&lt;/code&gt; and &lt;code&gt;displacement&lt;/code&gt; are registers, &lt;code&gt;offset&lt;/code&gt; is a 32-bit signed integer, and &lt;code&gt;scale&lt;/code&gt; is 1, 2, 4 or 8. (Intel's manuals call the second register the index and the constant the displacement.) While using a complex addressing mode is not any faster than writing out the arithmetic by hand with a temporary register used to store the final address, it can reduce register pressure and code size.&lt;/p&gt; &lt;p&gt;The compiler now makes better use of complex addressing modes. Prior to optimization, the low level IR builder lowers specialized array access to the &lt;code&gt;##load-memory-imm&lt;/code&gt; and &lt;code&gt;##store-memory-imm&lt;/code&gt; instructions. For example, consider the following code:&lt;/p&gt; &lt;pre&gt;USING: alien.c-types specialized-arrays ;&lt;br /&gt;SPECIALIZED-ARRAY: float&lt;br /&gt;&lt;br /&gt;: make-data ( -- seq ) 10 iota [ 3.5 + sqrt ] float-array{ } map-as ;&lt;/pre&gt; &lt;p&gt;The inner loop consists of the following low level IR:&lt;/p&gt; &lt;pre&gt;##integer&gt;float 23 14&lt;br /&gt;##sqrt 24 23&lt;br /&gt;##load-integer 25 2&lt;br /&gt;##slot-imm 26 15 2 7&lt;br /&gt;##load-integer 27 4&lt;br /&gt;##mul 28 14 27&lt;br /&gt;##tagged&gt;integer 29 26&lt;br /&gt;##add-imm 30 29 7&lt;br /&gt;##add 31 30 28&lt;br /&gt;##store-memory-imm 24 31 0 float-rep f&lt;br /&gt;##load-integer 32 1&lt;br /&gt;##add 33 14 32&lt;/pre&gt; &lt;p&gt;The &lt;code&gt;##mul&lt;/code&gt;, &lt;code&gt;##add-imm&lt;/code&gt; and &lt;code&gt;##add&lt;/code&gt; instructions compute the address input to &lt;code&gt;##store-memory-imm&lt;/code&gt;. 
After value numbering, these three instructions have been fused with &lt;code&gt;##store-memory-imm&lt;/code&gt; to form &lt;code&gt;##store-memory&lt;/code&gt;:&lt;/p&gt; &lt;pre&gt;##integer&gt;float 62 57&lt;br /&gt;##sqrt 63 62&lt;br /&gt;##slot-imm 65 48 2 7&lt;br /&gt;##tagged&gt;integer 68 65&lt;br /&gt;##store-memory 63 68 57 2 7 float-rep f&lt;br /&gt;##add-imm 72 57 1&lt;/pre&gt; &lt;h3&gt;Optimistic copy propagation&lt;/h3&gt; &lt;p&gt;While the low-level optimizer's value numbering and alias analysis passes only operate on individual basic blocks (&lt;i&gt;local&lt;/i&gt; optimizations), the copy propagation pass is &lt;i&gt;global&lt;/i&gt;, meaning it takes the entire procedure being compiled into account.&lt;/p&gt; &lt;p&gt;Eventually I will merge the local alias analysis and value numbering passes with the copy propagation pass to get an optimization known as "global value numbering with redundant load elimination". One step along this road is a rewrite that makes the copy propagation pass &lt;i&gt;optimistic&lt;/i&gt;, in the following sense.&lt;/p&gt; &lt;p&gt;Copy propagation concerns itself with eliminating &lt;code&gt;##copy&lt;/code&gt; instructions, which correspond to assignments from one SSA value to another (new) value:&lt;/p&gt; &lt;pre&gt;y = x&lt;/pre&gt; &lt;p&gt;When such an instruction is processed, all usages of the value &lt;code&gt;y&lt;/code&gt; can be replaced with the value &lt;code&gt;x&lt;/code&gt;, in every basic block, and the definition of &lt;code&gt;y&lt;/code&gt; deleted.&lt;/p&gt; &lt;p&gt;A simple extension of the above treats a &lt;code&gt;##phi&lt;/code&gt; instruction whose inputs are all equal as a copy. 
This optimizes cases such as:&lt;/p&gt; &lt;pre&gt;a = ...&lt;br /&gt;if(...)&lt;br /&gt;{&lt;br /&gt;    b = a&lt;br /&gt;}&lt;br /&gt;c = phi(b,a)&lt;/pre&gt; &lt;p&gt;The case of &lt;code&gt;##phi&lt;/code&gt; instructions where one of the inputs is carried by a back edge in the control flow graph (i.e., a loop) is where the optimistic assumption comes in. Consider the following code, where the definition of &lt;code&gt;x''&lt;/code&gt; is carried by a back edge:&lt;/p&gt; &lt;pre&gt;x' = 0&lt;br /&gt;&lt;br /&gt;while(...)&lt;br /&gt;{&lt;br /&gt;    x = phi(x',x'')&lt;br /&gt;    x'' = ...&lt;br /&gt;}&lt;/pre&gt; &lt;p&gt;Optimistic copy propagation is able to eliminate the phi node entirely. Switching from pessimistic to optimistic copy propagation eliminated about 10% of copy instructions. While this is not a great amount, it did reduce spilling in a few subroutines I looked at, and there is no increase in code complexity from this new approach.&lt;/p&gt; &lt;h3&gt;Representation selection improvements&lt;/h3&gt; &lt;p&gt;The low-level IR now includes three new instructions, &lt;code&gt;##load-double&lt;/code&gt;, &lt;code&gt;##load-float&lt;/code&gt; and &lt;code&gt;##load-vector&lt;/code&gt;. These are intended for loading unboxed floating point values and SIMD vectors into vector registers without using an integer register to temporarily hold a pointer to the boxed value.&lt;/p&gt; &lt;p&gt;Uses of these instructions are inserted by the representation selection pass. If the destination of a &lt;code&gt;##load-reference&lt;/code&gt; is used in an unboxed representation, then instead of inserting a memory load following the &lt;code&gt;##load-reference&lt;/code&gt;, the instruction is replaced with one of the new variants.&lt;/p&gt; &lt;p&gt;Loads from constant addresses directly into floating point and vector registers are only possible on x86, and not PowerPC, so the optimization is not performed on the latter architecture. 
On x86-32, an immediate displacement can encode any 32-bit address. On x86-64, loading from an absolute 64-bit address requires an integer register; however, instruction pointer-relative addressing is supported with a 32-bit displacement. To make use of this capability, the compiler now supports a new "binary literal area" for unboxed data that compiled code can reference. The unboxed data is placed immediately following a word's machine code, allowing RIP-relative addressing to be used on x86-64.&lt;/p&gt; &lt;p&gt;This optimization, together with value numbering improvements, helps reduce pressure on integer registers in floating point loops.&lt;/p&gt; &lt;h3&gt;Register allocator improvements&lt;/h3&gt; &lt;p&gt;Once representation selection has run, each SSA value has an associated representation and register class. Each representation always belongs to one specific register class; a register class is a finite set of registers from which the register allocator is allowed to choose a register for this type of value.&lt;/p&gt; &lt;p&gt;Before register allocation, the SSA destruction pass transforms &lt;code&gt;##phi&lt;/code&gt; instructions into &lt;code&gt;##copy&lt;/code&gt;s, and then subsequently performs live range coalescing to eliminate as many of the copies as possible.&lt;/p&gt; &lt;p&gt;Coalescing two SSA values from different register classes does not make sense; the compiler will not be able to generate valid machine code if a single SSA value is used as both a float and an integer, since on most CPUs the integer and floating point register files are distinct.&lt;/p&gt; &lt;p&gt;However, coalescing values with different representations, but the same register class, is okay. Consider the case where a double-precision float is computed, and then converted into a single-precision float, and subsequently stored into memory. 
If the double-precision value is not used after the conversion, then the same register that held the input value can be reused to store the result of the single-precision conversion.&lt;/p&gt; &lt;p&gt;Previously this pass only coalesced values with identical representations; now I've generalized it to coalescing values with the same register class but possibly different representations. This reduces register pressure and spilling.&lt;/p&gt; &lt;p&gt;The tricky part about making it work is that the register allocator needs to know the value's representation, not just its register class, for generating correct spills and reloads. For example, a spilled value with register class &lt;code&gt;float-regs&lt;/code&gt; can use anywhere from 4 to 16 bytes in the stack frame, depending on the specific representation: single precision float, double precision float, or SIMD vector. Clearly, if coalescing simply lost this fine-grained representation information and only retained register classes, the register allocator would not have enough information to generate valid code.&lt;/p&gt; &lt;p&gt;The solution is to have live range coalescing compute the equivalence classes of SSA values without actually renaming any usages to point to the canonical representative. The renaming map is made available to the register allocator. Most places in the register allocator where instruction operands are examined make use of the renaming map.&lt;/p&gt; &lt;p&gt;Until the splitting phase, these equivalence classes are in one-to-one correspondence with live intervals, and each live interval has a single machine register and spill slot. 
However, when spill and reload decisions are made, the register allocator uses the original SSA values to look up representations.&lt;/p&gt; &lt;p&gt;If Factor's compiler did not use SSA form at all, there would still be a copy coalescing pass, and the register allocator could also support coalescing values with different representations, by first performing a dataflow analysis known as &lt;a href="http://en.wikipedia.org/wiki/Reaching_definition"&gt;reaching definitions&lt;/a&gt;. This would propagate representation information to use sites.&lt;/p&gt; &lt;p&gt;Retaining SSA structure all the way until register allocation gives you the results of this reaching definitions analysis "for free". In both the SSA and non-SSA case, you still don't want definitions with multiple different representations reaching the same use site. The SSA equivalent of this would be a phi instruction whose inputs had different representations. The compiler ensures this doesn't happen by first splitting up the set of SSA values into strongly-connected components (SCCs) whose edges are phi nodes. The same representation is then assigned to each member of every SCC; if an agreeable representation is not found then it falls back on boxing and tagging all members.&lt;/p&gt; &lt;p&gt;Notice that while both representation selection and SSA destruction group values into equivalence classes, the groupings do not correspond to each other and neither one is a refinement of the other. Not all copies resulting from phi nodes get coalesced away, so a single SCC may intersect multiple coalescing classes. 
This can happen if the first input to a phi node is live at the definition point of the second:&lt;/p&gt; &lt;pre&gt;b = ...&lt;br /&gt;&lt;br /&gt;if(...)&lt;br /&gt;{&lt;br /&gt;    a = ...&lt;br /&gt;    foo(b)&lt;br /&gt;}&lt;br /&gt;/* 'c' can be coalesced with 'a' or 'b' -- but not both.&lt;br /&gt;All three values belong to the same SCC and must have the same&lt;br /&gt;representation */&lt;br /&gt;c = phi(a,b)&lt;/pre&gt; &lt;p&gt;On the other hand, two values from different SCCs might also get coalesced together:&lt;/p&gt; &lt;pre&gt;a = some-double-float&lt;br /&gt;/* if 'a' is not live after this instruction, then 'a' and 'b'&lt;br /&gt;can share a register. They belong to different SCCs and have&lt;br /&gt;different representations */&lt;br /&gt;b = double-to-single-precision-float(a)&lt;/pre&gt; &lt;h3&gt;Compiler code cleanups&lt;/h3&gt; &lt;p&gt;In addition to adding new optimizations, I've also simplified and refactored the internals of some of the passes I worked on.&lt;/p&gt; &lt;p&gt;One big change is that the "machine representation" used at the very end of the compilation process is gone. Previously, after register allocation had run, the control flow graph would be flattened into a list of instructions with CFG edges replaced by labels and jumps. The code generator would then iterate over this flat representation and emit machine code for each virtual instruction. However, since no optimization passes ran on the flat representation, it was generated and then immediately consumed. Combining the linearization pass with the code generation pass let me eliminate about a dozen IR instructions that were specific to the flat representation (anything dealing with labels and jumps to labels). It also reduced compilation time slightly.&lt;/p&gt; &lt;p&gt;Value numbering's common subexpression elimination now uses a simpler and more efficient scheme for hashed lookup of expressions. 
New algebraic simplifications are also easier to add now, because the def-use graph is easier to traverse.&lt;/p&gt; &lt;p&gt;Some instructions which could be expressed in terms of others were removed, namely the &lt;code&gt;##string-nth&lt;/code&gt; and &lt;code&gt;##set-string-nth-fast&lt;/code&gt; instructions. Reading and writing characters from strings can be done by combining simpler memory operations already present in the IR.&lt;/p&gt; &lt;p&gt;A number of instructions were combined. All of the instructions for reading from memory in different formats, &lt;code&gt;##alien-unsigned-1&lt;/code&gt;, &lt;code&gt;##alien-signed-4&lt;/code&gt;, &lt;code&gt;##alien-double&lt;/code&gt;, have been merged into a single &lt;code&gt;##load-memory&lt;/code&gt; instruction which has constant operands storing the C type and representation of the data being loaded. Similarly, &lt;code&gt;##set-alien-signed-8&lt;/code&gt;, etc., have all been merged into &lt;code&gt;##store-memory&lt;/code&gt;. This restructuring allows optimization passes to more easily treat all memory accesses in a uniform way.&lt;/p&gt; &lt;h3&gt;A bug fix for alias analysis&lt;/h3&gt; &lt;p&gt;While working on the above I found an interesting bug in alias analysis. It would incorrectly eliminate loads and stores in certain cases. The optimizer assumed that a pointer to a freshly-allocated object could never alias a pointer read from the heap. However, the problem, of course, is that such a pointer could be stored in another object, and then read back in.&lt;/p&gt; &lt;p&gt;Here is a Factor example that demonstrates the problem. 
First, a couple of tuple definitions; the type declarations and initial value are important further on.&lt;/p&gt; &lt;pre&gt;USING: typed locals ;&lt;br /&gt;&lt;br /&gt;TUPLE: inner { value initial: 5 } ;&lt;br /&gt;TUPLE: outer { value inner } ;&lt;/pre&gt; &lt;p&gt;Now, a very contrived word which takes two instances of &lt;code&gt;outer&lt;/code&gt;, mutates one, and reads a value out of the other:&lt;/p&gt; &lt;pre&gt;TYPED: testcase ( x: outer y: outer -- )&lt;br /&gt;    inner new :&gt; z               ! Create a new instance of 'inner'&lt;br /&gt;    z x value&amp;lt;&amp;lt;                  ! Store 'z' in 'x'&lt;br /&gt;    y value&gt;&gt; :&gt; new-z           ! Load 'new-z' from 'y'&lt;br /&gt;    10 new-z value&amp;lt;&amp;lt;             ! Store a value inside 'new-z'&lt;br /&gt;    z value&gt;&gt;                    ! Read a value out of 'z'&lt;br /&gt;    ;&lt;/pre&gt; &lt;p&gt;Note that the initial value of the &lt;code&gt;value&lt;/code&gt; slot in the &lt;code&gt;inner&lt;/code&gt; tuple is 5, so &lt;code&gt;inner new&lt;/code&gt; creates a new instance holding the value 5. In the case where &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; point at the same object, &lt;code&gt;new-z&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; will point to the same object, and the newly-allocated object's value is set to 10. However, the compiler did not realize this, and erroneously constant-folded &lt;code&gt;z value&gt;&gt;&lt;/code&gt; down to 5!&lt;/p&gt; &lt;p&gt;This bug never came up in actual code; the facepalm moment came while I was doing some code refactoring and noticed that the above case was not being handled.&lt;/p&gt; &lt;h3&gt;Instruction scheduling to reduce register pressure&lt;/h3&gt; &lt;p&gt;I'm not the only one who's been busy improving the Factor compiler. 
One of the &lt;a href="http://useless-factor.blogspot.com"&gt;co-inventors of Factor&lt;/a&gt; has cooked up an &lt;a href="http://useless-factor.blogspot.com/2010/02/instruction-scheduling-for-register.html"&gt;instruction scheduler&lt;/a&gt; and &lt;a href="http://useless-factor.blogspot.com/2010/04/guarded-method-inlining-for-factor.html"&gt;improved method inlining&lt;/a&gt;. The scheduler has been merged, and it looks like the method inlining improvements should be ready in a few days.&lt;/p&gt; &lt;h3&gt;Some benchmarks&lt;/h3&gt; &lt;p&gt;Here is a comparison between the April 17th build of Factor and the latest code from the Git repository. I ran a few of the &lt;a href="http://shootout.alioth.debian.org/"&gt;language shootout benchmarks&lt;/a&gt; on a 2.4 GHz MacBook Pro. Factor was built in 32-bit mode, because the new optimizations are most effective on the register-starved x86.&lt;/p&gt; &lt;table&gt; &lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Before (ms)&lt;/th&gt;&lt;th&gt;After (ms)&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;nbody-simd&lt;/td&gt;&lt;td&gt;415&lt;/td&gt;&lt;td&gt;349&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;fasta&lt;/td&gt;&lt;td&gt;2667&lt;/td&gt;&lt;td&gt;2485&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;knucleotide&lt;/td&gt;&lt;td&gt;182&lt;/td&gt;&lt;td&gt;177&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;regex-dna&lt;/td&gt;&lt;td&gt;109&lt;/td&gt;&lt;td&gt;86&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;spectral-norm&lt;/td&gt;&lt;td&gt;1641&lt;/td&gt;&lt;td&gt;1347&lt;/td&gt;&lt;/tr&gt; &lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5043552812433752925?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5043552812433752925/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5043552812433752925' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5043552812433752925'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5043552812433752925'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/05/collection-of-small-compiler.html' title='A collection of small compiler improvements'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5740682451340469020</id><published>2010-04-16T18:24:00.004-04:00</published><updated>2010-04-17T01:01:56.876-04:00</updated><title type='text'>Factor 0.93 now available</title><content type='html'>&lt;p&gt;After two months of development, Factor 0.93 is now available for download from the &lt;a href="http://factorcode.org"&gt;Factor website&lt;/a&gt;. A big thanks to all the contributors, testers and users. This release would not be possible without your help. A summary of the most important changes follows:&lt;/p&gt;&lt;h3&gt;Incompatible changes:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;Factor no longer supports NetBSD, due to limitations in that operating system. 
(Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-sets.html"&gt;sets&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/article-hash-sets.html"&gt;hash-sets&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/article-bit-sets.html"&gt;bit-sets&lt;/a&gt;: these vocabularies have been redesigned around a new &lt;a href="http://useless-factor.blogspot.com/2010/03/protocol-for-sets.html"&gt;generic protocol for sets&lt;/a&gt; (Daniel Ehrenberg)&lt;/li&gt; &lt;li&gt;Strings can no longer be used to refer to C types in FFI calls; you must use C type words instead. Also, ABI specifiers are now symbols rather than strings. So for example, the following code will no longer work: &lt;pre&gt;: my-callback ( -- callback ) "int" { "int" "int" } "cdecl" [ + ] alien-callback ;&lt;/pre&gt; you must now write: &lt;pre&gt;: my-callback ( -- callback ) int { int int } cdecl [ + ] alien-callback ;&lt;/pre&gt; (Joe Groff)&lt;/li&gt; &lt;li&gt;The behavior of string types has changed. The &lt;code&gt;char*&lt;/code&gt; C type is now just a bare pointer; to get the automatic conversion to and from Factor strings, use the &lt;code&gt;c-string&lt;/code&gt; type. See &lt;a href="http://docs.factorcode.org/content/article-c-strings.html"&gt;the documentation&lt;/a&gt; for details. (Joe Groff)&lt;/li&gt; &lt;li&gt;FFI function return values which are pointers to structs are now boxed in a struct class, instead of returning a bare alien. This means that many &lt;code&gt;memory&gt;struct&lt;/code&gt; calls can be removed. If the return value is actually a pointer to an array of structs, use specialized arrays as before. (Joe Groff)&lt;/li&gt; &lt;li&gt;C-ENUM: now takes an additional parameter which is either the name of a new C type to define, or &lt;code&gt;f&lt;/code&gt;. This type is aliased to &lt;code&gt;int&lt;/code&gt;. 
(Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://duriansoftware.com/joe/Improving-Factor's-error-messages.html"&gt;The stack checker now supports row polymorphic stack effects&lt;/a&gt;, for improved error checking. See &lt;a href="http://docs.factorcode.org/content/article-effects-variables.html"&gt;the documentation&lt;/a&gt; for details. (Joe Groff)&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;New features:&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;Co-operative threads now use efficient context-switching primitives instead of copying stacks with continuations (Slava Pestov)&lt;/li&gt; &lt;li&gt;Added &lt;a href="http://docs.factorcode.org/content/word-final%2Csyntax.html"&gt;final class declarations&lt;/a&gt;, to prohibit classes from being subclassed. This enables &lt;a href="http://factor-language.blogspot.com/2010/02/final-classes-and-platform-specific.html"&gt;a few compiler optimizations&lt;/a&gt; (Slava Pestov)&lt;/li&gt; &lt;li&gt;Added &lt;a href="http://factor-language.blogspot.com/2010/02/final-classes-and-platform-specific.html"&gt;platform support metadata&lt;/a&gt; for vocabularies. Vocabulary directories can now contain a &lt;code&gt;platforms.txt&lt;/code&gt; file listing operating system names which they can be loaded under. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The deploy tool can now deploy &lt;a href="http://docs.factorcode.org/content/article-loading-libs.html"&gt;native libraries&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/article-deploy-resources.html"&gt;resource files&lt;/a&gt;, and supports &lt;a href="http://docs.factorcode.org/content/article-vocabs.icons.html"&gt;custom icons&lt;/a&gt;. 
(Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://useless-factor.blogspot.com/2010/03/expressing-joint-behavior-of-modules.html"&gt;Joint behavior of vocabularies&lt;/a&gt; - the new &lt;a href="http://docs.factorcode.org/content/word-require-when%2Cvocabs.loader.html"&gt;require-when&lt;/a&gt; word can be used to express that a vocabulary should be loaded if two existing vocabularies have been loaded (Daniel Ehrenberg)&lt;/li&gt; &lt;li&gt;Callstack overflow is now caught and reported on all platforms except for Windows and OpenBSD. (Slava Pestov)&lt;/li&gt; &lt;li&gt;The prettyprinter now has some limits set by default. This prevents out of memory errors when printing large structures, such as the global namespace. Use the &lt;a href="http://docs.factorcode.org/content/word-without-limits%2Cprettyprint.config.html"&gt;without-limits&lt;/a&gt; combinator to disable limits. (Slava Pestov)&lt;/li&gt; &lt;li&gt;Added &lt;a href="http://docs.factorcode.org/content/word-fastcall%2Calien.html"&gt;fastcall&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/word-thiscall%2Calien.html"&gt;thiscall&lt;/a&gt; ABIs on x86-32 (Joe Groff)&lt;/li&gt; &lt;li&gt;The build farm's version release process is now more automated (Slava Pestov)&lt;/li&gt; &lt;/ul&gt; Improved libraries: &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-delegate.html"&gt;delegate&lt;/a&gt;: add &lt;a href="http://docs.factorcode.org/content/word-BROADCAST__colon__%2Cdelegate.html"&gt;BROADCAST:&lt;/a&gt; syntax, which delegates a generic word with no outputs to an array of multiple delegates. (Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-game.input.html"&gt;game.input&lt;/a&gt;: X11 support. (William Schlieper)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-gpu.html"&gt;gpu&lt;/a&gt;: geometry shader support. 
(Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-opengl.gl.html"&gt;opengl&lt;/a&gt;: OpenGL 4.0 support. (Joe Groff)&lt;/li&gt; &lt;/ul&gt; New libraries: &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-bit.ly.html"&gt;bit.ly&lt;/a&gt;: Factor interface to the &lt;a href="http://bit.ly"&gt;bit.ly&lt;/a&gt; URL shortening service. (Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-chipmunk.html"&gt;chipmunk&lt;/a&gt;: binding for Chipmunk 2D physics library. (Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-cuda.html"&gt;cuda&lt;/a&gt;: binding to NVidia's CUDA API for GPU computing. (Doug Coleman, Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-cursors.html"&gt;cursors&lt;/a&gt;: experimental library for iterating over collections, inspired by STL iterators. (Joe Groff)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-elf.html"&gt;elf&lt;/a&gt;: parsing ELF binaries. The &lt;a href="http://docs.factorcode.org/content/vocab-elf.nm.html"&gt;elf.nm&lt;/a&gt; vocabulary is a demo which prints all symbols in an ELF binary. (Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-images.pbm.html"&gt;images.pbm&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-images.pgm.html"&gt;images.pgm&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-ppm.html"&gt;images.ppm&lt;/a&gt;: libraries for loading PBM, PGM and PPM images (Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-macho.html"&gt;macho&lt;/a&gt;: parsing Mach-O binaries. (Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-opencl.html"&gt;opencl&lt;/a&gt;: binding for the OpenCL standard interface for GPU computing. 
(Erik Charlebois)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-path-finding.html"&gt;path-finding&lt;/a&gt;: implementation of A* and BFS path finding algorithms. (Samuel Tardieu)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-windows.ddk.html"&gt;windows.ddk&lt;/a&gt;: binding for Windows hardware abstraction layer. (Erik Charlebois)&lt;/li&gt; &lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5740682451340469020?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5740682451340469020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5740682451340469020' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5740682451340469020'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5740682451340469020'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/04/factor-093-now-available.html' title='Factor 0.93 now available'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-7929495712930034237</id><published>2010-04-08T20:16:00.002-04:00</published><updated>2010-04-09T14:54:28.047-04:00</updated><title type='text'>Frame-based structured exception handling on Windows x86-64</title><content type='html'>&lt;p&gt;Factor used to use vectored exception handlers, registered with &lt;a 
href="http://msdn.microsoft.com/en-us/library/ms680588(VS.85).aspx"&gt;AddVectoredExceptionHandler&lt;/a&gt;; however, vectored handlers are somewhat problematic. A vectored handler is always called prior to any frame-based handlers, so Factor could end up reporting bogus exceptions if the FFI is used to call a library that uses SEH internally. This prompted me to switch to frame-based exception handlers. Unfortunately, these are considerably more complex to use, and the implementation differs between 32-bit and 64-bit Windows.&lt;/p&gt; &lt;p&gt;I briefly discussed frame-based SEH on 32-bit Windows in my &lt;a href="http://factor-language.blogspot.com/2010/04/switching-call-stacks-on-different.html"&gt;previous post&lt;/a&gt;. During the switch to 64 bits, Microsoft got rid of the old frame-based SEH implementation and introduced a new, lower-overhead approach. Instead of pushing and popping exception handlers onto a linked list at runtime, the system maintains a set of function tables, where each function table stores exception handling and stack frame unwinding information.&lt;/p&gt; &lt;p&gt;Normally, the 64-bit Windows function tables are written into the executable by the native compiler. However, language implementations which generate code at runtime need to be able to define new function tables dynamically. This is done with the &lt;a href="http://msdn.microsoft.com/en-us/library/ms680588(VS.85).aspx"&gt;RtlAddFunctionTable()&lt;/a&gt; function.&lt;/p&gt; &lt;p&gt;It took me a while to figure out the correct way to call this function. 
I found the &lt;a href="http://hg.openjdk.java.net/jdk7/jsn/hotspot/file/d9bc824aa078/src/os_cpu/windows_x86/vm/os_windows_x86.cpp"&gt;os_windows_x86.cpp&lt;/a&gt; source file from Sun's HotSpot Java implementation was very helpful, and I based my code on the &lt;code&gt;register_code_area()&lt;/code&gt; function from this file.&lt;/p&gt; &lt;p&gt;Factor and HotSpot only use function tables in a very simple manner, to set up an exception handler. Function tables can also be used to define stack unwinding behavior; this allows debuggers to generate backtraces, and so on. Doing that is more complicated and I don't understand how it works, so I won't attempt to discuss it here.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;RtlAddFunctionTable()&lt;/code&gt; function takes an array of &lt;code&gt;RUNTIME_FUNCTION&lt;/code&gt; structures and a base address. For some unknown reason, all pointers in the structures passed to this function are 32-bit integers relative to the base address.&lt;/p&gt; &lt;p&gt;For a runtime compiler that does not need to perform unwinding, it suffices to map the entire code heap to one &lt;code&gt;RUNTIME_FUNCTION&lt;/code&gt;. A &lt;code&gt;RUNTIME_FUNCTION&lt;/code&gt; has three fields:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;code&gt;BeginAddress&lt;/code&gt; - the start of the function&lt;/li&gt; &lt;li&gt;&lt;code&gt;EndAddress&lt;/code&gt; - the end of the function&lt;/li&gt; &lt;li&gt;&lt;code&gt;UnwindData&lt;/code&gt; - a pointer to unwind data&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;All pointers are relative to the base address passed into &lt;code&gt;RtlAddFunctionTable()&lt;/code&gt;. The unwind data can take various forms. 
For the simple case of no unwind codes and an exception handler, the following structure is used:&lt;/p&gt; &lt;pre&gt;struct UNWIND_INFO {&lt;br /&gt;    UBYTE Version:3;&lt;br /&gt;    UBYTE Flags:5;&lt;br /&gt;    UBYTE SizeOfProlog;&lt;br /&gt;    UBYTE CountOfCodes;&lt;br /&gt;    UBYTE FrameRegister:4;&lt;br /&gt;    UBYTE FrameOffset:4;&lt;br /&gt;    ULONG ExceptionHandler;&lt;br /&gt;    ULONG ExceptionData[1];&lt;br /&gt;};&lt;/pre&gt; &lt;p&gt;The &lt;code&gt;Version&lt;/code&gt; and &lt;code&gt;Flags&lt;/code&gt; fields should be set to 1, the &lt;code&gt;ExceptionHandler&lt;/code&gt; field set to a function pointer, and the rest of the fields set to 0. The exception handler pointer must be relative to the base address, and it must also be within the memory range specified by the &lt;code&gt;BeginAddress&lt;/code&gt; and &lt;code&gt;EndAddress&lt;/code&gt; fields of the &lt;code&gt;RUNTIME_FUNCTION&lt;/code&gt; structure. The exception handler has the same function signature as in the 32-bit SEH case:&lt;/p&gt; &lt;pre&gt;LONG exception_handler(PEXCEPTION_RECORD e, void *frame, PCONTEXT c, void *dispatch)&lt;/pre&gt; &lt;p&gt;In both Factor and HotSpot, the exception handler is a C function; however, the &lt;code&gt;RtlAddFunctionTable()&lt;/code&gt; API requires that it be within the bounds of the runtime code heap. To get around the restriction, both VMs allocate a small trampoline in the code heap which simply jumps to the exception handler, and use a pointer to the trampoline instead. 
Similarly, because the "pointers" in these structures are actually 32-bit integers, it helps to allocate the &lt;code&gt;RUNTIME_FUNCTION&lt;/code&gt; and &lt;code&gt;UNWIND_INFO&lt;/code&gt; in the code heap as well, to ensure that everything is within the same 4 GB memory range.&lt;/p&gt;&lt;p&gt;The above explanation probably didn't make much sense, so I invite you to check out the source code instead: &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/os-windows-nt-x86.64.cpp;hb=HEAD"&gt;os-windows-nt-x86.64.cpp&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-7929495712930034237?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/7929495712930034237/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=7929495712930034237' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7929495712930034237'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7929495712930034237'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/04/frame-based-structured-exception.html' title='Frame-based structured exception handling on Windows x86-64'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3795058133614123171</id><published>2010-04-04T16:24:00.005-04:00</published><updated>2010-04-04T16:35:09.363-04:00</updated><title type='text'>Switching call stacks on different platforms</title><content type='html'>&lt;p&gt;User-space implementations of coroutines and green threads need to be able to switch the CPU's call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I'll limit my discussion to CPUs and platforms that Factor supports.&lt;/p&gt; &lt;h3&gt;System APIs for switching call stacks&lt;/h3&gt; &lt;ul&gt; &lt;li&gt;Windows has an API for creating and switching between contexts, which it calls "fibers". The main functions to look at are &lt;a href="http://msdn.microsoft.com/en-us/library/ms682402(VS.85).aspx"&gt;CreateFiber()&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/en-us/library/ms686350(VS.85).aspx"&gt;SwitchToFiber()&lt;/a&gt;. On Windows, by far the easiest way to switch call stacks is to just use fibers.&lt;/li&gt; &lt;li&gt;Most Unix systems have a set of functions operating on &lt;code&gt;ucontext&lt;/code&gt;s, such as &lt;code&gt;makecontext()&lt;/code&gt;, &lt;code&gt;swapcontext()&lt;/code&gt; and so on. On some systems, these APIs are poorly implemented and documented.&lt;/li&gt; &lt;li&gt;The C standard library functions &lt;code&gt;setjmp()&lt;/code&gt; and &lt;code&gt;longjmp()&lt;/code&gt; can be (ab)used to switch contexts. The &lt;code&gt;jmp_buf&lt;/code&gt; structure stored to by &lt;code&gt;setjmp()&lt;/code&gt; and read from by &lt;code&gt;longjmp()&lt;/code&gt; contains a snapshot of all registers, including the call stack pointer. 
Once you've allocated your own call stack, you can capture a &lt;code&gt;jmp_buf&lt;/code&gt; with &lt;code&gt;setjmp()&lt;/code&gt;, change the call stack pointer, and then resume execution with &lt;code&gt;longjmp()&lt;/code&gt;. As far as I know, this is completely undocumented and unsupported with every major C compiler.&lt;/li&gt; &lt;li&gt;The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the &lt;code&gt;ESP&lt;/code&gt; register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;High-level libraries for switching call stacks&lt;/h3&gt; &lt;p&gt;A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://www.dekorte.com/projects/opensource/libcoroutine/"&gt;libcoroutine&lt;/a&gt; uses fibers on Windows, and ucontext and longjmp on Unix.&lt;/li&gt; &lt;li&gt;&lt;a href="http://software.schmorp.de/pkg/libcoro.html"&gt;libcoro&lt;/a&gt; uses a handful of Unix-specific APIs if they're available, or falls back onto some hand-coded assembly routines. The latter works on Windows.&lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;NetBSD/x86 limitation&lt;/h3&gt; &lt;p&gt;There is no reliable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of &lt;code&gt;libpthread&lt;/code&gt;. Even if your program doesn't use pthreads, it might link with a library that does, such as Glib. 
The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as &lt;code&gt;malloc()&lt;/code&gt;, almost nothing will work as a result.&lt;/p&gt; &lt;p&gt;The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread's unique ID can just be the result of masking the call stack pointer by the maximum call stack size.&lt;/p&gt; &lt;h3&gt;Windows-specific concerns&lt;/h3&gt; &lt;p&gt;So far, I've only worked out the details of switching call stacks on 32-bit Windows. I'll talk about 64-bit Windows in another post.&lt;/p&gt; &lt;p&gt;Windows has a mechanism known as &lt;a href="http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx"&gt;Structured Exception Handling&lt;/a&gt;, which is a sort of scoped exception handling mechanism with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.&lt;/p&gt; &lt;p&gt;On Windows, simply changing the &lt;code&gt;ESP&lt;/code&gt; register to point at another memory region is insufficient because structured exception handling must know the current call stack's beginning and end, as well as a pointer to the innermost exception handler record.&lt;/p&gt; &lt;p&gt;These three pieces of information are stored in a per-thread information block, known as the TIB. 
The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.&lt;/p&gt; &lt;h4&gt;The TIB&lt;/h4&gt; &lt;p&gt;The TIB is defined by the &lt;code&gt;NT_TIB&lt;/code&gt; struct in &lt;code&gt;winnt.h&lt;/code&gt;:&lt;/p&gt; &lt;pre&gt;typedef struct _NT_TIB {&lt;br /&gt;    struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;&lt;br /&gt;    PVOID StackBase;&lt;br /&gt;    PVOID StackLimit;&lt;br /&gt;    PVOID SubSystemTib;&lt;br /&gt;    _ANONYMOUS_UNION union {&lt;br /&gt;        PVOID FiberData;&lt;br /&gt;        DWORD Version;&lt;br /&gt;    } DUMMYUNIONNAME;&lt;br /&gt;    PVOID ArbitraryUserPointer;&lt;br /&gt;    struct _NT_TIB *Self;&lt;br /&gt;} NT_TIB,*PNT_TIB;&lt;/pre&gt; &lt;p&gt;The relevant fields that must be updated when switching call stacks are &lt;code&gt;ExceptionList&lt;/code&gt;, &lt;code&gt;StackBase&lt;/code&gt; and &lt;code&gt;StackLimit&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;Note that since the x86 call stack grows down, &lt;code&gt;StackLimit&lt;/code&gt; is the start of the call stack's memory region and &lt;code&gt;StackBase&lt;/code&gt; is the end.&lt;/p&gt; &lt;h4&gt;Structured exception handlers&lt;/h4&gt; &lt;p&gt;Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the &lt;code&gt;ExceptionList&lt;/code&gt; field of the TIB:&lt;/p&gt; &lt;pre&gt;struct _EXCEPTION_REGISTRATION_RECORD&lt;br /&gt;{&lt;br /&gt;     PEXCEPTION_REGISTRATION_RECORD Next;&lt;br /&gt;     PEXCEPTION_DISPOSITION Handler;&lt;br /&gt;} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;&lt;/pre&gt; &lt;p&gt;The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (&lt;code&gt;ExceptionList&lt;/code&gt; set to &lt;code&gt;NULL&lt;/code&gt;).&lt;/p&gt; &lt;p&gt;If the &lt;code&gt;ExceptionList&lt;/code&gt; field of the TIB or the &lt;code&gt;Next&lt;/code&gt; field of an exception record 
point outside the call stack, then the handler in question will not be called at all.&lt;/p&gt; &lt;h4&gt;Accessing the TIB&lt;/h4&gt; &lt;p&gt;On x86-32, the current thread's TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The &lt;code&gt;Self&lt;/code&gt; field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:&lt;/p&gt; &lt;pre&gt;fetch_tib:&lt;br /&gt;    mov eax,[fs:24]&lt;br /&gt;    ret&lt;/pre&gt; &lt;p&gt;This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.&lt;/p&gt;&lt;h4&gt;Sample code for updating Windows-specific structures&lt;/h4&gt;&lt;p&gt;The &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/winnt/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/32/winnt/bootstrap.factor&lt;/a&gt; file defines assembly subroutines for updating the TIB when switching call stacks. 
These routines are invoked by the x86-32 non-optimizing compiler backend:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/bootstrap.factor&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/32/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3795058133614123171?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3795058133614123171/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3795058133614123171' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3795058133614123171'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3795058133614123171'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/04/switching-call-stacks-on-different.html' title='Switching call stacks on different platforms'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-6584784544302650530</id><published>2010-03-23T04:29:00.004-04:00</published><updated>2010-03-23T04:41:46.088-04:00</updated><title type='text'>Incompatible change in Mach exception handler behavior on Mac OS X 
10.6</title><content type='html'>&lt;p&gt;On Mac OS X, Factor uses Mach exceptions rather than Unix signals to receive notifications of illegal memory accesses and arithmetic exceptions from the operating system. This is used to catch datastack underflows, division by zero, and the like. The code implementing this can be found in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/mach_signal.cpp;hb=HEAD"&gt;vm/mach_signal.c&lt;/a&gt; in the Factor repository. This file is based on code from the &lt;a href="http://libsigsegv.sourceforge.net/"&gt;GNU libsigsegv&lt;/a&gt; project, with &lt;a href="http://sourceforge.net/mailarchive/message.php?msg_name=200503102200.32002.bruno%40clisp.org"&gt;special permission&lt;/a&gt; to use it under a BSD license instead of the restrictive GPL.&lt;/p&gt;&lt;p&gt;It seems that as of Mac OS X 10.6, exceptions raised by child processes are now reported to the parent if the parent has an exception handler thread. This caused a problem in Factor if a child process crashed with an access violation; Factor would think the access violation occurred in Factor code, and die with an assertion failure when attempting to map the thread ID back into a Factor VM object. 
It seems the simplest way is to add the following clause to the &lt;code&gt;catch_exception_raise()&lt;/code&gt; function:&lt;/p&gt;&lt;pre&gt;if(task != mach_task_self()) return KERN_FAILURE;&lt;/pre&gt;&lt;p&gt;This fix should be applicable to the original &lt;code&gt;libsigsegv&lt;/code&gt; code as well, along with other projects, such as CLisp, that use this library.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-6584784544302650530?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/6584784544302650530/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=6584784544302650530' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/6584784544302650530'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/6584784544302650530'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/03/change-in-mach-exception-handler.html' title='Incompatible change in Mach exception handler behavior on Mac OS X 10.6'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5649626009483050415</id><published>2010-03-11T03:16:00.005-05:00</published><updated>2010-03-11T03:55:58.344-05:00</updated><title type='text'>Adding a recaptcha to the Factor pastebin</title><content type='html'>&lt;p&gt;Lately the &lt;a href="http://paste.factorcode.org"&gt;Factor 
pastebin&lt;/a&gt; has been flooded with penis enlargement spam. The pastebin had its own very weak captcha that worked well enough for a while: there was a form field that was required to be left blank, and if it was not blank validation would fail. However, spammers have started working around it so we needed something stronger. I decided to solve the problem once and for all by integrating &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt;'s &lt;a href="http://docs.factorcode.org/content/article-furnace.recaptcha.html"&gt;furnace.recaptcha&lt;/a&gt; vocabulary into the pastebin. Doing this was surprisingly easy.&lt;/p&gt;&lt;p&gt;First, I changed the &lt;code&gt;USING:&lt;/code&gt; form in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/pastebin.factor;hb=HEAD"&gt;webapps.pastebin&lt;/a&gt; to reference the &lt;code&gt;furnace.recaptcha&lt;/code&gt; vocabulary:&lt;/p&gt;&lt;pre&gt;USING: namespaces assocs sorting sequences kernel accessors&lt;br /&gt;hashtables db.types db.tuples db combinators&lt;br /&gt;calendar calendar.format math.parser math.order syndication urls&lt;br /&gt;xml.writer xmode.catalog validators&lt;br /&gt;html.forms&lt;br /&gt;html.components&lt;br /&gt;html.templates.chloe&lt;br /&gt;http.server&lt;br /&gt;http.server.dispatchers&lt;br /&gt;http.server.redirection&lt;br /&gt;http.server.responses&lt;br /&gt;furnace&lt;br /&gt;furnace.actions&lt;br /&gt;furnace.redirection&lt;br /&gt;furnace.auth&lt;br /&gt;furnace.auth.login&lt;br /&gt;furnace.boilerplate&lt;br /&gt;furnace.recaptcha&lt;br /&gt;furnace.syndication&lt;br /&gt;furnace.conversations ;&lt;/pre&gt;&lt;p&gt;Next, I changed the &lt;code&gt;validate-entity&lt;/code&gt; word. This word is used to validate a form submission when adding a new paste or a new annotation. 
Instead of validating a captcha using the old simple scheme, it now calls &lt;code&gt;validate-recaptcha&lt;/code&gt;:&lt;/p&gt;&lt;pre&gt;: validate-entity ( -- )&lt;br /&gt;    {&lt;br /&gt;        { "summary" [ v-one-line ] }&lt;br /&gt;        { "author" [ v-one-line ] }&lt;br /&gt;        { "mode" [ v-mode ] }&lt;br /&gt;        { "contents" [ v-required ] }&lt;br /&gt;    } validate-params&lt;br /&gt;    validate-recaptcha ;&lt;/pre&gt;&lt;p&gt;Finally, I edited the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/new-paste.xml;hb=HEAD"&gt;new-paste.xml&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/new-annotation.xml;hb=HEAD"&gt;new-annotation.xml&lt;/a&gt; templates to add a recaptcha component inside the &lt;code&gt;t:form&lt;/code&gt; tag:&lt;/p&gt;&lt;pre&gt;&amp;lt;tr&gt;&amp;lt;td colspan="2"&gt;&amp;lt;t:recaptcha /&gt;&amp;lt;/td&gt;&amp;lt;/tr&gt;&lt;/pre&gt;&lt;p&gt;The implementation of the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/furnace/recaptcha/recaptcha.factor;hb=HEAD"&gt;furnace.recaptcha&lt;/a&gt; vocabulary is very straightforward and takes advantage of several features of Factor's web framework. It uses &lt;a href="http://docs.factorcode.org/content/article-http.client.html"&gt;http.client&lt;/a&gt; to communicate with the recaptcha server, and &lt;a href="http://docs.factorcode.org/content/article-furnace.conversations.html"&gt;furnace.conversations&lt;/a&gt; to pass the validation error between requests. 
Finally, it uses the &lt;code&gt;CHLOE:&lt;/code&gt; parsing word to define the &lt;code&gt;t:recaptcha&lt;/code&gt; tag for use in templates:&lt;/p&gt;&lt;pre&gt;CHLOE: recaptcha drop [ render-recaptcha ] [xml-code] ;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5649626009483050415?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5649626009483050415/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5649626009483050415' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5649626009483050415'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5649626009483050415'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/03/adding-recaptcha-to-factor-pastebin.html' title='Adding a recaptcha to the Factor pastebin'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-646241296611545719</id><published>2010-02-24T02:35:00.006-05:00</published><updated>2010-02-24T03:27:29.644-05:00</updated><title type='text'>Final classes and platform-specific vocabularies</title><content type='html'>&lt;h3&gt;Final classes&lt;/h3&gt; &lt;p&gt;I added final classes, in the Java sense. Attempting to inherit from a final class raises an error. 
To declare a class as final, suffix the definition with "final", in the same manner that words can be declared "inline" or "foldable":&lt;/p&gt; &lt;pre&gt;TUPLE: point x y z ; final&lt;/pre&gt; &lt;p&gt;The motivation for final classes was not obvious. There are three main reasons I added this feature.&lt;/p&gt; &lt;p&gt;&lt;a href="http://factor-language.blogspot.com/2009/05/unboxed-tuple-arrays-in-factor.html"&gt;Unboxed tuple arrays&lt;/a&gt; used to have the caveat that if you store an instance of a subclass into a tuple array of a superclass, then the slots of the subclass would be "sliced off":&lt;/p&gt; &lt;pre&gt;TUPLE: point-2d x y ;&lt;br /&gt;&lt;br /&gt;TUPLE-ARRAY: point-2d&lt;br /&gt;&lt;br /&gt;TUPLE: point-3d &lt; point-2d z ;&lt;br /&gt;&lt;br /&gt;SYMBOL: my-array&lt;br /&gt;&lt;br /&gt;1 &lt;point-2d&gt; my-array set&lt;br /&gt;&lt;br /&gt;1 2 3 point-3d boa my-array get set-first&lt;br /&gt;&lt;br /&gt;my-array get first .&lt;br /&gt;=&gt; T{ point-2d { x 1 } { y 2 } }&lt;/pre&gt; &lt;p&gt;This warranted a paragraph in the documentation, and vigilance on the part of the programmer. Now, tuple arrays simply enforce that the element class is final, and if it is not, an error is raised. This removes a potential source of confusion; it is always nice when written warnings in the documentation can be replaced by language features.&lt;/p&gt; &lt;p&gt;Joe Groff's &lt;a href="http://docs.factorcode.org/content/article-typed.html"&gt;typed&lt;/a&gt; library has a similar problem. This library has a feature where input and output parameters which are read-only tuples are passed by value to improve performance. This could cause the same "slicing problem" as above. Now, &lt;code&gt;typed&lt;/code&gt; only passes final read-only tuples by value.&lt;/p&gt; &lt;p&gt;Finally, there was a previous mechanism for prohibiting subclassing, but it wasn't exposed as part of the syntax. 
It was used by the implementation of &lt;a href="http://docs.factorcode.org/content/article-classes.struct.html"&gt;struct classes&lt;/a&gt; to prevent subclassing of structs. The struct class implementation now simply declares struct classes as final.&lt;/p&gt; &lt;h3&gt;Typed words can now be declared foldable and flushable&lt;/h3&gt; &lt;p&gt;Factor has a pretty unique feature; the user can declare words &lt;a href="http://docs.factorcode.org/content/word-foldable%2Csyntax.html"&gt;foldable&lt;/a&gt; (which makes them eligible for constant folding at compile time if the inputs are all literal) or &lt;a href="http://docs.factorcode.org/content/word-flushable%2Csyntax.html"&gt;flushable&lt;/a&gt; (which makes them eligible for dead code elimination if the output results are not used). These declarations now work with typed words.&lt;/p&gt; &lt;h3&gt;Platform-specific vocabularies&lt;/h3&gt; &lt;p&gt;I added a facility to declare what operating systems a vocabulary runs on. Loading a vocabulary on an unsupported platform raises an error, with a restart if you know what you're doing. The &lt;code&gt;load-all&lt;/code&gt; word skips vocabularies which are not supported by the current platform.&lt;/p&gt; &lt;p&gt;If a &lt;code&gt;platforms.txt&lt;/code&gt; file exists in the vocabulary's directory, this file is interpreted as a newline-separated list of operating system names from &lt;a href="http://docs.factorcode.org/content/vocab-system.html"&gt;the system vocabulary&lt;/a&gt;. This complements the existing &lt;a href="http://docs.factorcode.org/content/article-vocabs.metadata.html"&gt;vocabulary metadata&lt;/a&gt; for authors, tags, and summary.&lt;/p&gt; &lt;p&gt;This feature helps the build farm avoid loading code for the wrong platform. It used to be that vocabularies with the "unportable" tag set would be skipped by &lt;code&gt;load-all&lt;/code&gt;, however this was too coarse-grained. 
For example, both the curses and DirectX bindings were tagged as unportable, and so the build farm was not loading or testing them on any platform. However, curses is available on Unix systems, and DirectX is available on Windows systems. With the new approach, there is a &lt;code&gt;extra/curses/platforms.txt&lt;/code&gt; file listing &lt;code&gt;unix&lt;/code&gt; as a supported platform, and the various DirectX vocabulary directories have &lt;code&gt;platforms.txt&lt;/code&gt; files listing &lt;code&gt;windows&lt;/code&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-646241296611545719?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/646241296611545719/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=646241296611545719' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/646241296611545719'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/646241296611545719'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/02/final-classes-and-platform-specific.html' title='Final classes and platform-specific vocabularies'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-8109546937023391345</id><published>2010-02-16T03:47:00.003-05:00</published><updated>2010-02-16T05:49:10.945-05:00</updated><title type='text'>Factor 0.92 
now available</title><content type='html'>&lt;p&gt;I'm proud to announce the release of Factor 0.92. This release comes two years after the last release, &lt;a href="http://factor-language.blogspot.com/2007/12/factor-091-now-available.html"&gt;0.91&lt;/a&gt;. Proceed to &lt;a href="http://factorcode.org/"&gt;the Factor website&lt;/a&gt; or directly to the &lt;a href="http://downloads.factorcode.org/releases/0.92/"&gt;downloads directory&lt;/a&gt; to obtain a copy.&lt;/p&gt;  &lt;p&gt;More than 30 developers have contributed code to this release. I would like to thank all of the users and contributors for their efforts; Factor would not be at the stage it is today without your help.&lt;/p&gt;  &lt;p&gt;Because there have been so many changes since the last release, the changelog below is only a partial summary. In particular, there have been many incompatible syntax and core library changes since the last release.&lt;/p&gt;  &lt;p&gt;The next release cycle will not last as long as this one; from this point on, I'm planning on having monthly releases or so. 
There will be fewer incompatible language changes in the upcoming months; the core language has almost been finalized at this point.&lt;/p&gt;  &lt;h3&gt;Language improvements&lt;/h3&gt;  &lt;ul&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-inference.html"&gt;Static stack effects enforced for all words&lt;/a&gt; (Slava Pestov, Daniel Ehrenberg)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-objects.html"&gt;New object system&lt;/a&gt;: generic accessors, inheritance replaces delegation, type declarations on slots, read only slots, singletons (Slava Pestov)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-dataflow-combinators.html"&gt;Data flow combinators&lt;/a&gt; (Eduardo Cavazos)&lt;/li&gt; &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-fry.html"&gt;Partial application syntax&lt;/a&gt; (Eduardo Cavazos)&lt;/li&gt; &lt;li&gt;Integers-as-sequences is no longer supported; use the &lt;a href="http://docs.factorcode.org/content/article-sequences-integers.html"&gt;iota&lt;/a&gt; virtual sequence instead (Doug Coleman, Slava Pestov)&lt;/li&gt; &lt;/ul&gt;  &lt;h3&gt;Notable new libraries&lt;/h3&gt;  &lt;ul&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-db.html"&gt;db&lt;/a&gt;: Relational database access library with SQLite and PostgreSQL backends (Doug Coleman)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-furnace.html"&gt;furnace&lt;/a&gt;: web framework that runs &lt;a href="http://concatenative.org"&gt;concatenative.org&lt;/a&gt; (Slava Pestov)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.encodings.html"&gt;io.encodings&lt;/a&gt;: Extensive support for I/O encodings. 
All operations that transform strings to bytes and vice versa now take an encoding parameter (Daniel Ehrenberg)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-unicode.html"&gt;unicode&lt;/a&gt;: Unicode-aware case conversion, character classes, collation, breaking, and normalization (Daniel Ehrenberg)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-regexp.html"&gt;regexp&lt;/a&gt;: DFA-based regular expression matching. Supports Unicode and compiles to machine code (Daniel Ehrenberg)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-io.sockets.secure.html"&gt;io.sockets.secure&lt;/a&gt;: secure sockets with OpenSSL (Slava Pestov)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-io.files.info.html"&gt;io.files.info&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-io.monitors.html"&gt;io.monitors&lt;/a&gt;: File system metadata and change monitoring (Doug Coleman, Slava Pestov)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-images.html"&gt;images&lt;/a&gt;: loading and displaying BMP, TIFF, PNG, JPEG and GIF images (Doug Coleman, Marc Fauconneau)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-gpu-summary.html"&gt;gpu&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/vocab-game.html"&gt;game&lt;/a&gt;: Frameworks for GPU rendering and game development using OpenGL (Joe Groff)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html"&gt;math.floats.env&lt;/a&gt;: advanced IEEE floating point features (Joe Groff)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/vocab-math.blas.html"&gt;math.blas&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/article-alien.fortran.html"&gt;alien.fortran&lt;/a&gt;: bindings to BLAS linear algebra library, built on top of a new Fortran FFI&lt;/li&gt;  
&lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-math.vectors.simd.html"&gt;math.vectors.simd&lt;/a&gt;: high-performance SIMD arithmetic with support for SSE versions up to 4.2 (Joe Groff)&lt;/li&gt;  &lt;li&gt;&lt;a href="http://docs.factorcode.org/content/article-specialized-arrays.html"&gt;specialized-arrays&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/article-specialized-vectors.html"&gt;specialized-vectors&lt;/a&gt;, &lt;a href="http://docs.factorcode.org/content/article-classes.struct.html"&gt;classes.struct&lt;/a&gt;: high-performance support for unboxed value types (Slava Pestov, Joe Groff)&lt;/li&gt;  &lt;/ul&gt;  &lt;h3&gt;Implementation improvements&lt;/h3&gt;  &lt;ul&gt;  &lt;li&gt;Thanks to the continuous integration system, binary packages are now available for 13 platforms (Eduardo Cavazos, Slava Pestov)&lt;/li&gt;  &lt;li&gt;The compiler has been rewritten to improve compile time, performance of generated code, and correctness (&lt;a href="http://factor-language.blogspot.com/2008/08/new-optimizer.html"&gt;1&lt;/a&gt;, &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;2&lt;/a&gt;, &lt;a href="http://factor-language.blogspot.com/2009/07/improvements-to-factors-register.html"&gt;3&lt;/a&gt;, &lt;a href="http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html"&gt;4&lt;/a&gt;, &lt;a href="http://factor-language.blogspot.com/2009/07/dataflow-analysis-computing-dominance.html"&gt;5&lt;/a&gt;, &lt;a href="http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html"&gt;6&lt;/a&gt;)  (Slava Pestov)&lt;/li&gt;   &lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/05/factors-implementation-of-polymorphic.html"&gt;Polymorphic inline caching&lt;/a&gt; (Slava Pestov)&lt;/li&gt;  &lt;li&gt;The garbage collector has been rewritten. 
&lt;a href="http://factor-language.blogspot.com/2009/10/improved-write-barriers-in-factors.html"&gt;A more precise write barrier speeds up minor collections&lt;/a&gt;, and &lt;a href="http://factor-language.blogspot.com/2009/11/mark-compact-garbage-collection-for.html"&gt;a mark-sweep-compact algorithm is now used for full collections&lt;/a&gt; (Slava Pestov)&lt;/li&gt;  &lt;li&gt;The Factor UI now supports Unicode font rendering, and the UI developer tools have all seen significant improvements (Slava Pestov)&lt;/li&gt;  &lt;/ul&gt;  &lt;h3&gt;Editor integration&lt;/h3&gt;  &lt;ul&gt;  &lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/01/screencast-editing-factor-code-with.html"&gt;fuel&lt;/a&gt;: Factor's Ultimate Emacs Library, a powerful emacs mode for editing Factor code. See &lt;code&gt;misc/fuel/README&lt;/code&gt; in the Factor download for details (Jose A Ortega)&lt;/li&gt;  &lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-8109546937023391345?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/8109546937023391345/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=8109546937023391345' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8109546937023391345'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8109546937023391345'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/02/factor-092-now-available.html' title='Factor 0.92 now available'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-1954666520736434424</id><published>2010-01-25T14:44:00.013-05:00</published><updated>2010-01-26T01:08:54.834-05:00</updated><title type='text'>Replacing GNU assembler with Factor code</title><content type='html'>Lately a major goal of mine has been to get the Factor VM to be as close as possible to standard C++, and to compile it with at least one non-gcc compiler, namely Microsoft's Windows SDK.&lt;br /&gt;&lt;br /&gt;I've already eliminated usages of gcc-specific features from the C++ source (register variables mostly) and &lt;a href="http://factor-language.blogspot.com/2009/12/freeing-factor-from-gccs-embrace-and.html"&gt;blogged about it&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The remaining hurdle was that several of the VM's low-level subroutines were written directly in GNU assembler, and not C++. Microsoft's toolchain uses a different assembler syntax, and I don't fancy writing the same routines twice, especially since the code in question is already x86-specific. Instead, I've decided to eliminate assembly code from the VM altogether. Factor's compiler infrastructure has perfectly good assembler DSLs for x86 and PowerPC written in Factor itself already.&lt;br /&gt;&lt;br /&gt;Essentially, I rewrote the GNU assembler source files in Factor itself. The individual assembler routines have been replaced by new functionality added to both the non-optimizing and optimizing compiler backends. This avoids the issue of GNU assembler versus Microsoft assembler syntax entirely. Now all assembly in the implementation is consistently written in the same postfix Factor assembler syntax.&lt;br /&gt;&lt;br /&gt;The non-optimizing compiler already supports "sub-primitives", words whose definition consists of assembler opcodes, inlined at each call site. 
I added new sub-primitives to these files to replace some of the VM's assembly routines:&lt;br /&gt;&lt;ul&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/bootstrap.factor&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/32/bootstrap.factor&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/64/bootstrap.factor&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD"&gt;basis/cpu/ppc/bootstrap.factor&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;A few entry points are now generated by the optimizing compiler, too. The optimizing compiler has complicated machinery for generating arbitrary machine code. I extended this with a Factor language feature similar to C's inline assembly, where Factor's assembler DSL is used to generate arbitrary assembly within a word's compiled definition. This is more flexible than the non-optimizing compiler's sub-primitives, and has wide applications beyond the immediate task of replacing a few GNU assembler routines with Factor code.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Factor platform ABI&lt;/h3&gt;&lt;br /&gt;Before jumping into a discussion of the various assembly routines in the VM, it is important to understand the Factor calling convention first, and how it differs from C.&lt;br /&gt;&lt;br /&gt;Factor's machine language calling convention in a nutshell:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Two registers are reserved, for the data and retain stacks, respectively.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Both registers point into an array of tagged pointers. 
Quotations and words pass and receive parameters as tagged pointers on the data stack.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;On PowerPC and x86-64, an additional register is reserved for the current VM instance.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The call stack register (ESP on x86, r1 on PowerPC) is used as in C, with call frames stored in a contiguous fashion.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Call frames must have a bit of metadata so that the garbage collector can mark code blocks that are referenced via return addresses. This ensures that currently-executing code is not deallocated, even if no other references to it remain.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;This GC metadata consists of three things: the stack frame's size in bytes, a pointer to the start of a compiled code block, and a return address inside that code block. Since every frame records its size and the next frame immediately follows, the garbage collector can trace and update all return addresses accurately.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A call frame can have arbitrary size, but the garbage collector does not inspect the additional payload; it can be any blob of binary data at all. The optimizing compiler generates large call frames in a handful of rare situations when a scratch area is needed for raw data.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Quotations must be called with a pointer to the quotation object in a distinguished register (even on x86-32, where the C ABI does not use registers at all). Remaining registers do not have to be preserved, and can be used for any purpose in the compiled code.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Tail calls to compiled words must load the program counter into a special register (EBX on x86-32). This allows polymorphic inline caches at tail call sites to patch the call address if the cache misses. 
A non-tail call PIC can look at the return address on the call stack, but for a space-saving tail-call, this is not available, so to make inline caching work in all cases, tail calls have to pass this address directly. The only compiled blocks that read the value of this register on entry are tail call PIC miss stubs.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;All other calls to compiled words are made without any registers having defined contents at all, so effectively all registers that are not reserved for a specific purpose are volatile.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The call stack pointer must be suitably aligned so that SIMD code can spill vector data to the call frame. This is already the case in the C ABI on all platforms except non-Mac 32-bit x86.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Replacing the &lt;code&gt;c_to_factor()&lt;/code&gt; entry point&lt;/h3&gt;&lt;br /&gt;There are two situations where C code needs to jump into Factor; when Factor is starting up, and when C functions invoke function pointers generated by &lt;code&gt;alien-callback&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;There used to be a &lt;code&gt;c_to_factor()&lt;/code&gt; function defined in a GNU assembly source file as part of the VM, that would take care of translating from the C ABI to the Factor ABI. C++ code can call assembly routines that obey the C calling convention directly.&lt;br /&gt;&lt;br /&gt;Now that the special assembly entry point is gone from the VM, a valid question to ask is how is it even possible to switch ABIs and jump out into Factor-land, without stepping outside the bounds of C++, by writing some inline assembler at least. It seems like an impossible dilemma.&lt;br /&gt;&lt;br /&gt;The Factor VM ensures that the transition stub that C code uses to call into Factor is generated seemingly out of thin air. 
It turns out the only unportable C++ feature you really need when bootstrapping a JIT is the ability to cast a data pointer into a function pointer, and call the result.&lt;br /&gt;&lt;br /&gt;The new &lt;code&gt;factor_vm::c_to_factor()&lt;/code&gt; method, called on VM startup, looks for a function pointer in a member variable named &lt;code&gt;factor_vm::c_to_factor_func&lt;/code&gt;. Initially, the value is NULL, and if this is the case, it dynamically generates the entry point and then calls the brand-new function pointer:&lt;br /&gt;&lt;pre&gt;void factor_vm::c_to_factor(cell quot)&lt;br /&gt;{&lt;br /&gt;    /* First time this is called, wrap the c-to-factor sub-primitive inside&lt;br /&gt;    of a callback stub, which saves and restores non-volatile registers&lt;br /&gt;    as per platform ABI conventions, so that the Factor compiler can treat&lt;br /&gt;    all registers as volatile */&lt;br /&gt;    if(!c_to_factor_func)&lt;br /&gt;    {&lt;br /&gt;        tagged&amp;lt;word&gt; c_to_factor_word(special_objects[C_TO_FACTOR_WORD]);&lt;br /&gt;        code_block *c_to_factor_block = callbacks-&gt;add(c_to_factor_word.value(),0);&lt;br /&gt;        c_to_factor_func = (c_to_factor_func_type)c_to_factor_block-&gt;entry_point();&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    c_to_factor_func(quot);&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;All machine code generated by the Factor compiler is stored in the code heap, where blocks of code can move. But &lt;code&gt;c_to_factor()&lt;/code&gt; needs a stable function pointer to make the initial jump out of C and into Factor. 
As I briefly mentioned in a blog post about &lt;a href="http://factor-language.blogspot.com/2009/11/mark-compact-garbage-collection-for.html"&gt;mark sweep compact garbage collection&lt;/a&gt;, Factor has a separate &lt;i&gt;callback heap&lt;/i&gt; for allocating unmovable function pointers intended to be passed to C.&lt;br /&gt;&lt;br /&gt;This callback heap is used for the initial startup entry point too, as well as callbacks generated by &lt;code&gt;alien-callback&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;As mentioned in the comment, the callback stub now takes care of saving and restoring non-volatile registers, as well as aligning the stack frame. You can see how callback stubs are defined with the Factor assembler by grepping for &lt;code&gt;callback-stub&lt;/code&gt; in the non-optimizing compiler backend.&lt;br /&gt;&lt;br /&gt;The new callback stub covers part of what the old assembly &lt;code&gt;c_to_factor()&lt;/code&gt; entry point did. The remaining component is calling the quotation itself, and this is now done by a special word &lt;code&gt;c-to-factor&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;c-to-factor&lt;/code&gt; word loads the data stack and retain stack pointers and jumps to the quotation's compiled definition. Grep for &lt;code&gt;c-to-factor-impl&lt;/code&gt; in the non-optimizing compiler backend.&lt;br /&gt;&lt;br /&gt;In effect, by abusing the non-optimizing compiler's support for "sub-primitives", machine code for the C-to-Factor entry point can be generated by the VM itself.&lt;br /&gt;&lt;br /&gt;Callbacks generated by &lt;code&gt;alien-callback&lt;/code&gt; in the optimizing compiler used to contain a call to the &lt;code&gt;c_to_factor()&lt;/code&gt; assembly routine. 
The equivalent machine code is now generated directly by the optimizing compiler.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Replacing the &lt;code&gt;throw_impl()&lt;/code&gt; entry point&lt;/h3&gt;&lt;br /&gt;When Factor code throws an error, a continuation is popped off the catch stack and resumed. When the VM needs to throw an error, it has to go through the same motions, but perform a non-local return to unwind any C++ stack frames first, before it can jump back into Factor and resume another continuation.&lt;br /&gt;&lt;br /&gt;There used to be an assembly routine named &lt;code&gt;throw_impl()&lt;/code&gt; which would take a quotation and a new value for the stack pointer.&lt;br /&gt;&lt;br /&gt;This is now handled in a similar manner to &lt;code&gt;c_to_factor()&lt;/code&gt;. The &lt;code&gt;unwind-native-frames&lt;/code&gt; word in &lt;code&gt;kernel.private&lt;/code&gt; is another one of those very special sub-primitives that use the C calling convention for receiving parameters. It reloads the data and retain stack registers, and changes the call stack pointer to the given parameter. The call is coming in from C++ code, and the contents of these registers are not guaranteed since they play no special role in the C ABI. Grep for &lt;code&gt;unwind-native-frames&lt;/code&gt; in the non-optimizing compiler backend.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Replacing the &lt;code&gt;lazy_jit_compile_impl()&lt;/code&gt; and &lt;code&gt;set_callstack()&lt;/code&gt; entry points&lt;/h3&gt;&lt;br /&gt;As I discussed in my most recent blog post, on the &lt;a href="http://factor-language.blogspot.com/2010/01/how-factor-implements-closures.html"&gt;implementation of closures in Factor&lt;/a&gt;, quotation compilation is deferred, and initially all quotations point to the same shared entry point. This shared entry point used to be an assembly routine in the VM. 
It is now the &lt;code&gt;lazy-jit-compile&lt;/code&gt; sub-primitive.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;set-callstack&lt;/code&gt; primitive predates the existence of sub-primitives, so it was implemented as an assembly routine in the VM for historical reasons.&lt;br /&gt;&lt;br /&gt;These two entry points are never called directly from C++ code in the VM, so unlike &lt;code&gt;c_to_factor()&lt;/code&gt; and &lt;code&gt;throw_impl()&lt;/code&gt;, there is no C++ code to fish out generated code from a special word and turn it into a function pointer.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Inline assembly with the &lt;code&gt;alien-assembly&lt;/code&gt; combinator&lt;/h3&gt;&lt;br /&gt;I added a new word, &lt;code&gt;alien-assembly&lt;/code&gt;. In the same way as &lt;code&gt;alien-invoke&lt;/code&gt;, it generates code which marshals Factor data into C values, and passes them according to the C calling convention; but where &lt;code&gt;alien-invoke&lt;/code&gt; would generate a subroutine call, &lt;code&gt;alien-assembly&lt;/code&gt; just calls the quotation at compile time instead, no questions asked. 
The quotation can emit&lt;br /&gt;any machine code it desires, but the result has to obey the C calling&lt;br /&gt;convention.&lt;br /&gt;&lt;br /&gt;Here is an example: a horribly unportable way to add two floating-point numbers that only works on x86.64:&lt;br /&gt;&lt;pre&gt;: add ( a b -- c )&lt;br /&gt;    double { double double } "cdecl"&lt;br /&gt;    [ XMM0 XMM1 ADDPD ]&lt;br /&gt;    alien-assembly ;&lt;br /&gt;&lt;br /&gt;1.5 2.0 add .&lt;br /&gt;3.5&lt;/pre&gt;&lt;br /&gt;The important thing is that unlike assembly code in the VM, using this feature the same assembly code will work regardless of whether the Factor VM was compiled with gcc or the Microsoft toolchain!&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Replacing FPU state entry points&lt;/h3&gt;&lt;br /&gt;The remaining VM assembly routines were used to save and restore FPU state used for &lt;a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html"&gt;advanced floating point features&lt;/a&gt;. The &lt;code&gt;math.floats.env&lt;/code&gt; vocabulary would call these assembly  routines as if they were ordinary C functions, using &lt;code&gt;alien-invoke&lt;/code&gt;. After the refactoring, the optimizing compiler now generates code for these routines using &lt;code&gt;alien-assembly&lt;/code&gt; instead. 
A hook dispatching on CPU type is used to pick the implementation for the current CPU.&lt;br /&gt;&lt;br /&gt;See these files for details:&lt;br /&gt;&lt;ul&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/math/floats/env/x86/32/32.factor"&gt;basis/math/floats/env/x86/32/32.factor&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/math/floats/env/x86/64/64.factor"&gt;basis/math/floats/env/x86/64/64.factor&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;On PowerPC, using non-gcc compilers is not a goal, so these routines remain in the VM, and the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S"&gt;vm/cpu-ppc.S&lt;/a&gt; source file still exists. It contains these FPU routines along with one other entry point for flushing the instruction cache (this is a no-op on x86).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Compiling Factor with the Windows SDK&lt;/h3&gt;&lt;br /&gt;With this refactoring out of the way, the Factor VM can now be compiled using the Microsoft toolchain, in addition to Cygwin and Mingw. The primary benefit of using the Microsoft toolchain is that it has allowed me to revive the 64-bit Windows support.&lt;br /&gt;&lt;br /&gt;Last time I managed to &lt;a href="http://factor-language.blogspot.com/2008/11/factor-port-to-win64-in-progress.html"&gt;get gcc working successfully on Win64&lt;/a&gt;, the Factor VM was written in C. The switch to C++ killed the Win64 port, because the 64-bit Windows Mingw port is a huge pain to install correctly. 
I never got gcc to produce a working executable after the C++ rewrite of the VM, and as a result we haven't had a binary package on this platform since April 2009.&lt;br /&gt;&lt;br /&gt;Microsoft's free (as in beer) &lt;a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=c17ba869-9671-4330-a63e-1fd44e0e2505&amp;displaylang=en"&gt;Windows SDK&lt;/a&gt; includes command-line compilers for x86-32 and x86-64, together with various tools, such as a linker, and &lt;code&gt;nmake&lt;/code&gt;, a program similar to Unix &lt;code&gt;make&lt;/code&gt;. Factor now ships with an &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=Nmakefile"&gt;Nmakefile&lt;/a&gt; that uses the SDK tools to build the VM:&lt;br /&gt;&lt;pre&gt;nmake /f nmakefile&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;After fixing some minor compile errors, the Windows SDK produced a working Win64 executable. After updating the FFI code a little, I quickly got it passing all of the compiler tests. As a result of this work, Factor binaries for 64-bit Windows will be available again soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-1954666520736434424?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/1954666520736434424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=1954666520736434424' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1954666520736434424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1954666520736434424'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/01/replacing-gnu-assembler-with-factor.html' title='Replacing 
GNU assembler with Factor code'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-727703538517691144</id><published>2010-01-18T02:45:00.009-05:00</published><updated>2010-01-18T07:08:34.824-05:00</updated><title type='text'>How Factor implements closures</title><content type='html'>The recent blog post from the Clozure CL team on &lt;a href="http://ccl.clozure.com/blog/?p=53"&gt;Clozure CL's implementation of closures&lt;/a&gt; inspired me to do a similar writeup about Factor's implementation. It is often said that "closures are a poor man's objects", or "objects are a poor man's closures". Factor takes the former view, because as you will see they are largely implemented within the object system itself.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Quotations&lt;/h3&gt;&lt;br /&gt;First, let us look at quotations, and what happens internally when a quotation is called. A &lt;a href="http://docs.factorcode.org/content/article-quotations.html"&gt;quotation&lt;/a&gt; is a sequence of literals and words. Quotations do not close over any lexical environment; they are entirely self-contained, and their evaluation semantics only depend on their elements, not any state from the time they were constructed. So quotations are anonymous functions but not closures.&lt;br /&gt;&lt;br /&gt;Internally, a quotation is a pair, consisting of an array and a machine code entry point. 
The array stores the quotation's elements, and when you print a quotation with the prettyprinter, this is how Factor knows what its elements are: &lt;pre&gt;( scratchpad ) [ "out.txt" utf8 [ "Hi" print ] with-file-writer ] .&lt;br /&gt;[ "out.txt" utf8 [ "Hi" print ] with-file-writer ]&lt;/pre&gt;&lt;br /&gt;The machine code entry point refers to a code block in the code heap containing the quotation's compiled definition.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-call%2Ckernel.html"&gt;call&lt;/a&gt; generic word calls quotations. This word can take any &lt;a href="http://docs.factorcode.org/content/word-callable%2Cquotations.html"&gt;callable&lt;/a&gt;, not just a quotation, and has a method for each type of callable. The &lt;code&gt;callable&lt;/code&gt; class includes quotations, as well as &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt; which are discussed below. This means that closure invocation is implemented on top of method dispatch in Factor.&lt;br /&gt; &lt;br /&gt;For reasons that will become clear in the last section on compiler optimizations, most quotations never have their entry points called directly, and so it would be a waste of time to compile all quotations as they are read by the parser.&lt;br /&gt;&lt;br /&gt;Instead, all freshly-parsed quotations have their entry points set to the &lt;code&gt;lazy-jit-compile&lt;/code&gt; primitive from the &lt;a href="http://docs.factorcode.org/content/vocab-kernel.private.html"&gt;kernel.private&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;call&lt;/code&gt; generic word has a method on the quotation class. This method invokes the &lt;a href="http://docs.factorcode.org/content/word-%28call%29%2Ckernel.private.html"&gt;(call)&lt;/a&gt; primitive from the kernel.private vocabulary. The &lt;code&gt;(call)&lt;/code&gt; primitive does not type check, since by the time it is called it is known that the input is in fact a quotation. 
This primitive has a very simple machine code definition: &lt;pre&gt;mov eax,[esi]    ! Pop datastack&lt;br /&gt;sub esi,4&lt;br /&gt;jmp [eax+13]     ! Jump to quotation's entry point&lt;/pre&gt;Note that the quotation is stored in the EAX register; this is important. Recall that initially, a quotation's entry point is set to the &lt;code&gt;lazy-jit-compile&lt;/code&gt; word, and that all quotations initially share this entry point.&lt;br /&gt;&lt;br /&gt;This word, which is not meant to be invoked directly, compiles quotations the first time they are called. Since all quotations share the same initial entry point, it needs to know which quotation invoked it. This is done by passing the quotation to this word in the EAX register. The &lt;code&gt;lazy-jit-compile&lt;/code&gt; word compiles this quotation, sets its entry point to the new compiled code block, and then calls it again. On subsequent calls of the same quotation, the new compiled definition will be jumped to directly, and the &lt;code&gt;lazy-jit-compile&lt;/code&gt; entry point is not involved.&lt;br /&gt;&lt;br /&gt;If you're interested in the definition of &lt;code&gt;lazy-jit-compile&lt;/code&gt;, search for it in these files:&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/32/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/64/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD"&gt;basis/cpu/ppc/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h3&gt;The curry and compose words&lt;/h3&gt; &lt;h4&gt;curry&lt;/h4&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-curry%2Ckernel.html"&gt;curry&lt;/a&gt; word is the fundamental 
constructor for making closures. It takes a value and a &lt;code&gt;callable&lt;/code&gt;, and returns something that prints out as if it were a new quotation:&lt;pre&gt;( scratchpad ) 5 [ + 2 * ] curry .&lt;br /&gt;[ 5 + 2 * ]&lt;/pre&gt;&lt;br /&gt;This is not a quotation though, but rather an instance of the &lt;code&gt;curry&lt;/code&gt; class. Instances of this class are pairs of elements: an object, and another &lt;code&gt;callable&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Instances of &lt;code&gt;curry&lt;/code&gt; are &lt;code&gt;callable&lt;/code&gt;, because the &lt;code&gt;call&lt;/code&gt; generic word has a method on &lt;code&gt;curry&lt;/code&gt;. This method pushes the first element of the pair on the data stack, then recursively calls &lt;code&gt;call&lt;/code&gt; on the second element.&lt;br /&gt;    &lt;br /&gt;Calls to &lt;code&gt;curry&lt;/code&gt; can be chained together: &lt;pre&gt;( scratchpad ) { "a" "b" "c" } 1 2 [ 3array ] curry curry map .&lt;br /&gt;{ { "a" 1 2 } { "b" 1 2 } { "c" 1 2 } }&lt;/pre&gt;&lt;br /&gt;Note that using &lt;code&gt;curry&lt;/code&gt;, many callables can be constructed which share the same compiled definition; only the data value differs.&lt;br /&gt;&lt;br /&gt;Technically, &lt;code&gt;curry&lt;/code&gt; is just an optimization -- it would be possible to simulate it by using sequence words to construct a new quotation with the desired value prepended, but this would be extremely inefficient. Prepending an element would take O(n) time, and furthermore, result in the new quotation being compiled the first time the result was called. 
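&lt;br /&gt;&lt;br /&gt;As a rough sketch of that simulation, assuming the &lt;code&gt;prefix&lt;/code&gt; word from the &lt;code&gt;sequences&lt;/code&gt; vocabulary (which copies the whole sequence to make room for the new first element), a listener session might look like this; &lt;code&gt;slow-curry&lt;/code&gt; is a hypothetical name used only for illustration:&lt;pre&gt;( scratchpad ) : slow-curry ( obj quot -- newquot ) swap prefix ;&lt;br /&gt;( scratchpad ) 5 [ + 2 * ] slow-curry .&lt;br /&gt;[ 5 + 2 * ]&lt;/pre&gt;&lt;br /&gt;The printed form matches the &lt;code&gt;curry&lt;/code&gt; example, but here the result is a freshly allocated quotation rather than a &lt;code&gt;curry&lt;/code&gt; instance, so it shares no compiled definition with the original.&lt;br /&gt;&lt;br /&gt;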
Indeed, in some simpler concatenative languages such as &lt;a href="http://www.latrobe.edu.au/philosophy/phimvt/joy.html"&gt;Joy&lt;/a&gt;, quotations are just linked lists, and they execute in the interpreter, so partial application can be done by using the &lt;code&gt;cons&lt;/code&gt; primitive for creating a new linked list.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;compose&lt;/h4&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-compose%2Ckernel.html"&gt;compose&lt;/a&gt; word takes two &lt;code&gt;callable&lt;/code&gt;s and returns a new instance of the &lt;code&gt;compose&lt;/code&gt; class. Instances of this class are pairs of &lt;code&gt;callable&lt;/code&gt;s. &lt;pre&gt;( scratchpad ) [ 3 + ] [ sqrt ] compose .&lt;br /&gt;[ 3 + sqrt ]&lt;/pre&gt;&lt;br /&gt;As with &lt;code&gt;curry&lt;/code&gt;, the &lt;code&gt;call&lt;/code&gt; generic word has a method on the &lt;code&gt;compose&lt;/code&gt; class. This method recursively applies &lt;code&gt;call&lt;/code&gt; to both elements.&lt;br /&gt;&lt;br /&gt;It is possible to express the effect of &lt;code&gt;compose&lt;/code&gt; using &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;dip&lt;/code&gt;: &lt;pre&gt;: my-compose ( quot1 quot2 -- compose )&lt;br /&gt;    [ [ call ] dip call ] curry curry ; inline&lt;/pre&gt;&lt;br /&gt;The main reason &lt;code&gt;compose&lt;/code&gt; exists as a distinct type is to make the result prettyprint better. 
Were it defined as above, the result would not prettyprint in a nice way: &lt;pre&gt;( scratchpad ) [ 3 + ] [ sqrt ] [ [ call ] dip call ] curry curry .&lt;br /&gt;[ [ 3 + ] [ sqrt ] [ call ] dip call ]&lt;/pre&gt;&lt;br /&gt;The &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt; constructors are sufficient to express all higher-level forms of closures.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;An aside: compose and dip&lt;/h4&gt;&lt;br /&gt;It is possible to express &lt;code&gt;compose&lt;/code&gt; using &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;dip&lt;/code&gt;. Conversely, it is also possible to express &lt;code&gt;dip&lt;/code&gt; using &lt;code&gt;compose&lt;/code&gt; and &lt;code&gt;curry&lt;/code&gt;. &lt;pre&gt;: my-dip ( a quot -- ) swap [ ] curry compose call ;&lt;/pre&gt;&lt;br /&gt;Here is an example of how this works. Suppose we have the following code: &lt;pre&gt;123 321 [ number&gt;string ] my-dip&lt;/pre&gt;&lt;br /&gt;Using the above definition of &lt;code&gt;my-dip&lt;/code&gt;, we get &lt;pre&gt;123 321 [ number&gt;string ] swap [ ] curry compose call  ! substitute definition of 'my-dip'&lt;br /&gt;123 [ number&gt;string ] 321 [ ] curry compose call       ! evaluate swap&lt;br /&gt;123 [ number&gt;string ] [ 321 ] compose call             ! evaluate curry&lt;br /&gt;123 [ number&gt;string 321 ] call                         ! evaluate compose&lt;br /&gt;123 number&gt;string 321                                  ! evaluate call&lt;br /&gt;"123" 321                                              ! evaluate number&gt;string&lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;Fry syntax&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/article-fry.html"&gt;fry syntax&lt;/a&gt; provides nicer syntax sugar for more complicated usages of &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt;. 
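&lt;br /&gt;&lt;br /&gt;A small illustration, assuming the &lt;code&gt;fry&lt;/code&gt; vocabulary is loaded: the &lt;code&gt;_&lt;/code&gt; specifier fills in a value taken from the stack, much like &lt;code&gt;curry&lt;/code&gt;, and the &lt;code&gt;@&lt;/code&gt; specifier splices in a callable, much like &lt;code&gt;compose&lt;/code&gt;. The comments describe the intended equivalence, not the exact expansion:&lt;pre&gt;5 '[ _ + 2 * ]         ! a callable equivalent to: 5 [ + 2 * ] curry&lt;br /&gt;[ sqrt ] '[ 3 + @ ]    ! a callable equivalent to: [ 3 + ] [ sqrt ] compose&lt;/pre&gt;&lt;br /&gt;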
Beginners learning Factor should start with fry syntax, and probably don't need to know about &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt; at all; but this syntax desugars trivially into &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt;, as explained in the documentation.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Lexical variables&lt;/h3&gt;&lt;br /&gt;Code written with &lt;a href="http://docs.factorcode.org/content/article-locals.html"&gt;the locals vocabulary&lt;/a&gt; can create closures by referencing lexical variables from nested quotations. For example, here is a word from the compiler which computes the breadth-first order on a control-flow graph: &lt;pre&gt;:: breadth-first-order ( cfg -- bfo )&lt;br /&gt;    &amp;lt;dlist&gt; :&gt; work-list&lt;br /&gt;    cfg post-order length &lt;vector&gt; :&gt; accum&lt;br /&gt;    cfg entry&gt;&gt; work-list push-front&lt;br /&gt;    work-list [&lt;br /&gt;        [ accum push ]&lt;br /&gt;        [ dom-children work-list push-all-front ] bi&lt;br /&gt;    ] slurp-deque&lt;br /&gt;    accum ;&lt;/pre&gt;&lt;br /&gt;In the body of the word, three lexical variables are used; the input parameter &lt;code&gt;cfg&lt;/code&gt;, and the two local bindings &lt;code&gt;work-list&lt;/code&gt; and &lt;code&gt;accum&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;When the parser reads the above definition, it creates several quotations whose bodies reference local variables, for example, &lt;code&gt;[ accum push ]&lt;/code&gt;. Before defining the new word, however, the :: parsing word rewrites this into concatenative code.&lt;br /&gt;&lt;br /&gt;Suppose we were doing this rewrite by hand. The first step would be to make closed-over variables explicit. 
We can do this by currying the values of &lt;code&gt;accum&lt;/code&gt; and &lt;code&gt;work-list&lt;/code&gt; onto the two quotations passed to &lt;code&gt;bi&lt;/code&gt;: &lt;pre&gt;:: breadth-first-order ( cfg -- bfo )&lt;br /&gt;    &amp;lt;dlist&gt; :&gt; work-list&lt;br /&gt;    cfg post-order length &lt;vector&gt; :&gt; accum&lt;br /&gt;    cfg entry&gt;&gt; work-list push-front&lt;br /&gt;    work-list [&lt;br /&gt;        accum [ push ] curry&lt;br /&gt;        work-list [ [ dom-children ] dip push-all-front ] curry bi&lt;br /&gt;    ] slurp-deque&lt;br /&gt;    accum ;&lt;/pre&gt;&lt;br /&gt;And now, we can do the same transformation on the quotation passed to &lt;code&gt;slurp-deque&lt;/code&gt;: &lt;pre&gt;:: breadth-first-order ( cfg -- bfo )&lt;br /&gt;    &amp;lt;dlist&gt; :&gt; work-list&lt;br /&gt;    cfg post-order length &lt;vector&gt; :&gt; accum&lt;br /&gt;    cfg entry&gt;&gt; work-list push-front&lt;br /&gt;    work-list accum work-list [&lt;br /&gt;        [ [ push ] curry ] dip&lt;br /&gt;        [ [ dom-children ] dip push-all-front ] curry bi&lt;br /&gt;    ] curry curry slurp-deque&lt;br /&gt;    accum ;&lt;/pre&gt;&lt;br /&gt;Note how three usages of &lt;code&gt;curry&lt;/code&gt; have appeared, and all lexical variable usage now only occurs at the same level where the variable is actually defined. The final rewrite into concatenative code is trivial, and involves some tricks with &lt;code&gt;dup&lt;/code&gt; and &lt;code&gt;dip&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;A lexical closure closing over many variables will be rewritten into code which builds a long chain of &lt;code&gt;curry&lt;/code&gt; instances, essentially a linked list. This is less efficient than closure representations used in Lisp implementations, where typically all closed-over values are stored in a single array. 
However, in practice this is not usually a problem, because of optimizations outlined below.&lt;br /&gt;&lt;h4&gt;Mutable local variables&lt;/h4&gt;&lt;br /&gt;Mutable local variables are denoted by suffixing their name with &lt;code&gt;!&lt;/code&gt;. Here is an example of a loop that counts a mutable variable up to 100:&lt;pre&gt;:: counted-loop-test ( -- )&lt;br /&gt;    0 :&gt; i!&lt;br /&gt;    100 [ i 1 + i! ] times ;&lt;/pre&gt;&lt;br /&gt;Factor distinguishes between binding (done with &lt;code&gt;:&gt;&lt;/code&gt;) and assignment (done with &lt;code&gt;foo!&lt;/code&gt; where &lt;code&gt;foo&lt;/code&gt; is a local previously declared to be mutable). At a fundamental level, all bindings and closures are immutable; code using mutable locals is rewritten to close over a mutable heap-allocated box instead, and reads and writes involve an extra indirection. The previous example is roughly equivalent to the following, where we use an immutable local variable holding a mutable one-element array:&lt;pre&gt;:: counted-loop-test ( -- )&lt;br /&gt;    { 0 } clone :&gt; i&lt;br /&gt;    100 [ i first 1 + i set-first ] times ;&lt;/pre&gt;&lt;br /&gt;For this reason, code that uses mutable locals does not optimize as well, and iterative loops that use mutable local variables can run slower than tail-recursive functions which use immutable local variables. This might be addressed in the future with improved compiler optimizations.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Optimizations&lt;/h3&gt; &lt;h4&gt;Quotation inlining&lt;/h4&gt;&lt;br /&gt;If, after inlining, the compiler sees that &lt;code&gt;call&lt;/code&gt; is applied to a literal quotation, it inlines the quotation's body at the call site. 
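&lt;br /&gt;&lt;br /&gt;As a sketch of the idea (the word names &lt;code&gt;apply-twice&lt;/code&gt; and &lt;code&gt;add-two&lt;/code&gt; are made up for this example), consider an inline combinator and a word that passes it a literal quotation:&lt;pre&gt;! Hypothetical combinator: applies quot to x, then to the result&lt;br /&gt;: apply-twice ( x quot -- y ) [ call ] keep call ; inline&lt;br /&gt;! The quotation passed here is a literal&lt;br /&gt;: add-two ( x -- y ) [ 1 + ] apply-twice ;&lt;/pre&gt;&lt;br /&gt;Once &lt;code&gt;apply-twice&lt;/code&gt; has been inlined into &lt;code&gt;add-two&lt;/code&gt;, both uses of &lt;code&gt;call&lt;/code&gt; are applied to the literal &lt;code&gt;[ 1 + ]&lt;/code&gt;, so the compiled body should behave like &lt;code&gt;1 + 1 +&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;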
This optimization works very well when quotations are used as "downward closures", and this is why most quotations never have their dynamic entry point invoked at all.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Escape analysis&lt;/h4&gt;&lt;br /&gt;Since &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt; are ordinary tuple classes, any time some code constructs instances of &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt;, and immediately unpacks them, the compiler's escape analysis pass can eliminate the allocations entirely.&lt;br /&gt;      &lt;br /&gt;The escape analysis pass has no special knowledge of &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt;; it applies an optimization intended for object-oriented code.&lt;br /&gt;      &lt;br /&gt;Again, this helps optimize code with "downward closures", with the result that most usages of &lt;code&gt;curry&lt;/code&gt; and &lt;code&gt;compose&lt;/code&gt; will never allocate any memory at run time.&lt;br /&gt;&lt;br /&gt;For more details, see my blog post about &lt;a href="http://factor-language.blogspot.com/2008/08/algorithm-for-escape-analysis.html"&gt;Factor's escape analysis algorithm&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-727703538517691144?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/727703538517691144/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=727703538517691144' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/727703538517691144'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/727703538517691144'/><link rel='alternate' 
type='text/html' href='http://factor-language.blogspot.com/2010/01/how-factor-implements-closures.html' title='How Factor implements closures'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-657377910223810792</id><published>2010-01-10T09:34:00.001-05:00</published><updated>2010-01-10T09:35:50.280-05:00</updated><title type='text'>Factor's bootstrap process explained</title><content type='html'>&lt;h3&gt;Separation of concerns between Factor VM and library code&lt;/h3&gt;&lt;br /&gt;The Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special &lt;i&gt;startup quotation&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code. The new data and code heaps can then be saved into another image file, for faster loading in the future.&lt;br /&gt;&lt;br /&gt;Factor's core data structures, object system, and source parser are implemented in Factor and live in the image, so the Factor VM does not have enough machinery to start with an empty data and code heap and parse Factor source files by itself. Instead, the VM needs to start from a data and code heap that already contains enough Factor code to parse source files. This poses a chicken-and-egg problem; how do you build a Factor system from source code? 
The VM can be compiled with a C++ compiler, but the result is not sufficient by itself.&lt;br /&gt;&lt;br /&gt;Some image-based language systems cannot generate new images from scratch at all, and the only way to create a new image is to snapshot an existing session. This is the simplest approach but it has serious downsides -- it lacks determinism and reproducibility, and it is difficult to make big changes to the system.&lt;br /&gt;&lt;br /&gt;While Factor can snapshot the current execution state into an image, it also has a tool to generate a "fresh" image from source. While this tool is written in Factor and runs inside of an existing Factor system, the resulting images depend as little as possible on the particular state of the system that generated them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Stage 1 bootstrap: creating a boot image&lt;/h3&gt;&lt;br /&gt;The initial data heap comes from a &lt;i&gt;boot image&lt;/i&gt;, which is built from an existing Factor system, known as the &lt;i&gt;host&lt;/i&gt;. The result is a new boot image which can run in the &lt;i&gt;target&lt;/i&gt; system. Boot images are created using the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/image/image.factor;hb=HEAD"&gt;bootstrap.image&lt;/a&gt; tool, whose main entry point is the &lt;a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html"&gt;make-image&lt;/a&gt; word. This word can be invoked from the listener in a running Factor instance: &lt;pre&gt;"x86.32" make-image&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html"&gt;make-image&lt;/a&gt; word parses source files using the host's parser, and the code in those source files forms the target image. This tool can be thought of as a form of cross-compiler, except boot images only contain a data heap, and not a code heap. 
The code heap is generated on the target by the VM, and later by the target's optimizing compiler in Factor.&lt;br /&gt; &lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html"&gt;make-image&lt;/a&gt; word runs the source file &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/bootstrap/stage1.factor;hb=HEAD"&gt;core/bootstrap/stage1.factor&lt;/a&gt;, which kicks off the bootstrap process.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Building the embryo for the new image&lt;/h4&gt;&lt;br /&gt;The global namespace is the most important object stored in an image file. The global namespace contains various global variables that are used by the parser, among them the &lt;a href="http://docs.factorcode.org/content/word-dictionary%2Cvocabs.html"&gt;dictionary&lt;/a&gt;. The dictionary is a hashtable mapping vocabulary names to vocabulary objects. Vocabulary objects contain various bits of state, among them a hashtable mapping word names to word objects. Word objects store their definition as a quotation. The dictionary is how code is represented in memory in Factor; it is built and modified by loading source files from disk.&lt;br /&gt;&lt;br /&gt;One of the tasks performed by &lt;code&gt;stage1.factor&lt;/code&gt; is to read the source file &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/bootstrap/primitives.factor;hb=HEAD"&gt;core/bootstrap/primitives.factor&lt;/a&gt;. This source file creates a minimal global namespace and dictionary that target code can be loaded into. This initial dictionary consists of primitive words corresponding to all primitives implemented in the VM, along with some initial state for the object system, consisting of built-in classes such as &lt;a href="http://docs.factorcode.org/content/word-array%2Carrays.html"&gt;array&lt;/a&gt;. 
The code in this file runs in the host, but it constructs objects that will ultimately end up in the boot image of the target.&lt;br /&gt;&lt;br /&gt;A second piece of code that runs in order to prepare the environment for the target is the CPU-specific backend for the VM's non-optimizing compiler. Again, these are source files which run on the host:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/32/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD"&gt;basis/cpu/x86/64/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD"&gt;basis/cpu/ppc/bootstrap.factor&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;The non-optimizing compiler does little more than glue chunks of machine code together, so the backends are relatively simple and consist of several dozen short machine code definitions. These machine code chunks are stored as byte arrays, constructed by Factor's x86 and PowerPC assemblers.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Loading core source files&lt;/h4&gt;&lt;br /&gt;Once the initial global environment consisting of primitives and built-in classes has been prepared, source files comprising the core library are loaded in. From this point on, code read from disk does not run in the host, only in the target. 
The host's parser is still being used, though.&lt;br /&gt;&lt;br /&gt;Factor's vocabulary system loads dependencies automatically, so &lt;code&gt;stage1.factor&lt;/code&gt; simply calls &lt;a href="http://docs.factorcode.org/content/word-require%2Cvocabs.loader.html"&gt;require&lt;/a&gt; on a few essential vocabularies which end up pulling in everything in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=core;hb=HEAD"&gt;core&lt;/a&gt; &lt;a href="http://docs.factorcode.org/content/article-vocabs.roots.html"&gt;vocabulary root&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;During normal operation, any source code at the top level of a source file, not in any definition, is run when the source file is loaded. During stage1 bootstrap, top-level forms from source files in &lt;code&gt;core&lt;/code&gt; are not run on the host. Instead, they need to be run on the target, when the VM is launched with the new boot image.&lt;br /&gt;&lt;br /&gt;After loading all source files from &lt;code&gt;core&lt;/code&gt;, the startup quotation is constructed. The startup quotation begins by calling top-level forms in core source files in the order in which they were loaded, and then runs &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/stage2.factor;hb=HEAD"&gt;basis/bootstrap/stage2.factor&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Serializing the result&lt;/h4&gt;&lt;br /&gt;At this point, stage1 bootstrap has constructed a new global namespace consisting of a dictionary, object system meta-data, and other objects, together with a startup quotation which can kick off the next stage of bootstrap.&lt;br /&gt;&lt;br /&gt;Data heap objects that the VM needs to know about, such as the global namespace, startup quotation, and non-optimizing compiler definitions, are stored in an array of "special objects". 
Entries are defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/objects.hpp;hb=HEAD"&gt;vm/objects.hpp&lt;/a&gt;, and in the image file they are stored in the image header.&lt;br /&gt;&lt;br /&gt;This object graph, rooted at the special objects array, is now serialized to disk into an image file. The bootstrap image generator serializes objects in the same format in which they are stored in the VM's heap, but it does this without dumping the VM's memory directly. This allows object layouts to be changed relatively easily, by first updating the bootstrap image tool, generating an image with the new layouts, then updating the VM and running the new VM with the new image.&lt;br /&gt;&lt;br /&gt;The bootstrap image generator also takes care to write the resulting data with the correct cell size and endianness. Along with containing CPU-specific machine code templates for the non-optimizing compiler, this is what makes boot images platform-specific.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Stage 2 bootstrap: fleshing out a development image&lt;/h3&gt;&lt;br /&gt;At this point, the host system has written a boot image file to disk, and the next stage of bootstrap can begin. This stage runs on the target, and is initiated by starting the Factor VM with the new boot image: &lt;pre&gt;./factor -i=boot.x86.32.image&lt;/pre&gt; The VM reads the new image into an empty data heap. At this point, it also notices that the boot image does not have a code heap, so it cannot start executing the boot quotation just yet.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Early initialization&lt;/h4&gt;&lt;br /&gt;Boot images have a special flag set in them which kicks off the "early init" process in the VM. This only takes a few seconds, and entails compiling all words in the image with the non-optimizing compiler. Once this is done, the VM can call the startup quotation. 
Quotations are also compiled by the non-optimizing compiler the first time they are called.&lt;br /&gt;&lt;br /&gt;This startup quotation was constructed during stage1 bootstrap. It runs top-level forms in core source files, then runs &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/stage2.factor;hb=HEAD"&gt;basis/bootstrap/stage2.factor&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Loading major subsystems&lt;/h4&gt;&lt;br /&gt;Whereas the goal of stage1 bootstrap is to generate a minimal image that contains barely enough code to be able to load additional source files, stage2 creates a usable development image containing the optimizing compiler, documentation, UI tools, and everything else that makes it into a recognizable Factor system.&lt;br /&gt;&lt;br /&gt;The major vocabularies loaded by stage2 bootstrap include: &lt;ul&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/compiler/compiler.factor;hb=HEAD"&gt;Optimizing compiler&lt;/a&gt; -- after the optimizing compiler has been loaded, all words in the image are compiled again. 
This is the longest part of stage2 bootstrap, taking several minutes.&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/help/help.factor;hb=HEAD"&gt;Help system&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/tools/tools.factor;hb=HEAD"&gt;Developer tools&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/ui/ui.factor;hb=HEAD"&gt;UI framework&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/ui/tools/tools.factor;hb=HEAD"&gt;UI tools&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;&lt;br /&gt;&lt;h4&gt;Finishing up&lt;/h4&gt;&lt;br /&gt;The last step taken by stage2 bootstrap is to install a new startup quotation. This startup quotation does the usual command-line processing; if no switches are specified, it starts the UI listener, otherwise it runs a source file or vocabulary given on the command line.&lt;br /&gt;&lt;br /&gt;Once the new startup quotation has been installed, the current session is saved to a new image file using the &lt;a href="http://docs.factorcode.org/content/word-save-image-and-exit%2Cmemory.html"&gt;save-image-and-exit&lt;/a&gt; word.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-657377910223810792?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/657377910223810792/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=657377910223810792' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/657377910223810792'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/657377910223810792'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2010/01/factors-bootstrap-process-explained.html' title='Factor&apos;s bootstrap process explained'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-1072160069946447975</id><published>2009-12-27T06:59:00.001-05:00</published><updated>2009-12-27T07:04:36.373-05:00</updated><title type='text'>Freeing Factor from gcc's embrace-and-extend C language extensions</title><content type='html'>I have completed a refactoring of the Factor VM, eliminating usage of gcc-specific language extensions, namely &lt;a href="http://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html"&gt;global register variables&lt;/a&gt;, and on x86-32, the &lt;a href="http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html"&gt;regparm calling convention&lt;/a&gt;. My motivation for this is two-fold.&lt;br /&gt;&lt;br /&gt;First of all, I'd like to have the option of compiling the Factor VM with compilers other than gcc, such as Visual Studio. While gcc runs on all of Factor's supported platforms and then some, setting up a build environment using the GNU toolchain on Windows takes a bit of work, especially on 64-bit Windows. Visual Studio will provide an easier alternative for people who wish to build Factor from source on that platform. In the future, I'd also like to try using &lt;a href="http://clang.llvm.org/"&gt;Clang&lt;/a&gt; to build Factor.&lt;br /&gt;&lt;br /&gt;The second reason is that the global register variable extension is poorly implemented in gcc. 
Anyone who has followed Factor development for a while will know that gcc bugs come up pretty frequently, and most of these seem to be related to global register variables. This is quite simply one of the less well-tested corners of gcc, and the gcc developers seem to mostly ignore optimizer bugs triggered by this feature.&lt;br /&gt;&lt;br /&gt;The Factor VM used a pair of register variables to hold data stack and retain stack pointers. These are just ordinary fields in a struct now. Back in the days when Factor was interpreted and the interpreter was part of the VM, a lot of time was spent executing code within the VM itself, and keeping these pointers in registers was important. Nowadays the Factor implementation compiles to machine code even during interactive use, using a pair of compilers called the non-optimizing compiler and optimizing compiler. Code generated by Factor's compilers tends to dominate the performance profile, rather than code in the VM itself. Compiled code can utilize registers in any manner desired, and so it continues to access the data stack and retain stack through registers. To make it work with the modified VM, the compiler generates code for saving and restoring these registers in the VM's &lt;code&gt;context&lt;/code&gt; structure before and after calls into the VM.&lt;br /&gt;&lt;br /&gt;A few functions defined in the VM used gcc's &lt;code&gt;regparm&lt;/code&gt; calling convention. Normally, on x86-32, function parameters are always passed on the call stack, which is addressed through the &lt;code&gt;esp&lt;/code&gt; register; &lt;code&gt;regparm&lt;/code&gt; functions instead pass the first 1, 2 or 3 arguments in registers. Whether or not this results in a performance boost is debatable, but my motivation for using this feature was not performance but perceived simplicity. 
The optimizing compiler would generate calls to these functions, and the machine code generated is a little simpler if it can simply stick parameters in registers instead of storing them on the call stack. Eliminating &lt;code&gt;regparm&lt;/code&gt; did not in reality make anything more complex, as only a few locations were affected and the issue was limited to the x86-32 backend only.&lt;br /&gt;&lt;br /&gt;I'm pretty happy with how the refactoring turned out. It did not seem to affect performance at all, which was not too surprising, since code generated by Factor's optimizing compiler did not change, and the additional overhead surrounding Factor to C calls is lost in the noise.&lt;br /&gt;&lt;br /&gt;My goal of getting Factor to build with other compilers is not quite achieved yet, however. While the gcc extensions are gone from C++ code, the VM still has some 800 lines of assembly source using GNU assembler syntax, in the files &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.32.S;hb=HEAD"&gt;cpu-x86.32.S&lt;/a&gt;, &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.64.S;hb=HEAD"&gt;cpu-x86.64.S&lt;/a&gt;, and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S;hb=HEAD"&gt;cpu-ppc.S&lt;/a&gt;. This code includes fundamental facilities which cannot be implemented in C++, such as the main C to Factor call gate, the low-level &lt;a href="http://docs.factorcode.org/content/word-set-callstack%2Ckernel.html"&gt;set-callstack&lt;/a&gt; primitive used by the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/continuations/continuations.factor;hb=HEAD"&gt;continuations implementation&lt;/a&gt;, and a few other things. 
The assembly source also has a few auxiliary CPU-dependent functions, for example for saving and restoring &lt;a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html"&gt;FPU flags&lt;/a&gt;, and detecting the SSE version on x86.&lt;br /&gt;&lt;br /&gt;I plan on eliminating the assembly from the VM entirely, by having the Factor compiler generate this code instead. The non-optimizing compiler can generate things such as the C to Factor call gate. For the remaining assembly routines, such as FPU feature access and SSE version detection, I plan on adding an "inline assembly" facility to Factor itself, much like gcc's &lt;code&gt;asm&lt;/code&gt; statement. The result will be that the VM will be a pure C++ codebase, and the machine code generation will be entirely offloaded into Factor code, where it belongs. Factor's assembler DSL is much nicer to use than GNU assembler.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-1072160069946447975?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/1072160069946447975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=1072160069946447975' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1072160069946447975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1072160069946447975'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/12/freeing-factor-from-gccs-embrace-and.html' title='Freeing Factor from gcc&apos;s embrace-and-extend C language extensions'/><author><name>Slava 
Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-1450401663353361303</id><published>2009-12-06T02:00:00.001-05:00</published><updated>2009-12-06T02:27:53.835-05:00</updated><title type='text'>Reducing image size by eliminating literal tables from code heap entries</title><content type='html'>&lt;h3&gt;Introduction&lt;/h3&gt; The compiled code heap consists of code blocks which reference other code blocks and objects in the data heap. For example, consider the following word:&lt;pre&gt;: hello ( -- ) "Hello world" print ;&lt;/pre&gt;&lt;br /&gt;The machine code for this word is the following:&lt;br /&gt;&lt;pre&gt;000000010ef55f10: 48b87b3fa40d01000000  mov rax, 0x10da43f7b&lt;br /&gt;000000010ef55f1a: 4983c608              add r14, 0x8&lt;br /&gt;000000010ef55f1e: 498906                mov [r14], rax&lt;br /&gt;000000010ef55f21: 48bb305ff50e01000000  mov rbx, 0x10ef55f30 (hello+0x20)&lt;br /&gt;000000010ef55f2b: e9e0ffabff            jmp 0x10ea15f10 (print)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The immediate operand of the first &lt;code&gt;mov&lt;/code&gt; instruction (&lt;code&gt;0x10da43f7b&lt;/code&gt;) is the address of the string &lt;code&gt;"Hello world"&lt;/code&gt; in the data heap. The immediate operand of the last &lt;code&gt;jmp&lt;/code&gt; instruction (&lt;code&gt;0x10ea15f10&lt;/code&gt;) is the address of the machine code of the &lt;code&gt;print&lt;/code&gt; word in the code heap.&lt;br /&gt;Unlike some dynamic language JITs where all references to data and compiled code from machine code are done via indirection tables, Factor embeds the actual addresses of the data in the code. 
This means that the garbage collector needs to be able to find all pointers in a code heap block (for the "sweep" phase of garbage collection), and update them (for the "compact" phase). &lt;br /&gt;&lt;h3&gt;Relocation tables&lt;/h3&gt; Associated with each code block is a relocation table, which tells the VM what instruction operands contain special values that it must be aware of. The relocation table is a list of triples, packed into a byte array: &lt;ul&gt; &lt;li&gt;The &lt;i&gt;relocation type&lt;/i&gt; is an instance of the &lt;code&gt;relocation_type&lt;/code&gt; enum in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/instruction_operands.hpp;hb=HEAD"&gt;instruction_operands.hpp&lt;/a&gt;. This value tells the VM what kind of value to deposit in the operand -- possibilities include a data heap pointer, the machine code of a word, and so on.&lt;/li&gt; &lt;li&gt;The &lt;i&gt;relocation class&lt;/i&gt; is an instance of the &lt;code&gt;relocation_class&lt;/code&gt; enum in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/instruction_operands.hpp;hb=HEAD"&gt;instruction_operands.hpp&lt;/a&gt;. This value tells the VM how the operand is represented -- the instruction format, whether or not it is a relative address, and such.&lt;/li&gt; &lt;li&gt;The &lt;i&gt;relocation offset&lt;/i&gt; is a byte offset from the start of the code block where the value is to be stored.&lt;/li&gt; &lt;/ul&gt; Code that needs to inspect relocation table entries uses the &lt;code&gt;each_instruction_operand()&lt;/code&gt; method defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.hpp;hb=HEAD"&gt;code_blocks.hpp&lt;/a&gt;. This is a template method which can accept any object overloading &lt;code&gt;operator()&lt;/code&gt;.&lt;br /&gt;&lt;h3&gt;Literal tables&lt;/h3&gt; The next part is what I just finished refactoring. I'll describe the old approach first. 
The simplest way, and what Factor used until now, is the following. Relocation table entries that expect a parameter, such as those that deposit addresses from the data heap and code heap, take the parameter from a literal table associated to each code block. When the compiler compiles a word, it spits out some machine code and a literal table. It hands these off to the Factor VM. The "sweep" phase of the garbage collector traces each code block's literal table, and the "compact" phase, after updating the literal table, stores the operands back in each instruction referenced from the relocation table.&lt;br /&gt;&lt;h3&gt;Eliminating literal tables&lt;/h3&gt; The problem with the old approach is that the garbage collector doesn't really need the literal table. The address of each data heap object and code block referenced from machine code is already stored in the machine code itself. Indeed, the only thing missing until now was a way to read instruction operands out of instructions. With this in place, code blocks no longer had to hold on to the literal table after being constructed. Each code block's literal table is only used to deposit the initial values into machine code when a code block is first compiled by the compiler. Subsequently, the literal table becomes garbage and is collected by the garbage collector. When tracing code blocks, the garbage collector traverses the instruction operands themselves, using the relocation table alone. 
In addition to the space savings gained by not keeping these arrays of literals around, another interesting side-effect of this refactoring is that a full garbage collection no longer resets generic word call sites back to the cold call entry point, which would effectively discard all inline caches (read about &lt;a href="http://factor-language.blogspot.com/2009/05/factors-implementation-of-polymorphic.html"&gt;inline caching in Factor&lt;/a&gt;).&lt;br /&gt;&lt;h3&gt;Coping with code redefinition&lt;/h3&gt; A call to a word is compiled as a direct jump in Factor. This means that if a word is redefined and recompiled, existing call sites need to be updated to point to the new definition. The implementation of this is slightly more subtle now that literal tables are gone. Every code block in the code heap has a reference to an &lt;code&gt;owner&lt;/code&gt; object in its header (see the &lt;code&gt;code_block&lt;/code&gt; struct in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.cpp;hb=HEAD"&gt;code_blocks.cpp&lt;/a&gt;). The owner is either a word or a quotation. Words and quotations, in turn, have a field which references a compiled code heap block. The latter always points at the most recent compiled definition of that object (note that quotations are only ever compiled once, because they are immutable; words, however, can be redefined by reloading source files). At the end of each compilation unit, the code heap is traversed, and each code block is updated in turn. The code block's relocation table is consulted, and instruction operands which reference compiled code heap blocks are updated. Previously, this was done by overwriting all call targets from the literal table. Now, this is accomplished by looking at the owner object of the current target, and then storing the owner's most recent code block back in the instruction operand. 
This is implemented in the &lt;code&gt;update_word_references()&lt;/code&gt; method defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.cpp;hb=HEAD"&gt;code_blocks.cpp&lt;/a&gt;. In addition to helping with redefinition, the owner object reference is used to construct call stack traces.&lt;br /&gt;&lt;h3&gt;Additional space savings in deployed binaries&lt;/h3&gt; Normally, every compiled code block references its owner object, so that code redefinition can work. This means that if a word or quotation is live, then the code block corresponding to its most recent definition will be live, and vice versa. In deployed images where the compiler and debugger have been stripped out, words cannot be redefined and stack traces are not needed, so the owner field can be stripped out. This means that a word object can get garbage collected at deploy time even if its compiled code block is called. As it turns out, most words are never used as objects, and can be stripped out in this manner. So the literal table removal has an even bigger effect in deployed images than development images.&lt;br /&gt;&lt;h3&gt;Size comparison&lt;/h3&gt; The following table shows image sizes before and after this change on Mac OS X x86-64. 
&lt;table&gt; &lt;tr&gt;&lt;th&gt;Image&lt;/th&gt;&lt;th&gt;Before (megabytes)&lt;/th&gt;&lt;th&gt;After (megabytes)&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;Development image&lt;/td&gt;&lt;td&gt;92 MB&lt;/td&gt;&lt;td&gt;90 MB&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;Minimal development image&lt;/td&gt;&lt;td&gt;8.6 MB&lt;/td&gt;&lt;td&gt;8.2 MB&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;Deployed &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/hello-ui;hb=HEAD"&gt;hello-ui&lt;/a&gt;&lt;/td&gt;&lt;td&gt;2.3 MB&lt;/td&gt;&lt;td&gt;1.5 MB&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;Deployed &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/bunny;hb=HEAD"&gt;bunny&lt;/a&gt;&lt;/td&gt;&lt;td&gt;3.5 MB&lt;/td&gt;&lt;td&gt;3.1 MB&lt;/td&gt;&lt;/tr&gt; &lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-1450401663353361303?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/1450401663353361303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=1450401663353361303' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1450401663353361303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1450401663353361303'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/12/reducing-image-size-by-eliminating.html' title='Reducing image size by eliminating literal tables from code heap entries'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-8501292410914821873</id><published>2009-11-16T03:24:00.006-05:00</published><updated>2009-11-16T13:30:06.482-05:00</updated><title type='text'>Mark-compact garbage collection for oldest generation, and other improvements</title><content type='html'>Factor now uses a mark-sweep-compact garbage collector for the oldest generation (known as tenured space), in place of the copying collector that it used before. This reduces memory usage. The mark-sweep collector used for the code heap has also been improved. It now shares code with the data heap collector and uses the same compaction algorithm. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Mark-sweep-compact garbage collection for tenured space&lt;/h3&gt; &lt;h4&gt;Mark phase&lt;/h4&gt; During the mark phase, the garbage collector computes the set of reachable objects by starting from the "roots": global variables, and objects referenced from runtime stacks. When an object is visited, the mark phase checks if the object has already been marked. If it hasn't yet been marked, it is marked, and any objects that it refers to are then also visited in turn. If an object has already been marked, nothing is done. As described above, the algorithm is recursive, which is problematic. There are two approaches to turn it into an iterative algorithm; either objects yet to be visited are pushed on a mark stack, and a top-level loop drains this stack, or a more complicated scheme known as "pointer inversion" is used. I decided against pointer inversion, since the mark stack approach is simpler and I have yet to observe the mark stack grow beyond 128 KB or so anyway. I use an &lt;code&gt;std::vector&lt;/code&gt; for the mark stack and it works well enough. 
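As a rough illustration of the mark stack approach (a Python sketch, not the VM's C++ code), the loop below marks everything reachable from a set of roots without recursion. The &lt;code&gt;references&lt;/code&gt; dictionary and the &lt;code&gt;mark_reachable&lt;/code&gt; name are hypothetical stand-ins for reading pointers out of object slots.

```python
def mark_reachable(roots, references):
    """Return the set of objects reachable from roots, without recursion.

    `references` is a hypothetical dict mapping each object to the list of
    objects it points at; a real collector reads these from object slots.
    """
    marked = set()
    mark_stack = list(roots)      # objects that still have to be visited
    while mark_stack:             # the top-level loop drains the stack
        obj = mark_stack.pop()
        if obj in marked:
            continue              # already marked: nothing to do
        marked.add(obj)
        mark_stack.extend(references.get(obj, ()))
    return marked


# "d" points at "a", but nothing reachable from the root points at "d".
heap = {"a": ["b", "c"], "b": ["a"], "c": [], "d": ["a"]}
live = mark_reachable(["a"], heap)
```

The cycle between "a" and "b" terminates because an object that is already marked is skipped when it is popped.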
The mark stack and the loop that drains it are implemented in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.cpp;hb=HEAD"&gt;full_collector.cpp&lt;/a&gt;. There are also several approaches to representing the set of objects which are currently marked. The two most common are storing a mark bit for each object in the object's header, and storing mark bits in a bitmap off to the side. I chose the latter approach, since it speeds up the sweep and compact phases of the garbage collector, and doesn't require changing object header layout. Each bit in the mark bitmap corresponds to 16 bytes of heap space. For reasons that will become clear, when an object is marked in the mark bitmap, every bit corresponding to space taken up by the object is marked, not just the first bit. The mark bitmap is implemented in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/mark_bits.hpp;hb=HEAD"&gt;mark_bits.hpp&lt;/a&gt;. &lt;h4&gt;Sweep phase&lt;/h4&gt; Once the mark phase is complete, the mark bitmap now has an up-to-date picture of what regions in the heap correspond to reachable objects, and which are free. The sweep phase begins by clearing the free list, and then computes a new one by traversing the mark bitmap. For every contiguous range of clear bits, a new entry is added to the free list. This is why I use a mark bitmap instead of mark bits in object headers; if I had used mark bits in headers, then the sweep phase would need to traverse the entire heap, not just the mark bitmap, which has significantly worse locality. The sweep phase is implemented by the &lt;code&gt;free_list_allocator::sweep()&lt;/code&gt; method in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/free_list_allocator.hpp;hb=HEAD"&gt;free_list_allocator.hpp&lt;/a&gt;. 
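The idea of rebuilding the free list from runs of clear bits can be sketched as follows. This is a simplified Python illustration under stated assumptions: the bitmap is a plain list of booleans rather than packed machine words, and the &lt;code&gt;sweep&lt;/code&gt; function and &lt;code&gt;GRANULE&lt;/code&gt; constant are names made up for the example, not the VM's API.

```python
GRANULE = 16  # bytes of heap covered by one mark bit, as stated in the post


def sweep(mark_bits):
    """Build a free list of (start_byte, size_in_bytes) from a mark bitmap.

    `mark_bits` is a list of booleans here for clarity; packing the bits
    into machine words is deliberately left out of this sketch.
    """
    free_list = []
    run_start = None
    for index, bit in enumerate(mark_bits):
        if not bit and run_start is None:
            run_start = index                 # a run of clear bits begins
        elif bit and run_start is not None:
            size = (index - run_start) * GRANULE
            free_list.append((run_start * GRANULE, size))
            run_start = None
    if run_start is not None:                 # run extends to the heap end
        size = (len(mark_bits) - run_start) * GRANULE
        free_list.append((run_start * GRANULE, size))
    return free_list


bits = [True, True, False, False, False, True, False, True, False, False]
free_blocks = sweep(bits)
```

Only the bitmap is read; the heap itself is never touched, which is the locality argument made above.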
The key to an efficient implementation of the sweep algorithm is the &lt;code&gt;log2()&lt;/code&gt; function in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/bitwise_hacks.hpp;hb=HEAD"&gt;bitwise_hacks.hpp&lt;/a&gt;; it is used to find the first set bit in a cell. I use the &lt;code&gt;BSR&lt;/code&gt; instruction on x86 and &lt;code&gt;cntlzw&lt;/code&gt; on PowerPC. The sweep phase also needs to update the object start offset map used for &lt;a href="http://factor-language.blogspot.com/2009/10/improved-write-barriers-in-factors.html"&gt;the card marking write barrier&lt;/a&gt;. When collecting a young generation, the garbage collector scans the set of marked cards. It needs to know where the first object in each card is, so that it can properly identify pointers. This information is maintained by the &lt;code&gt;object_start_map&lt;/code&gt; class defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/object_start_map.cpp;hb=HEAD"&gt;object_start_map.cpp&lt;/a&gt;. If an object that happens to be the first object in a card was deallocated by the sweep phase, the object start map must be updated to point at a subsequent object in that card. This is done by the &lt;code&gt;object_start_map::update_for_sweep()&lt;/code&gt; method. &lt;h4&gt;Compact phase&lt;/h4&gt; The compact phase is optional; it only runs if the garbage collector determines that there is too much fragmentation, or if the user explicitly requests it. The compact phase does not require that the sweep phase has been run, only the mark phase. Like the sweep phase, the compact phase relies on the mark bitmap having been computed. Whereas the sweep phase identifies free blocks and adds them to the free list, the compact phase slides allocated space up so that all free space ends up at the end of the heap, in a single block. The compact phase has two steps. 
The first step computes a forwarding map: a data structure that can tell you the final destination of every heap block. It is easy to see that the final destination of every block can be determined from the number of set bits in the mark bitmap that precede it. Since the forwarding map is consulted frequently -- once for every pointer in every object that is live during compaction -- it is important that lookups are as fast as possible. The forwarding map should also be space-efficient. This completely rules out using a hashtable (with many small objects in the heap, it would grow to be almost as big as the heap itself) or simply scanning the bitmap and counting bits every time (since compaction would then become an &lt;code&gt;O(n^2)&lt;/code&gt; algorithm). The correct solution is very simple, and well-known in language implementation circles, but I wasn't aware of it until I studied the &lt;a href="http://ccl.clozure.com/manual/chapter16.4.html"&gt;Clozure Common Lisp garbage collector&lt;/a&gt;. You count the bits set in every group of 32 (or 64) bits in the mark bitmap, building an array of cumulative sums as you go. Then, to count the number of bits that are set up to a given element, you look up the pre-computed population count for the nearest 32 (or 64) bit boundary, and manually compute the population count for the rest. This gives you a forwarding map with &lt;code&gt;O(1)&lt;/code&gt; lookup time. This algorithm relies on a fast population count algorithm; I used the standard CPU-independent technique in the &lt;code&gt;popcount()&lt;/code&gt; function of &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/bitwise_hacks.hpp;hb=HEAD"&gt;bitwise_hacks.hpp&lt;/a&gt;. Nowadays, as of SSE 4.2, x86 CPUs even include a &lt;code&gt;POPCNT&lt;/code&gt; instruction, but since compaction spends most of its time in &lt;code&gt;memmove()&lt;/code&gt;, I didn't investigate if this would offer a speedup. 
Using &lt;code&gt;POPCNT&lt;/code&gt; would require a CPUID check at runtime and the fallback would still need to be there for pre-Intel i7 CPUs and PowerPC, so it didn't seem worth the extra complexity to me. Once the forwarding map has been computed, objects can be moved up and pointers that they contain updated in one pass. A mark-compact cycle takes roughly twice as long as a mark-sweep, which is why I elected not to perform a mark-compact on every full collection. Compacting on every full collection would lead to a simpler implementation (no sweep phase, and no free list; allocation in tenured space is done by bumping a pointer just as with a copying collector); however, the performance penalty didn't seem worth the minor code size reduction to me. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Code heap compaction&lt;/h3&gt; Code heap compaction has been in Factor for a while, in the form of the &lt;a href="http://docs.factorcode.org/content/word-save-image-and-exit,memory.html"&gt;save-image-and-exit&lt;/a&gt; word. This used an inefficient multi-pass algorithm (known as the "LISP-2 compaction algorithm"); however, since it only ran right before exiting Factor, it didn't matter too much. Now, code heap compaction can happen at any time as a result of heap fragmentation, and uses the same efficient algorithm as tenured space compaction. Compaction moves code around, and doing this at a time other than right before the VM exits creates a few complications: &lt;ul&gt; &lt;li&gt;Return addresses in the callstack need to be updated.&lt;/li&gt; &lt;li&gt;If an inline cache misses, a new inline cache is compiled, and the call site for the old cache is patched. 
Inline cache construction allocates memory and can trigger a GC, which may in turn trigger a code heap compaction; if this occurs, the return address passed into the inline cache miss stub may have moved, and the code to patch the call site needs to be aware of this.&lt;/li&gt; &lt;li&gt;If a Factor callback is passed into C code, then moving code around in the code heap may invalidate the callback, and the garbage collector has no way to update any function pointers that C libraries might be referencing.&lt;/li&gt; &lt;/ul&gt; The solution to the first problem was straightforward and involved adding some additional code to the code heap compaction pass. The second problem is trickier. I added a new &lt;code&gt;code_root&lt;/code&gt; smart pointer class, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_roots.hpp;hb=HEAD"&gt;code_roots.hpp&lt;/a&gt;, to complement the &lt;code&gt;data_root&lt;/code&gt; smart pointer (see my blog post about &lt;a href="http://factor-language.blogspot.com/2009/05/factor-vm-ported-to-c.html"&gt;moving the VM to C++&lt;/a&gt; for details about that). The inline cache compiler wraps the return address in a &lt;code&gt;code_root&lt;/code&gt; to ensure that if the code block that contains it is moved by any allocations, the return address can be updated properly. I solved the last problem with a layer of indirection (that's how all problems are solved in CS, right?). Callbacks are still allocated in the code heap, but the function pointer passed to C is actually stored in a separate "callback heap" where every block consists of a single jump instruction and nothing else. When a code heap compaction occurs, code blocks in the code heap might be moved around, and all jumps in the callback heap are updated. Blocks within the callback heap are never moved around (or even deallocated, since that isn't safe either). 
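&lt;br /&gt;&lt;br /&gt;To make the forwarding map shared by tenured space and code heap compaction a bit more concrete, here is a minimal C++ sketch of the cumulative-sum idea. It is an illustration only, not the VM's code: the names &lt;code&gt;low_bits_set&lt;/code&gt;, &lt;code&gt;build_forwarding_map&lt;/code&gt; and &lt;code&gt;forward_block&lt;/code&gt; are made up for this example, it assumes a 32-bit &lt;code&gt;unsigned int&lt;/code&gt; with one mark bit per heap block, and it swaps in a naive one-bit-at-a-time count where the VM uses the fast &lt;code&gt;popcount()&lt;/code&gt;.
&lt;pre&gt;
```cpp
// Hypothetical sketch: one mark bit per heap block, a set bit means "live".
typedef unsigned int word_t;        // assumed to be 32 bits wide
enum { BITS_PER_WORD = 32 };

// Count the set bits among the lowest `nbits` bits of `w`.
// Naive loop for clarity; a real VM would use a branch-free popcount.
unsigned low_bits_set(word_t w, unsigned nbits)
{
    unsigned count = 0;
    while (nbits != 0) {
        count += w % 2;   // add the lowest bit
        w /= 2;           // drop the lowest bit
        nbits--;
    }
    return count;
}

// Fill sums[i] with the number of set bits in all bitmap words before word i.
void build_forwarding_map(const word_t *marks, unsigned nwords, unsigned *sums)
{
    unsigned total = 0;
    for (unsigned i = 0; i != nwords; i++) {
        sums[i] = total;
        total += low_bits_set(marks[i], BITS_PER_WORD);
    }
}

// Index of a live block after compaction: the number of live blocks before it.
unsigned forward_block(const word_t *marks, const unsigned *sums, unsigned block)
{
    unsigned word = block / BITS_PER_WORD;
    unsigned bit = block % BITS_PER_WORD;
    return sums[word] + low_bits_set(marks[word], bit);
}
```
&lt;/pre&gt;
Each lookup reads one precomputed sum and at most one bitmap word, which is where the constant lookup time comes from; the array of sums adds only one extra word per 32 blocks on top of the bitmap.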
&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;New object alignment and tagging scheme&lt;/h3&gt; The last major change I wanted to discuss is that objects are now aligned on 16-byte boundaries, rather than 8-byte boundaries. This wastes more memory, but I've already offset most of the increase with some space optimizations, with more to follow. There are several benefits to this new system. First of all, the binary payload of byte array objects now begins on a 16-byte boundary, which allows SIMD intrinsics to use aligned access instructions, which are much faster. Second, it simplifies the machine code for inline caches and megamorphic method dispatch. Since the low 4 bits of every pointer are now clear, this allows all built-in VM types to fit in the pointer tag itself, so the logic to get a class descriptor for an object in a dispatch stub is very simple now; here is pseudo-code for the assembly that gets generated: &lt;pre&gt;cell tag = ptr &amp;amp; 15; if(tag == tuple_tag) tag = ptr[cell - tuple_tag]; ... 
dispatch on tag ...&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-8501292410914821873?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/8501292410914821873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=8501292410914821873' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8501292410914821873'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8501292410914821873'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/11/mark-compact-garbage-collection-for.html' title='Mark-compact garbage collection for oldest generation, and other improvements'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5675966809102446507</id><published>2009-10-15T07:16:00.009-04:00</published><updated>2009-10-16T00:13:07.306-04:00</updated><title type='text'>Improved write barriers in Factor's garbage collector</title><content type='html'>The Factor compiler has been evolving very quickly lately; it has been almost completely rewritten several times in the last couple of years. The garbage collector, on the other hand, hasn't seen nearly as much action. 
The last time I did any work on it was &lt;a href="http://factor-language.blogspot.com/2008/05/garbage-collection-throughput.html"&gt;May 2008&lt;/a&gt;, and before that, &lt;a href="http://factor-language.blogspot.com/2007/05/garbage-collector-improvements.html"&gt;May 2007&lt;/a&gt;. Now more than a year later, I've devoted a week or so to working on it. The result is a cleaner implementation, and improved performance.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Code restructuring&lt;/h3&gt;I re-organized the garbage collector code to be more extensible and easier to maintain. I did this by splitting off a bunch of the garbage collector methods from the &lt;code&gt;factor_vm&lt;/code&gt; class into their own set of classes. I made extensive use of template metaprogramming to help structure code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. Templates are my favorite C++ feature, and if it wasn't for templates C++ would just be a shitty object-oriented dialect of C.&lt;br /&gt;&lt;br /&gt;First up is the &lt;code&gt;collector&lt;/code&gt; template class, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/collector.hpp;hb=HEAD"&gt;collector.hpp&lt;/a&gt;:&lt;pre&gt;template&amp;lt;typename TargetGeneration, typename Policy&gt; struct collector&lt;/pre&gt;This class has two template parameters:&lt;ul&gt;&lt;li&gt;&lt;code&gt;TargetGeneration&lt;/code&gt; - this is the generation that the collector will be copying objects to. A generation is any class that implements the &lt;code&gt;allot()&lt;/code&gt; method.&lt;/li&gt;&lt;li&gt;&lt;code&gt;Policy&lt;/code&gt; - this is a class that simulates a higher-order function. 
It implements a &lt;code&gt;should_copy_p()&lt;/code&gt; method that tells the collector if a given object should be promoted to the target generation, or left alone.&lt;/li&gt;&lt;/ul&gt;On its own, the collector class can't do any garbage collection itself; it just implements methods which trace GC roots: &lt;code&gt;trace_contexts()&lt;/code&gt; (traces active stacks), &lt;code&gt;trace_roots()&lt;/code&gt; (traces global roots), and &lt;code&gt;trace_handle()&lt;/code&gt; (traces one pointer).&lt;br /&gt;&lt;br /&gt;Next up is the &lt;code&gt;copying_collector&lt;/code&gt; template class, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD"&gt;copying_collector.hpp&lt;/a&gt;:&lt;pre&gt;template&amp;lt;typename TargetGeneration, typename Policy&gt; struct copying_collector&lt;/pre&gt;This class has the same two template parameters as &lt;code&gt;collector&lt;/code&gt;; the target generation must define one additional method, &lt;code&gt;next_object_after()&lt;/code&gt;. This is used when scanning newly copied objects. 
This class implements logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection.&lt;br /&gt;&lt;br /&gt;Then, there are four subclasses implementing each type of GC pass:&lt;ul&gt;&lt;li&gt;&lt;code&gt;nursery_collector&lt;/code&gt; - copies live objects from the nursery into aging space, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/nursery_collector.hpp;hb=HEAD"&gt;nursery_collector.hpp&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/nursery_collector.cpp;hb=HEAD"&gt;nursery_collector.cpp&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;aging_collector&lt;/code&gt; - copies live objects from the first aging semi-space to the second aging semi-space, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.hpp;hb=HEAD"&gt;aging_collector.hpp&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.cpp;hb=HEAD"&gt;aging_collector.cpp&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;to_tenured_collector&lt;/code&gt; - copies live objects from aging space into tenured space, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/to_tenured_collector.hpp;hb=HEAD"&gt;to_tenured_collector.hpp&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/to_tenured_collector.cpp;hb=HEAD"&gt;to_tenured_collector.cpp&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;full_collector&lt;/code&gt; - copies live objects from the first tenured semi-space to the second tenured semi-space, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.hpp;hb=HEAD"&gt;full_collector.hpp&lt;/a&gt; and &lt;a 
href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.cpp;hb=HEAD"&gt;full_collector.cpp&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;Each class subclasses &lt;code&gt;copying_collector&lt;/code&gt; and specializes the two template arguments. For example, let's take a look at the declaration of the nursery collector:&lt;pre&gt;struct nursery_collector : copying_collector&amp;lt;aging_space,nursery_policy&gt;&lt;/pre&gt;This subclass specializes its superclass to copy objects to aging space, using the following policy class:&lt;br /&gt;&lt;pre&gt;struct nursery_policy {&lt;br /&gt; factor_vm *myvm;&lt;br /&gt;&lt;br /&gt; nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}&lt;br /&gt;&lt;br /&gt; bool should_copy_p(object *untagged)&lt;br /&gt; {&lt;br /&gt;  return myvm-&gt;nursery.contains_p(untagged);&lt;br /&gt; }&lt;br /&gt;};&lt;/pre&gt;&lt;br /&gt;That is, any object that is in the nursery will be copied to aging space by the &lt;code&gt;nursery_collector&lt;/code&gt;. Other collector subclasses are similar.&lt;br /&gt; &lt;br /&gt;This code all uses templates, rather than virtual methods, so every collector pass will have a specialized code path generated for it. This gives higher performance, with cleaner code, than was possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Partial array card marking&lt;/h3&gt;When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set" so that on the next minor collection it can be scanned, and its elements considered as GC roots.&lt;br /&gt;&lt;h4&gt;Old way&lt;/h4&gt;Storing a pointer into an object marks the card containing the header of the object. 
On a minor GC, all marked cards would be scanned; if a marked card was found, then every object whose header is contained in this card would be scanned.&lt;br /&gt;&lt;h4&gt;Problem&lt;/h4&gt;Storing a pointer into an array would require the array to be scanned in its entirety on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- is scanning the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.&lt;br /&gt;&lt;h4&gt;New way&lt;/h4&gt;Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fit inside the card are scanned. This greatly reduces the burden placed on the GC from mutation of large arrays. The implementation is tricky. 
I need to spend some time thinking about and simplifying the code; as it stands, the card scanning routine has three nested loops and two usages of goto!&lt;br /&gt;&lt;h4&gt;Implementation&lt;/h4&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD"&gt;copying_collector.hpp&lt;/a&gt;, &lt;code&gt;trace_cards()&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;New object promotion policy&lt;/h3&gt;When aging space is being collected, objects contained in marked cards in tenured space must be traced.&lt;br /&gt;&lt;h4&gt;Old way&lt;/h4&gt;These cards would be scanned, but could not be unmarked, since the objects they refer to were copied to the other aging semi-space, and would need to be traced on the next aging collection.&lt;br /&gt;&lt;h4&gt;The problem&lt;/h4&gt;The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection. Furthermore the objects they reference would likely live on for a long time, since they're referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.&lt;br /&gt;&lt;h4&gt;New way&lt;/h4&gt;Instead, an aging collection proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all remaining reachable objects from the first aging semi-space to the second. 
This promotes objects likely to live for a long time all the way to tenured space, and scans fewer cards on an aging collection since more cards can get unmarked.&lt;br /&gt;&lt;h4&gt;Implementation&lt;/h4&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.cpp;hb=HEAD"&gt;aging_collector.cpp&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Faster code heap remembered set&lt;/h3&gt;If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. 
This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.&lt;br /&gt;&lt;h4&gt;Old way&lt;/h4&gt;When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set in them would have their literals traced by the garbage collector.&lt;br /&gt;&lt;h4&gt;New way&lt;/h4&gt;The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of running time was spent scanning the code heap during minor collections. Now, the list of code blocks containing literals which refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.&lt;br /&gt;&lt;h4&gt;Implementation&lt;/h4&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_heap.cpp;hb=HEAD"&gt;code_heap.cpp&lt;/a&gt;, &lt;code&gt;write_barrier()&lt;/code&gt;&lt;br /&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD"&gt;copying_collector.hpp&lt;/a&gt;, &lt;code&gt;trace_code_heap_roots()&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Faster card marking and object allocation&lt;/h3&gt;The compiler's code generator now generates tighter code for common GC-related operations too. 
A write barrier looks like this in pseudo-C:&lt;pre&gt;cards[(obj - data_heap_start) &gt;&gt; card_bits] = card_mark_mask;&lt;/pre&gt;Writing out the pointer arithmetic by hand, we get:&lt;pre&gt;*(cards + ((obj - data_heap_start) &gt;&gt; card_bits)) = card_mark_mask;&lt;/pre&gt;Re-arranging some operations,&lt;pre&gt;*((obj &gt;&gt; card_bits) + (cards - (data_heap_start &gt;&gt; card_bits))) = card_mark_mask;&lt;/pre&gt;Now, the entire expression &lt;pre&gt;(cards - (data_heap_start &gt;&gt; card_bits))&lt;/pre&gt; is a constant. Factor stores this in a VM-global variable, named &lt;code&gt;cards_offset&lt;/code&gt;. The value used to be loaded from the global variable every time a write barrier executed. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change. However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:&lt;pre&gt;0x08e3ae84: lea    (%edx,%eax,1),%ebp&lt;br /&gt;0x08e3ae87: shr    $0x8,%ebp&lt;br /&gt;0x08e3ae8a: movb   $0xc0,0x20a08000(%ebp)&lt;br /&gt;0x08e3ae91: shr    $0xa,%ebp&lt;br /&gt;0x08e3ae94: movb   $0xc0,0x8de784(%ebp)&lt;/pre&gt;&lt;br /&gt;Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. 
Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does &lt;a href="http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html"&gt;float unboxing&lt;/a&gt; to make your code even faster):&lt;pre&gt;0x08664a33: mov    $0x802604,%ecx&lt;br /&gt;0x08664a38: mov    (%ecx),%eax&lt;br /&gt;0x08664a3a: movl   $0x18,(%eax)&lt;br /&gt;0x08664a40: or     $0x3,%eax&lt;br /&gt;0x08664a43: addl   $0x10,(%ecx)&lt;br /&gt;0x08664a46: movsd  %xmm0,0x5(%eax)&lt;/pre&gt;&lt;h4&gt;Implementation&lt;/h4&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD"&gt;cpu/x86/x86.factor&lt;/a&gt;: &lt;code&gt;%write-barrier&lt;/code&gt;, &lt;code&gt;%allot&lt;/code&gt;&lt;br /&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/ppc.factor;hb=HEAD"&gt;cpu/ppc/ppc.factor&lt;/a&gt;: &lt;code&gt;%write-barrier&lt;/code&gt;, &lt;code&gt;%allot&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Performance comparison&lt;/h3&gt;&lt;br /&gt;I compared the performance of a Mac OS X x86-32 build from October 5th, with the latest sources as of today.&lt;br /&gt;&lt;br /&gt;Bootstrap time saw a nice boost, going from 363 seconds, down to 332 seconds.&lt;br /&gt;&lt;br /&gt;The time to load and compile all libraries in the source tree (&lt;code&gt;load-all&lt;/code&gt;) was reduced from 537 seconds to 426 seconds.&lt;br /&gt;&lt;br /&gt;Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:&lt;br /&gt;&lt;pre&gt;: bench ( -- seq ) 10000000 [ &gt;bignum 1 + ] map ;&lt;/pre&gt;&lt;br /&gt;The best time out of 5 iterations on the old build was 12.9 seconds. 
Now, it has gone down to 1.9 seconds.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5675966809102446507?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5675966809102446507/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5675966809102446507' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5675966809102446507'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5675966809102446507'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/10/improved-write-barriers-in-factors.html' title='Improved write barriers in Factor&apos;s garbage collector'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-2579054926441220099</id><published>2009-09-30T23:52:00.007-04:00</published><updated>2009-10-01T17:13:37.022-04:00</updated><title type='text'>A survey of domain-specific languages in Factor</title><content type='html'>Factor has good support for implementing mini-languages as libraries. In this post I'll describe the general techniques and look at some specific examples. I don't claim any of this is original research -- Lisp and Forth programmers have been doing DSLs for decades, and recently the Ruby, Haskell and even Java communities are discovering some of these concepts and adding a few of their own to the mix. 
However, I believe Factor brings some interesting incremental improvements to the table, and the specific combination of capabilities found in Factor is unique.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Preliminaries&lt;/h3&gt;&lt;br /&gt;How does one embed a mini-language in Factor? Let us look at what goes on when a source file is parsed:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The parser reads the input program. This consists of definitions and top-level forms. The parser constructs syntax trees, and adds definitions to the dictionary. The result of parsing is the set of top-level forms in the file.&lt;/li&gt;&lt;li&gt;The compiler is run with all changed definitions. The compiler essentially takes syntax trees as input, and produces machine code as output. Once the compiler finishes compiling the new definitions, they are added to the VM's code heap and may be executed.&lt;/li&gt;&lt;li&gt;The top level forms in the file are run, if any.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In Factor, all of these stages are extensible. Note that all of this happens when the source file is loaded into memory -- Factor is biased towards compile-time meta-programming.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Extending the parser with parsing words&lt;/h4&gt;&lt;br /&gt;Parsing words which execute at parse time can be defined. Parsing words can take over the parser entirely and parse custom syntax. All of Factor's standard syntax, such as &lt;a href="http://docs.factorcode.org/content/word-__colon__,syntax.html"&gt;:&lt;/a&gt; for defining words and &lt;a href="["&gt;[&lt;/a&gt; for reading a quotation, is actually parsing words defined in the &lt;a href="http://docs.factorcode.org/content/vocab-syntax.html"&gt;syntax&lt;/a&gt; vocabulary. 
Commonly-used libraries such as &lt;a href="http://docs.factorcode.org/content/vocab-memoize.html"&gt;memoize&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/vocab-specialized-arrays.html"&gt;specialized-arrays&lt;/a&gt; add their own parsing words for custom definitions and data types. These don't qualify as domain-specific languages since they're too trivial, but they're very useful.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Homoiconic syntax&lt;/h4&gt;&lt;br /&gt;In most mainstream languages, the data types found in the syntax tree are quite different from the data types you operate on at runtime. For example, consider the following Java program:&lt;br /&gt;&lt;pre&gt;if(x &lt; 3) { return x + 3; } else { return foo(x); }&lt;/pre&gt;&lt;br /&gt;This might parse into something like&lt;br /&gt;&lt;pre&gt;IfNode(&lt;br /&gt;    condition: LessThan(Identifier(x),Integer(3))&lt;br /&gt;    trueBranch: ReturnNode(Add(Identifier(x),Integer(3)))&lt;br /&gt;    falseBranch: ReturnNode(MethodCall(foo,Identifier(x)))&lt;br /&gt;)&lt;/pre&gt;&lt;br /&gt;You cannot work with an "if node" in your program, the identifier &lt;code&gt;x&lt;/code&gt; does not exist at runtime, and so on.&lt;br /&gt;&lt;br /&gt;In Factor and Lisp, the parser constructs objects such as strings, numbers and lists directly, and identifiers (known as symbols in Lisp, and words in Factor) are first-class types. 
Consider the following Factor code,&lt;br /&gt;&lt;pre&gt;x 3 &amp;lt; [ x 3 + ] [ x foo ] if&lt;/pre&gt;&lt;br /&gt;This parses as a quotation of 6 elements,&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The word &lt;code&gt;x&lt;/code&gt;&lt;/li&gt;&lt;li&gt;The integer 3&lt;/li&gt;&lt;li&gt;The word &lt;code&gt;&amp;lt;&lt;/code&gt;&lt;/li&gt;&lt;li&gt;A quotation with three elements, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;, and &lt;code&gt;+&lt;/code&gt;&lt;/li&gt;&lt;li&gt;A quotation with two elements, &lt;code&gt;x&lt;/code&gt;, and &lt;code&gt;foo&lt;/code&gt;&lt;/li&gt;&lt;li&gt;The word &lt;code&gt;if&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;An array literal, like &lt;code&gt;{ 1 2 3 }&lt;/code&gt;, parses as an array of three integers; not an AST node representing an array with three child AST nodes representing integers.&lt;br /&gt;&lt;br /&gt;The flipside of homoiconic syntax is being able to print out (almost) any data in a form that can be parsed back in; in Factor, the &lt;a href="http://docs.factorcode.org/content/word-.,prettyprint.html"&gt;.&lt;/a&gt; word does this.&lt;br /&gt;&lt;br /&gt;What homoiconic syntax gives you as a library author is the ability to write mini-languages without writing a parser. Since you can input almost any data type with literal syntax, you can program your mini-language in parse trees directly. The mini-language can process nested arrays, quotations, tuples and words instead of parsing a string representation first.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Compile-time macros&lt;/h4&gt;&lt;br /&gt;All of Factor's data types, including quotations and words, can be constructed at runtime. Factor also supports &lt;a href="http://docs.factorcode.org/content/vocab-macros.html"&gt;compile-time macros&lt;/a&gt; in the Lisp sense, but unlike Lisp where they are used to prevent evaluation, a Factor macro is called just like an ordinary word, except the parameters have to be compile-time literals. 
The macro evaluates to a quotation, and the quotation is compiled in place of the macro call.&lt;br /&gt;&lt;br /&gt;Everything that can be done with macros can also be done by constructing quotations at runtime and using &lt;a href="http://docs.factorcode.org/content/word-call(,syntax.html"&gt;call(&lt;/a&gt; -- macros just provide a speed boost.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Parsing word-based DSLs&lt;/h3&gt;&lt;br /&gt;Many DSLs have a parsing word as their main entry point. The parsing word either takes over the parser completely to parse custom syntax, or it defines some new words and then lets the main parser take over again.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Local variables&lt;/h4&gt;&lt;br /&gt;A common misconception among people taking a casual look at Factor is that it doesn't offer any form of lexical scoping or named values at all. For example, Reg Braithwaite authoritatively states on &lt;a href="http://weblog.raganwald.com/2008/03/tool-time.html"&gt;his weblog&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;the Factor programming language imposes a single, constraining set of rules on the programmer: programmers switching to Factor must relinquish their local variables to gain Factor’s higher-order programming power.&lt;/i&gt;&lt;/blockquote&gt;&lt;br /&gt;In fact, Factor supports lexically scoped local variables via the &lt;a href="http://docs.factorcode.org/content/vocab-locals.html"&gt;locals&lt;/a&gt; vocabulary, and this library is used throughout the codebase. It looks like in a default image, about 1% of all words use lexical variables.&lt;br /&gt;&lt;br /&gt;The locals vocabulary implements a set of parsing words which augment the standard defining words. 
For example, &lt;a href="http://docs.factorcode.org/content/word-__colon____colon__,locals.html"&gt;::&lt;/a&gt; reads a word definition where input arguments are stored in local variables, instead of being on the stack:&lt;br /&gt;&lt;pre&gt;:: lerp ( a b c -- d )&lt;br /&gt;    a c *&lt;br /&gt;    b 1 c - *&lt;br /&gt;    + ;&lt;/pre&gt;&lt;br /&gt;The locals vocabulary also supports "let" statements, lambdas with full lexical closure semantics, and mutable variables. The locals vocabulary compiles lexical variable usage down to stack shuffling, and &lt;a href="http://docs.factorcode.org/content/word-curry,kernel.html"&gt;curry&lt;/a&gt; calls (for constructing quotations that close over a variable). This makes it quite efficient, especially since in many cases the Factor compiler can eliminate the closure construction using escape analysis. The choice of whether or not to use locals is one that can be made purely on a stylistic basis, since it has very little effect on performance.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Parsing expression grammars&lt;/h4&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar"&gt;Parsing expression grammars&lt;/a&gt; describe a certain class of languages, as well as a formalism for parsing these languages.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://bluishcoder.co.nz"&gt;Chris Double&lt;/a&gt; implemented a PEG library in Factor. The &lt;a href="http://docs.factorcode.org/content/vocab-peg.html"&gt;peg&lt;/a&gt; vocabulary offers a combinator-style interface for constructing parsers, and &lt;a href="http://www.bluishcoder.co.nz/2008/04/factor-parsing-dsl.html"&gt;peg.ebnf&lt;/a&gt; builds on top of this and defines a declarative syntax for specifying parsers.&lt;br /&gt;&lt;br /&gt;A simple example of a PEG grammar can be found in Chris Double's &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/peg/pl0;hb=HEAD"&gt;peg.pl0&lt;/a&gt; vocabulary. 
More elaborate examples can be found in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/peg/javascript;hb=HEAD"&gt;peg.javascript&lt;/a&gt; (JavaScript parser by Chris Double) and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/smalltalk;hb=HEAD"&gt;smalltalk.parser&lt;/a&gt; (Smalltalk parser by me).&lt;br /&gt;&lt;br /&gt;One downside of PEGs is that they have some performance problems; the standard formulation has exponential runtime in the worst case, and the "Packrat" variant that Factor uses runs in linear time but also linear space. For heavy-duty parsing, it appears as if LL and LR parsers are best, and it would be nice if Factor had an implementation of such a parser generator.&lt;br /&gt;&lt;br /&gt;However, PEGs are still useful for simple parsing tasks and prototyping. They are used throughout the Factor codebase for many things, including but not limited to:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Parsing regular expressions in the &lt;a href="http://docs.factorcode.org/content/vocab-regexp.html"&gt;regexp&lt;/a&gt; vocabulary (&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/regexp/parser;hb=HEAD"&gt;source&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;Parsing URLs in the &lt;a href="http://docs.factorcode.org/content/vocab-urls.html"&gt;urls&lt;/a&gt; vocabulary (&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/urls;hb=HEAD"&gt;source&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;Parsing HTTP headers in the &lt;a href="http://docs.factorcode.org/content/vocab-http.server.html"&gt;http.server&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/vocab-http.client.html"&gt;http.client&lt;/a&gt; vocabularies (&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/http/parsers;hb=HEAD"&gt;source&lt;/a&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;PEGs can also be used in conjunction with parsing words to embed source code written with 
custom grammars in Factor source files directly. The next DSL is an example of that.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Infix expressions&lt;/h4&gt;&lt;br /&gt;Philipp Brüschweiler's &lt;a href="http://docs.factorcode.org/content/vocab-infix.html"&gt;infix&lt;/a&gt; vocabulary defines a parsing word which parses an infix math expression using PEGs. The result is then compiled down to &lt;a href="http://docs.factorcode.org/content/vocab-locals.html"&gt;locals&lt;/a&gt;, which in turn compile down to stack code.&lt;br /&gt;&lt;br /&gt;Here is a word which solves a quadratic equation &lt;code&gt;ax^2 + bx + c = 0&lt;/code&gt; using the quadratic formula. Infix expressions can only return one value, so this word computes the first root only:&lt;br /&gt;&lt;pre&gt;USING: infix locals ;&lt;br /&gt;&lt;br /&gt;:: solve-quadratic ( a b c -- r )&lt;br /&gt;    [infix (-b + sqrt(b*b-4*a*c))/(2*a) infix] ;&lt;/pre&gt;&lt;br /&gt;Note that we're using two mini-languages here; &lt;code&gt;::&lt;/code&gt; begins a definition with named parameters stored in local variables, and &lt;code&gt;[infix&lt;/code&gt; parses an infix expression which can access these local variables.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;XML literals&lt;/h4&gt;&lt;br /&gt;&lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg's&lt;/a&gt; XML library defines a convenient syntax for constructing XML documents. Dan describes it in detail in &lt;a href="http://useless-factor.blogspot.com/2009/01/factor-supports-xml-literal-syntax.html"&gt;a blog post&lt;/a&gt;, with plenty of examples, so I won't repeat it here. 
The neat thing here is that by adding a pair of parsing words, &lt;code&gt;[XML&lt;/code&gt; and &lt;code&gt;&amp;lt;XML&lt;/code&gt;, he was able to integrate XML snippets into Factor, with parse-time well-formedness checking, no less.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.factorcode.org/content/vocab-xml.html"&gt;Dan's XML library&lt;/a&gt; is now used throughout the Factor codebase, particularly in the web framework, for both parsing and generating XML. For example, the &lt;a href="http://concatenative.org"&gt;concatenative.org wiki&lt;/a&gt; uses a markup language called "farkup". The &lt;a href="http://concatenative.org/wiki/view/Farkup"&gt;farkup&lt;/a&gt; markup language, developed by Dan, &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt; and myself, makes heavy use of XML literals. Farkup is implemented by first parsing the markup into an abstract syntax tree, and then converting this to HTML using a recursive tree walk that builds XML literals. We avoid constructing XML and HTML through raw string concatenation; instead we use XML literals everywhere now. This results in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/farkup;hb=HEAD"&gt;cleaner, more secure code&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Compare the design of our farkup library with &lt;a href="http://code.unicoders.org/browser/django/trunk/contrib/markdown.py"&gt;markdown.py&lt;/a&gt; used by &lt;a href="http://reddit.com"&gt;reddit.com&lt;/a&gt;. The latter is implemented with a series of regular expression hacks and lots of ad-hoc string processing which attempts to produce something resembling HTML in the end. New markup injection attacks are found all the time; there was a particularly clever one involving a JavaScript worm that knocked reddit right down a few days ago. 
I don't claim that Farkup is 100% secure by any means, and certainly it has had an order of magnitude less testing, but without a doubt centralizing XHTML generation makes it much easier to audit and identify potential injection problems.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;C library interface&lt;/h4&gt;&lt;br /&gt;Our C library interface (or FFI) was quite low-level initially but after a ton of work by Alex Chapman, Joe Groff, and myself, it has quite a DSLish flavor. C library bindings resemble C header files on mushrooms. Here is a taste:&lt;br /&gt;&lt;pre&gt;TYPEDEF: int cairo_bool_t&lt;br /&gt;&lt;br /&gt;CONSTANT: CAIRO_CONTENT_COLOR HEX: 1000&lt;br /&gt;CONSTANT: CAIRO_CONTENT_ALPHA HEX: 2000&lt;br /&gt;CONSTANT: CAIRO_CONTENT_COLOR_ALPHA HEX: 3000&lt;br /&gt;&lt;br /&gt;STRUCT: cairo_matrix_t&lt;br /&gt;    { xx double }&lt;br /&gt;    { yx double }&lt;br /&gt;    { xy double }&lt;br /&gt;    { yy double }&lt;br /&gt;    { x0 double }&lt;br /&gt;    { y0 double } ;&lt;br /&gt;    &lt;br /&gt;FUNCTION: void&lt;br /&gt;cairo_transform ( cairo_t* cr, cairo_matrix_t* matrix ) ;&lt;/pre&gt;&lt;br /&gt;The Factor compiler generates stubs for calling C functions on the fly from these declarative descriptions; there is no C code generator and no dependency on a C compiler. In fact, C library bindings are so easy to write that for many contributors, it is their first project in Factor. When &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt; first got involved in Factor, he began by writing a PostgreSQL binding, followed by an implementation of the MD5 checksum. 
Both libraries have been heavily worked on since then and are still in use.&lt;br /&gt;&lt;br /&gt;For complete examples of FFI usage, check out any of the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/unix;hb=HEAD"&gt;unix&lt;/a&gt; - POSIX bindings&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/windows;hb=HEAD"&gt;windows&lt;/a&gt; - Windows API bindings&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/db/sqlite/ffi;hb=HEAD"&gt;db.sqlite.ffi&lt;/a&gt; - SQLite bindings&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/opengl/gl;hb=HEAD"&gt;opengl.gl&lt;/a&gt; - OpenGL bindings&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;There are many more usages of the FFI of course. Since Factor has a minimal VM, all I/O, graphics and interaction with the outside world in general is done with the FFI. Search the Factor source tree for source files that use the &lt;a href="http://docs.factorcode.org/content/vocab-alien.syntax.html"&gt;alien.syntax&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;GPU shaders&lt;/h4&gt;&lt;br /&gt;&lt;a href="http://duriansoftware.com"&gt;Joe Groff&lt;/a&gt; cooked up a nice DSL for passing uniform parameters to pixel and vertex shaders. In his &lt;a href="http://duriansoftware.com/joe/Spring-cleaning.html"&gt;blog post&lt;/a&gt;, Joe writes:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;The library makes it easy to load and interactively update shaders, define binary formats for GPU vertex buffers, and feed parameters to shader code using Factor objects. 
&lt;/i&gt;&lt;/blockquote&gt;&lt;br /&gt;Here is a snippet from the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/gpu/demos/raytrace;hb=HEAD"&gt;gpu.demos.raytrace&lt;/a&gt; demo:&lt;br /&gt;&lt;pre&gt;GLSL-SHADER-FILE: raytrace-vertex-shader vertex-shader "raytrace.v.glsl"&lt;br /&gt;GLSL-SHADER-FILE: raytrace-fragment-shader fragment-shader "raytrace.f.glsl"&lt;br /&gt;GLSL-PROGRAM: raytrace-program&lt;br /&gt;    raytrace-vertex-shader raytrace-fragment-shader ;&lt;br /&gt;&lt;br /&gt;UNIFORM-TUPLE: sphere-uniforms&lt;br /&gt;    { "center" vec3-uniform  f }&lt;br /&gt;    { "radius" float-uniform f }&lt;br /&gt;    { "color"  vec4-uniform  f } ;&lt;br /&gt;&lt;br /&gt;UNIFORM-TUPLE: raytrace-uniforms&lt;br /&gt;    { "mv-inv-matrix"    mat4-uniform f }&lt;br /&gt;    { "fov"              vec2-uniform f }&lt;br /&gt;    &lt;br /&gt;    { "spheres"          sphere-uniforms 4 }&lt;br /&gt;&lt;br /&gt;    { "floor-height"     float-uniform f }&lt;br /&gt;    { "floor-color"      vec4-uniform 2 }&lt;br /&gt;    { "background-color" vec4-uniform f }&lt;br /&gt;    { "light-direction"  vec3-uniform f } ;&lt;/pre&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-GLSL-SHADER-FILE__colon__,gpu.shaders.html"&gt;GLSL-SHADER-FILE:&lt;/a&gt; parsing word tells Factor to load a GLSL shader program. The GPU framework automatically checks the file for modification, reloading it if necessary.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/word-UNIFORM-TUPLE__colon__,gpu.render.html"&gt;UNIFORM-TUPLE:&lt;/a&gt; parsing word defines a new tuple class, together with methods which destructure the tuple and bind textures and uniform parameters. 
Uniform parameters are named as such because they define values which remain constant at every pixel or vertex that the shader program operates on.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Instruction definitions in the compiler&lt;/h4&gt;&lt;br /&gt;This one is rather obscure and technical, but it has made my job easier over the last few weeks. &lt;a href="http://factor-language.blogspot.com/2009/09/eliminating-some-boilerplate-in.html"&gt;I blogged about it already&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Other examples&lt;/h3&gt;&lt;br /&gt;The next set of DSLs don't involve parsing words as much as just clever tricks with evaluation semantics.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Inverse&lt;/h4&gt;&lt;br /&gt;&lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg's&lt;/a&gt; &lt;a href="http://docs.factorcode.org/content/vocab-inverse.html"&gt;inverse&lt;/a&gt; library implements a form of pattern matching by computing the inverse of a Factor quotation. The fundamental combinator, &lt;a href="http://docs.factorcode.org/content/word-undo,inverse.html"&gt;undo&lt;/a&gt;, takes a Factor quotation, and executes it "in reverse". So if there is a constructed tuple on the stack, undoing the constructor will leave the slots on the stack. If the top of the stack doesn't match anything that the constructor could've produced, then the inverse fails, and pattern matching can move on to the next clause. This library works by introspecting quotations and the words they contain. Dan gives many details and examples in &lt;a href="http://www.scribd.com/doc/12803254/Pattern-matching-in-concatenative-languages"&gt;his paper on inverse&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Help system&lt;/h4&gt;&lt;br /&gt;&lt;a href="http://docs.factorcode.org/content/article-help.html"&gt;Factor's help system&lt;/a&gt; uses an s-expression-like markup language. Help markup is parsed by the Factor parser without any special parsing words. 
A markup element is an array where the first element is a distinguished word and the rest are parameters. Examples:&lt;br /&gt;&lt;pre&gt;"This is " { $strong "not" } " a good idea"&lt;br /&gt;&lt;br /&gt;{ $list&lt;br /&gt;    "milk"&lt;br /&gt;    "flour"&lt;br /&gt;    "eggs"&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;{ $link "help" }&lt;/pre&gt;&lt;br /&gt;This markup is rendered either directly on the Factor UI (like in this &lt;a href="http://factorcode.org/factor-windows7.png"&gt;screenshot&lt;/a&gt;) or via HTML, as on the &lt;a href="http://docs.factorcode.org"&gt;docs.factorcode.org&lt;/a&gt; site.&lt;br /&gt;&lt;br /&gt;The nice thing about being able to view help in the UI environment is the sheer interactive nature of it. Unlike something like javadoc, there is no offline processing step which takes your source file and spits out rendered markup. You just load a vocabulary into your Factor instance and the documentation is available instantly. You can look at the help for any documented word by simply typing something like &lt;pre&gt;\ append help&lt;/pre&gt; in the UI listener. While working on your own vocabulary, you can reload changes to the documentation and see them appear instantly in the UI's help browser.&lt;br /&gt;&lt;br /&gt;Finally, it is worth mentioning that because of the high degree of semantic information encoded in documentation, many kinds of mistakes can be caught in an automated fashion. The &lt;a href="http://docs.factorcode.org/content/article-help.lint.html"&gt;help lint&lt;/a&gt; tool finds inconsistencies between the actual parameters that a function takes, and the documented parameters, as well as code examples that don't evaluate to what the documentation claims they evaluate to, and a few other things.&lt;br /&gt;&lt;br /&gt;You won't find a lot of comments in Factor source, because the help system is much more useful. 
Instead of plain-text comments that can go out of date, why not have rich text with hyperlinks and semantic information?&lt;br /&gt;&lt;br /&gt;For examples of help markup, look at any file whose name ends with &lt;code&gt;-docs.factor&lt;/code&gt; in the Factor source tree. There are plenty.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;x86 and PowerPC assemblers&lt;/h4&gt;&lt;br /&gt;I put this one last since it's not really a DSL at all, just a nice API. The lowest level of Factor's compiler generates machine code from the compiler's intermediate form in a CPU-specific way. The CPU backends for x86 and PowerPC use the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/x86/assembler;hb=HEAD"&gt;cpu.x86.assembler&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/ppc/assembler;hb=HEAD"&gt;cpu.ppc.assembler&lt;/a&gt; vocabularies for this, respectively. The way the assemblers work is that they define a set of words corresponding to CPU instructions. Instruction words take operands from the stack -- which are objects representing registers, immediate values, and in the case of the x86, addressing modes. They then combine the operands and the instruction into a binary opcode, and add it to a byte vector stored in a dynamically-scoped variable. So instead of calling methods on, and passing around, an 'assembler object' as you would in, say, a JIT coded in C++, you wrap the code generation in a call to Factor's &lt;a href="http://docs.factorcode.org/content/word-make,make.html"&gt;make&lt;/a&gt; word, and simply invoke instruction words therein. The result looks like assembler source, except it is postfix. 
Here is an x86 example:&lt;br /&gt;&lt;pre&gt;[&lt;br /&gt;    EAX ECX ADD&lt;br /&gt;    XMM0 XMM1 HEX: ff SHUFPS&lt;br /&gt;    AL 7 OR&lt;br /&gt;    RAX 15 [+] RDI MOV&lt;br /&gt;] B{ } make .&lt;/pre&gt;&lt;br /&gt;When evaluated, the above will print out the following:&lt;br /&gt;&lt;pre&gt;B{ 1 200 15 198 193 255 131 200 7 72 137 120 15 }&lt;/pre&gt;&lt;br /&gt;Of course, &lt;a href="http://docs.factorcode.org/content/word-B{,syntax.html"&gt;B{&lt;/a&gt; is the homoiconic syntax for a byte array. Note the way indirect memory operands are constructed; first, we push the register (&lt;code&gt;RAX&lt;/code&gt;) then the displacement (here the constant 15, but register displacement is supported by x86 too). Then we call a word &lt;code&gt;[+]&lt;/code&gt; which constructs an object representing the addressing mode &lt;code&gt;[RAX+15]&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;The rationale for choosing this somewhat funny syntax for indirect operands (there is also a &lt;code&gt;[]&lt;/code&gt; word for memory loads without a displacement), rather than some kind of parser hack that allows one to write &lt;code&gt;[RAX]&lt;/code&gt; or &lt;code&gt;[R14+RDI]&lt;/code&gt; directly, is that in reality the compiler only rarely deals with hard-coded register assignments. Instead, the register allocator makes decisions a level above, and passes them to the code generator. 
Here is a typical compiler code generation template from the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD"&gt;cpu.x86&lt;/a&gt; vocabulary:&lt;br /&gt;&lt;pre&gt;M:: x86 %check-nursery ( label temp1 temp2 -- )&lt;br /&gt;    temp1 load-zone-ptr&lt;br /&gt;    temp2 temp1 cell [+] MOV&lt;br /&gt;    temp2 1024 ADD&lt;br /&gt;    temp1 temp1 3 cells [+] MOV&lt;br /&gt;    temp2 temp1 CMP&lt;br /&gt;    label JLE ;&lt;/pre&gt;&lt;br /&gt;Here, I'm using the locals vocabulary together with the assembler. 
The &lt;code&gt;temp1&lt;/code&gt; and &lt;code&gt;temp2&lt;/code&gt; parameters are registers and &lt;code&gt;label&lt;/code&gt; is, as its name implies, a label to jump to. This snippet generates machine code that checks whether or not the new object allocation area has enough space; if so, it jumps to the label, otherwise it falls through (code to save live registers and call the GC is generated next). The &lt;code&gt;load-zone-ptr&lt;/code&gt; word is like an assembler macro here; it takes a register and generates some more code with it.&lt;br /&gt;&lt;br /&gt;The PowerPC assembler is a tad more interesting. Since the x86 instruction set is so complex, with many addressing modes and so on, the x86 assembler is implemented in a rather tedious manner. Obvious duplication is abstracted out. However, there is a lot of case-by-case code for different groups of instructions, with no coherent underlying abstraction allowing the instruction set to be described in a declarative way.&lt;br /&gt;&lt;br /&gt;On PowerPC, the situation is better. Since the instruction set is a lot more regular (fixed width instructions, only a few distinct instruction formats, no addressing modes), the PowerPC assembler itself is built using a DSL specifically for describing PowerPC instructions:&lt;br /&gt;&lt;pre&gt;D: ADDI 14&lt;br /&gt;D: ADDIC 12&lt;br /&gt;D: ADDIC. 
13&lt;br /&gt;D: ADDIS 15&lt;br /&gt;D: CMPI 11&lt;br /&gt;D: CMPLI 10&lt;br /&gt;...&lt;/pre&gt;&lt;br /&gt;The PowerPC instruction format DSL is defined in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/ppc/assembler/backend;hb=HEAD"&gt;cpu.ppc.assembler.backend&lt;/a&gt; vocabulary, and as a result the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/assembler/assembler.factor;hb=HEAD"&gt;cpu.ppc.assembler&lt;/a&gt; vocabulary itself is mostly trivial.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Last words&lt;/h3&gt;&lt;br /&gt;Usually my blog posts describe recent progress in the Factor implementation, and I tend to write about what I'm working on right now. I'm currently working on code generation for SIMD vector instructions in the Factor compiler. I was going to blog about this instead, but decided not to do it until the SIMD implementation and API settle down some more.&lt;br /&gt;&lt;br /&gt;With this post I decided to try something a bit different, and instead just describe an aspect of Factor that interests me, without any of it being particularly breaking news. If you've been following Factor development closely, there is literally nothing in this post that you would not have seen already. However, I figured people who don't track the project so closely might appreciate a general survey like this. I'm also thinking of writing a post describing Factor's various high-level I/O libraries. 
I'd appreciate any feedback, suggestions and ideas on this matter.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2579054926441220099?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2579054926441220099/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2579054926441220099' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2579054926441220099'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2579054926441220099'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/09/survey-of-domain-specific-languages-in.html' title='A survey of domain-specific languages in Factor'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-1091797422319916611</id><published>2009-09-12T23:20:00.010-04:00</published><updated>2009-09-13T17:05:16.230-04:00</updated><title type='text'>Advanced floating point features: exceptions, rounding modes, denormals, unordered compares, and more</title><content type='html'>Factor now has a nice library for introspecting and modifying the floating point environment. &lt;a href="http://duriansoftware.com/joe"&gt;Joe Groff&lt;/a&gt; implemented most of it, and I helped out with debugging and additional floating point comparison operations. 
All of these features are part of the &lt;a href="http://en.wikipedia.org/wiki/IEEE_754-2008"&gt;IEEE floating point specification&lt;/a&gt; and are implemented on all modern CPUs; however, few programming languages expose a nice interface for working with them. C compilers typically provide hard-to-use low-level intrinsics and other languages don't tend to bother at all. Two exceptions are the &lt;a href="http://sbcl.org"&gt;SBCL compiler&lt;/a&gt; and the &lt;a href="http://www.digitalmars.com/d/"&gt;D language&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The new functionality is mostly contained in the &lt;code&gt;math.floats.env&lt;/code&gt; vocabulary, with a few new words in &lt;code&gt;math&lt;/code&gt; for good measure. The new code is in the repository but it is not completely stable yet; there are still some issues we need to work out on the more obscure platforms.&lt;br /&gt;&lt;br /&gt;To follow along with the examples below, you'll need to get a git checkout from the master branch and load the vocabulary in your listener:&lt;br /&gt;&lt;pre&gt;USE: math.floats.env&lt;/pre&gt;&lt;br /&gt;The first two features, floating point exceptions and traps, are useful for debugging numerical algorithms and detecting potentially undesirable situations (NaNs appearing, underflow, overflow, etc).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Floating point exceptions&lt;/h3&gt;&lt;br /&gt;One of the first things people learn about floating point is that it has "special" values: positive and negative infinity, and not-a-number (NaN) values. These appear as the results of computations where the answer is undefined (division by zero, square root of -1, etc) or the answer is too small or large to be represented as a float (2 to the power of 10000, etc). A less widely-known fact is that when a special value is computed, "exception flags" are set in a hardware register which can be read back in. 
Most languages do not offer any way to access this functionality.&lt;br /&gt;&lt;br /&gt;In Factor, exception flags can be read using the &lt;code&gt;collect-fp-exceptions&lt;/code&gt; combinator, which first clears the flags, calls a quotation, then outputs any flags which were set. For example, division by zero sets the division by zero exception flag and returns infinity:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 1.0 0.0 / ] collect-fp-exceptions . .&lt;br /&gt;{ +fp-zero-divide+ }&lt;br /&gt;1/0.&lt;/pre&gt;&lt;br /&gt;Dividing 1 by 3 sets the inexact flag, because the result (0.333....) cannot be represented as a float:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 1.0 3.0 / ] collect-fp-exceptions . .&lt;br /&gt;{ +fp-inexact+ }&lt;br /&gt;0.3333333333333333&lt;/pre&gt;&lt;br /&gt;The fact that 1/3 does not have a terminating decimal or binary expansion is well-known, however one thing that many beginners find surprising is that some numbers which have terminating decimal expansions nevertheless cannot be represented precisely as floats because they do not terminate in binary (one classic case is 1.0 - 0.9 - 0.1 != 0.0):&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 4.0 10.0 / ] collect-fp-exceptions . .&lt;br /&gt;{ +fp-inexact+ }&lt;br /&gt;0.4&lt;/pre&gt;&lt;br /&gt;Raising a number to a power that is too large sets both the inexact and overflow flags, and returns infinity:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 2.0 10000.0 ^ ] collect-fp-exceptions . .&lt;br /&gt;{ +fp-inexact+ +fp-overflow+ }&lt;br /&gt;1/0.&lt;/pre&gt;&lt;br /&gt;The square root of 4 is an exact value; no exceptions were set:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 4.0 sqrt ] collect-fp-exceptions . .&lt;br /&gt;{ }&lt;br /&gt;2.0&lt;/pre&gt;&lt;br /&gt;The square root of 2 is not exact on the other hand:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ 2.0 sqrt ] collect-fp-exceptions . 
.&lt;br /&gt;{ +fp-inexact+ }&lt;br /&gt;1.414213562373095&lt;/pre&gt;&lt;br /&gt;Factor supports complex numbers, so taking the square root of -1 returns an exact value and does not set any exceptions:&lt;br /&gt;&lt;pre&gt;( scratchpad ) [ -1.0 sqrt ] collect-fp-exceptions . .&lt;br /&gt;{ }&lt;br /&gt;C{ 0.0 1.0 }&lt;/pre&gt;&lt;br /&gt;However, we can observe the invalid operation exception flag being set if we call the internal &lt;code&gt;fsqrt&lt;/code&gt; word, which operates on floats only and calls the libc function (or uses the &lt;code&gt;SQRTSD&lt;/code&gt; instruction on SSE2):&lt;br /&gt;&lt;pre&gt;( scratchpad ) USE: math.libm [ -1.0 fsqrt ] collect-fp-exceptions . .&lt;br /&gt;{ +fp-invalid-operation+ }&lt;br /&gt;NAN: 8000000000000&lt;/pre&gt;&lt;br /&gt;I describe the new &lt;code&gt;NAN:&lt;/code&gt; syntax later in this post.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Signaling traps&lt;/h3&gt;&lt;br /&gt;Being able to inspect floating point exceptions set after a piece of code runs is all well and good, but what if you have a tricky underflow bug, or a NaN is popping up somewhere, and you want to know exactly where? In this case it is possible to set a flag in the FPU's control register which triggers a trap when an exception is raised. This trap is delivered to the Factor process as a signal (Unix), Mach exception (Mac OS X), or SEH exception (Windows). Factor then throws it as an exception which can be caught using any of Factor's &lt;a href="http://docs.factorcode.org/content/article-errors.html"&gt;error handling&lt;/a&gt; words, or just left unhandled in which case it will bubble up to the listener.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;with-fp-traps&lt;/code&gt; combinator takes a list of traps and runs a quotation with those traps enabled; when the quotation completes (or throws an error) the former FPU state is restored again (indeed it has to be this way, since running the Factor UI's rendering code with traps enabled quickly kills it). 
The &lt;code&gt;all-fp-exceptions&lt;/code&gt; word is equivalent to specifying &lt;code&gt;{ +fp-invalid-operation+ +fp-overflow+ +fp-underflow+ +fp-zero-divide+ +fp-inexact+ }&lt;/code&gt;. Here is an example:&lt;br /&gt;&lt;pre&gt;( scratchpad ) all-fp-exceptions [ 0.0 0.0 / ] with-fp-traps&lt;br /&gt;Floating point trap&lt;/pre&gt;&lt;br /&gt;Without the combinator wrapped around it, &lt;code&gt;0.0 0.0 /&lt;/code&gt; simply returns a NaN value without throwing anything.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Rounding modes&lt;/h3&gt;&lt;br /&gt;Unlike exceptions and traps, which do not change the result of a computation but merely set status flags (or interrupt it), the next two features, the rounding mode and denormal mode, actually change the results of computations. As with exceptions and traps, they are implemented as scoped combinators rather than global state changes to ensure that code using these features is 'safe' and cannot change floating point state of surrounding code.&lt;br /&gt;&lt;br /&gt;If a floating point operation produces an inexact result, there is the question of how the result will be rounded to a value representable as a float. 
There are four rounding modes in IEEE floating point:&lt;ul&gt;&lt;li&gt;&lt;code&gt;+round-nearest+&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;+round-down+&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;+round-up+&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;+round-zero+&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Here is an example of an inexact computation done with two different rounding modes; the default (&lt;code&gt;+round-nearest+&lt;/code&gt;) and &lt;code&gt;+round-up+&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;( scratchpad ) 1.0 3.0 / .&lt;br /&gt;0.3333333333333333&lt;br /&gt;( scratchpad ) +round-up+ [ 1.0 3.0 / ] with-rounding-mode .&lt;br /&gt;0.3333333333333334&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Denormals&lt;/h3&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Denormal_number"&gt;Denormal numbers&lt;/a&gt; are numbers where the exponent consists of zero bits (the minimum value) but the mantissa is not all zeros. Denormal numbers are undesirable because they have lower precision than normal floats, and on some CPUs computations with denormals are less efficient than with normals. IEEE floating point supports two denormal modes: you can elect to have denormals "flush" to zero (&lt;code&gt;+denormal-flush+&lt;/code&gt;), or you can "keep" denormals (&lt;code&gt;+denormal-keep+&lt;/code&gt;). 
The latter is the default:&lt;br /&gt;&lt;pre&gt;( scratchpad ) +denormal-flush+ [ 51 2^ bits&gt;double 0.0 + ] with-denormal-mode .&lt;br /&gt;0.0&lt;br /&gt;( scratchpad ) 51 2^ bits&gt;double 0.0 + .&lt;br /&gt;1.112536929253601e-308&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ordered and unordered comparisons&lt;/h3&gt;&lt;br /&gt;In math, for any two numbers &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, exactly one of the following three properties holds:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;a &amp;lt; b&lt;/li&gt;&lt;li&gt;a = b&lt;/li&gt;&lt;li&gt;a &amp;gt; b&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In floating point, there is a fourth possibility; &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are &lt;i&gt;unordered&lt;/i&gt;. This occurs if one of the two values is a NaN. The &lt;code&gt;unordered?&lt;/code&gt; predicate tests for this possibility:&lt;br /&gt;&lt;pre&gt;( scratchpad ) NAN: 8000000000000 1.0 unordered? .&lt;br /&gt;t&lt;/pre&gt;&lt;br /&gt;If an ordered comparison word such as &lt;code&gt;&amp;lt;&lt;/code&gt; or &lt;code&gt;&amp;gt;=&lt;/code&gt; is called with two values which are unordered, it returns &lt;code&gt;f&lt;/code&gt; and sets the &lt;code&gt;+fp-invalid-operation+&lt;/code&gt; exception:&lt;br /&gt;&lt;pre&gt;( scratchpad ) NAN: 8000000000000 1.0 [ &amp;lt; ] collect-fp-exceptions . 
.&lt;br /&gt;{ +fp-invalid-operation+ }&lt;br /&gt;f&lt;/pre&gt;&lt;br /&gt;If traps are enabled this will throw an error:&lt;br /&gt;&lt;pre&gt;( scratchpad ) NAN: 8000000000000 1.0 { +fp-invalid-operation+ } [ &amp;lt; ] with-fp-traps    &lt;br /&gt;Floating point trap&lt;/pre&gt;&lt;br /&gt;If your numerical algorithm has a legitimate use for NaNs, and you wish to run it with traps enabled, and have certain comparisons not signal traps when inputs are NaNs, you can use &lt;i&gt;unordered&lt;/i&gt; comparisons in those cases instead:&lt;br /&gt;&lt;pre&gt;( scratchpad ) NAN: 8000000000000 1.0 [ u&amp;lt; ] collect-fp-exceptions . 
.&lt;br /&gt;{ }&lt;br /&gt;f&lt;/pre&gt;&lt;br /&gt;Unordered versions of all the comparisons are defined now, &lt;code&gt;u&amp;lt;&lt;/code&gt;, &lt;code&gt;u&amp;lt;=&lt;/code&gt;, &lt;code&gt;u&gt;&lt;/code&gt;, and &lt;code&gt;u&gt;=&lt;/code&gt;. Equality of numbers is always unordered, so it does not raise traps if one of the inputs is a NaN. In particular, if both inputs are NaNs, equality always returns &lt;code&gt;f&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;( scratchpad ) NAN: 8000000000000 dup [ number= ] collect-fp-exceptions . .&lt;br /&gt;{ }&lt;br /&gt;f&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Half-precision floats&lt;/h3&gt;&lt;br /&gt;Everyone and their drunk buddy know about IEEE single (32-bit) and double (64-bit) floats; IEEE also defines half-precision 16-bit floats. These are not used nearly as much; they come up in graphics programming for example, since GPUs use them for certain calculations with color components where you don't need more accuracy. The &lt;code&gt;half-floats&lt;/code&gt; vocabulary provides some support for working with half-floats. It defines a pair of words for converting Factor's double-precision floats to and from half-floats, as well as C type support for passing half-floats to C functions via FFI, and building packed arrays of half-floats for passing to the GPU.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Literal syntax for NaNs and hexadecimal floats&lt;/h3&gt;&lt;br /&gt;You may have noticed the funny &lt;code&gt;NAN:&lt;/code&gt; syntax above. Previously all NaN values would print as &lt;code&gt;0/0.&lt;/code&gt;, however this is inaccurate since not all NaNs are created equal; because of how IEEE floating point works, a value is a NaN if the exponent consists of all ones, leaving the mantissa unspecified. The mantissa is known as the "NaN payload" in this case. NaNs now print out, and can be parsed back in, using a syntax that makes the payload explicit. 
A NaN can also be constructed with an arbitrary payload using the &lt;code&gt;&amp;lt;fp-nan&gt;&lt;/code&gt; word:&lt;br /&gt;&lt;pre&gt;( scratchpad ) HEX: deadbeef &amp;lt;fp-nan&gt; .&lt;br /&gt;NAN: deadbeef&lt;/pre&gt;&lt;br /&gt;The old &lt;code&gt;0/0.&lt;/code&gt; syntax still works; it is shorthand for &lt;code&gt;NAN: 8000000000000&lt;/code&gt;, the canonical "quiet" NaN.&lt;br /&gt;&lt;br /&gt;Some operations produce NaNs with different payloads:&lt;br /&gt;&lt;pre&gt;( scratchpad ) USE: math.libm&lt;br /&gt;( scratchpad ) 2.0 facos .&lt;br /&gt;NAN: 8000000000022&lt;/pre&gt;&lt;br /&gt;In general, there is very little you can do with the NaN payload.&lt;br /&gt;&lt;br /&gt;A more useful feature is hexadecimal float literals. When reading a float from a decimal string, or printing a float to a decimal string, there is sometimes ambiguity due to rounding. No such problem exists with hexadecimal floats.&lt;br /&gt;&lt;br /&gt;An example of printing a number as a decimal and a hexadecimal float:&lt;br /&gt;&lt;pre&gt;( scratchpad ) USE: math.constants&lt;br /&gt;( scratchpad ) pi .&lt;br /&gt;3.141592653589793&lt;br /&gt;( scratchpad ) pi .h&lt;br /&gt;1.921fb54442d18p1&lt;/pre&gt;&lt;br /&gt;Java supports hexadecimal float literals as of Java 1.5. Hats off to the Java designers for adding this! It would be nice if they would add the rest of the IEEE floating point functionality in Java 7.&lt;br /&gt;&lt;h3&gt;Signed zero&lt;/h3&gt;&lt;br /&gt;Unlike two's-complement integer arithmetic, IEEE floating point has both positive and negative zero. Negative zero arises as the result of computations on very small negative numbers that underflow. Signed zeros also have applications in complex analysis because they allow a choice of branch cut to be made. Factor's &lt;code&gt;abs&lt;/code&gt; word used to be implemented incorrectly on floats; it would check if the input was negative, and if so multiply it by negative one. 
However this was a problem because negative zero is not less than zero, and so the absolute value of negative zero would be reported as negative zero. The correct implementation of the absolute value function on floats is to simply clear the sign bit. It works properly now:&lt;br /&gt;&lt;pre&gt;( scratchpad ) -0.0 abs .&lt;br /&gt;0.0&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Implementation&lt;/h3&gt;&lt;br /&gt;The implementation of the above features consists of several parts:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cross-platform Factor code in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env;hb=HEAD"&gt;math.floats.env&lt;/a&gt; vocabulary implementing the high-level API&lt;/li&gt;&lt;li&gt;Assembly code in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.32.S;hb=HEAD"&gt;vm/cpu-x86.32.S&lt;/a&gt;, &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.64.S;hb=HEAD"&gt;vm/cpu-x86.64.S&lt;/a&gt;, and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S;hb=HEAD"&gt;vm/cpu-ppc.S&lt;/a&gt; to read and write x87, SSE2 and PowerPC FPU control registers&lt;/li&gt;&lt;li&gt;Low-level code in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env/x86;hb=HEAD"&gt;math.floats.env.x86&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env/ppc;hb=HEAD"&gt;math.floats.env.ppc&lt;/a&gt; which implements the high-level API in terms of the assembly functions, by calling them via &lt;a href="http://docs.factorcode.org/content/article-alien.html"&gt;Factor's FFI&lt;/a&gt; and parsing control registers into a cross-platform representation in terms of Factor symbols&lt;/li&gt;&lt;li&gt;Miscellaneous words for taking floats apart into their bitwise representation in the &lt;code&gt;math&lt;/code&gt; vocabulary&lt;/li&gt;&lt;li&gt;Compiler 
support for ordered and unordered floating point compare instructions in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD"&gt;compiler.cfg.instructions&lt;/a&gt;&lt;/li&gt;&lt;li&gt;CPU-specific code generation for ordered and unordered floating point compare instructions in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD"&gt;cpu.x86&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/ppc.factor;hb=HEAD"&gt;cpu.ppc&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-1091797422319916611?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/1091797422319916611/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=1091797422319916611' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1091797422319916611'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1091797422319916611'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html' title='Advanced floating point features: exceptions, rounding modes, denormals, unordered compares, and more'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-2014814548295621053</id><published>2009-09-02T06:59:00.003-04:00</published><updated>2009-09-02T07:20:06.195-04:00</updated><title type='text'>Eliminating some boilerplate in the compiler</title><content type='html'>Adding new instructions to the low-level optimizer was too hard. Multiple places had to be updated, and I would do all this by hand:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The instruction tuple itself is defined in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD"&gt;compiler.cfg.instructions&lt;/a&gt; vocabulary with the &lt;code&gt;INSN:&lt;/code&gt; parsing word, which also creates a word with the same name that constructs the instruction and adds it to the current sequence.&lt;/li&gt;&lt;li&gt;Instructions which have a destination register have convenient constructors in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/hats/hats.factor;hb=HEAD"&gt;compiler.cfg.hats&lt;/a&gt; which create a fresh virtual register, create an instruction with this register as the destination, and output the register. So for example, &lt;code&gt;1 2 ^^add&lt;/code&gt; would create an add instruction with a fresh destination register, and output this register. It might be equivalent to something like &lt;code&gt;0 1 2 ##add&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Instructions that use virtual registers must be added to the &lt;code&gt;vreg-insn&lt;/code&gt; union, and respond to methods &lt;code&gt;defs-vreg&lt;/code&gt;, &lt;code&gt;uses-vregs&lt;/code&gt;, and &lt;code&gt;temp-vregs&lt;/code&gt; in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/def-use/def-use.factor;hb=HEAD"&gt;compiler.cfg.def-use&lt;/a&gt;. 
This 'def-use' information is used by SSA construction, dead code elimination, copy coalescing, register allocation, among other things.&lt;/li&gt;&lt;li&gt;Methods have to be defined for the instruction in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/renaming/functor/functor.factor;hb=HEAD"&gt;compiler.cfg.renaming.functor&lt;/a&gt;. This functor is used to generate code for renaming virtual registers in instructions. The renaming code is used for SSA construction, representation selection, register allocation, among other things.&lt;/li&gt;&lt;li&gt;Instructions which use non-integer representations (eg, operations on floats) must respond to methods &lt;code&gt;defs-vreg-rep&lt;/code&gt;, &lt;code&gt;uses-vreg-reps&lt;/code&gt;, and &lt;code&gt;temp-vreg-reps&lt;/code&gt; in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/representations/preferred/preferred.factor"&gt;compiler.cfg.representations.preferred&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Instructions must respond to the &lt;code&gt;generate-insn&lt;/code&gt; method, defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/codegen/codegen.factor;hb=HEAD"&gt;compiler.codegen&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Instructions which participate in value numbering must define an "expression" variant, and respond to the &lt;code&gt;&gt;expr&lt;/code&gt; method defined in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/value-numbering/expressions/expressions.factor;hb=HEAD"&gt;compiler.cfg.value-numbering.expressions&lt;/a&gt;&lt;/ul&gt;&lt;br /&gt;As you can see, this is a lot of duplicated work. 
I used inheritance and mixins to model relationships between instructions and reduce some of this duplication by defining methods on common superclasses rather than individual instructions where possible, but this never seemed to work out well.&lt;br /&gt;&lt;br /&gt;If you look at the latest versions of the source files I linked to above, you'll see that all the repetitive copy-pasted code has been replaced with meta-programming. The new approach extends the &lt;code&gt;INSN:&lt;/code&gt; parsing word so now all relevant information is specified in one place. Also there is a new &lt;code&gt;PURE-INSN:&lt;/code&gt; parsing word, to mark instructions which participate in value numbering; previously this was done with a superclass. For example,&lt;br /&gt;&lt;pre&gt;PURE-INSN: ##add&lt;br /&gt;def: dst/int-rep&lt;br /&gt;use: src1/int-rep src2/int-rep ;&lt;/pre&gt;&lt;br /&gt;This defines an instruction tuple, an expression tuple, def/use information, representation information, a method for converting instructions to expressions, and a constructor, all at the same time.&lt;br /&gt;&lt;br /&gt;For the code generator's &lt;code&gt;generate-insn&lt;/code&gt; method, not every instruction has a straightforward implementation; some, like GC checks and FFI calls, postpone a fair amount of work until the very last stage of compilation. However, for most instructions, this method simply extracts all the slots from the instruction tuple, then calls a CPU-specific hook in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/architecture/architecture.factor;hb=HEAD"&gt;cpu.architecture&lt;/a&gt; vocabulary to generate code. 
For these cases, a &lt;code&gt;CODEGEN:&lt;/code&gt; parsing word sets up the relevant boilerplate;&lt;br /&gt;&lt;pre&gt;CODEGEN: ##add %add&lt;/pre&gt;&lt;br /&gt;is equivalent to&lt;br /&gt;&lt;pre&gt;M: ##add generate-insn [ dst&gt;&gt; ] [ src1&gt;&gt; ] [ src2&gt;&gt; ] tri %add ;&lt;/pre&gt;&lt;br /&gt;This nicely cleans up all the repetition I mentioned in the bullet points at the top.&lt;br /&gt;&lt;br /&gt;I've been aware of this boilerplate for a while but wanted the design of the compiler to settle down first. Now that most of the machinery is in place, I feel comfortable cooking up some complex meta-programming to clean things up. Adding new instructions should be easier. I plan on adding some SSE2 vector operations soon, and this was the main motivation behind this cleanup.&lt;br /&gt;&lt;br /&gt;How would you do this in other languages? In Lisp, you would use a macro which expands into a bunch of &lt;code&gt;defclass&lt;/code&gt;, &lt;code&gt;defmethod&lt;/code&gt;, etc forms. 
In Java, you might use annotations:&lt;br /&gt;&lt;pre&gt;public class AddInsn extends Insn {&lt;br /&gt;    @Def @Representation("int") Register dst;&lt;br /&gt;    @Use @Representation("int") Register src1;&lt;br /&gt;    @Use @Representation("int") Register src2;&lt;br /&gt;}&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2014814548295621053?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2014814548295621053/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2014814548295621053' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2014814548295621053'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2014814548295621053'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/09/eliminating-some-boilerplate-in.html' title='Eliminating some boilerplate in the compiler'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3402492776986795161</id><published>2009-08-29T21:44:00.015-04:00</published><updated>2009-08-30T06:13:46.198-04:00</updated><title type='text'>Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster</title><content type='html'>My &lt;a href="http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html"&gt;struct arrays 
benchmark&lt;/a&gt; generated a fair amount of discussion on &lt;a href="http://www.reddit.com/r/programming/comments/9f1dp/factor_performance_comparison_between_factor_and/"&gt;reddit&lt;/a&gt;, with some people disputing the benchmark's validity. Certainly I'm not claiming that Factor is faster than Java HotSpot in general (on most tasks it is slower, sometimes much more so); however, I think the benchmark legitimately demonstrates the performance advantage of value semantics over reference semantics.&lt;br /&gt;&lt;br /&gt;A few people pointed out that the Java version of the benchmark was spending a lot of time in trigonometric functions, and that Java computes sine and cosine "by hand" rather than using x87 FPU instructions. However, the same is also true of the Factor implementation; it's just that none of the Java advocates bothered to check. I don't use x87 instructions either, and call into libc for sin and cos, just like Java. Indeed, Factor's sin and cos are even more heavyweight than Java's, because they also support complex numbers; Factor's compiler converts them to their real-valued equivalents if it can prove the inputs are real.&lt;br /&gt;&lt;br /&gt;Despite this, I modified the benchmark to not call trig functions, and instead just initialize each point as &lt;code&gt;(n+1,n+2,(n+3)*(n+3)/2)&lt;/code&gt;. Factor is still faster by roughly the same margin as before:&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;td&gt;Java&lt;/td&gt;&lt;td&gt;829ms (best run out of 8)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Factor&lt;/td&gt;&lt;td&gt;284ms&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;A few people also mentioned that by only running the test 8 times I wasn't giving HotSpot enough time to "warm up". 
However, the benchmark does not make any polymorphic calls in the inner loops, so there's really no "warm up" needed at all; and indeed I got the best time on the third iteration.&lt;br /&gt;&lt;br /&gt;I improved Factor's code generation for trigonometric function calls. Trig function calls are treated like instructions by the low-level optimizer now, which means they participate in value numbering and do not split basic blocks. Instead, the register allocator spills all live registers at the call site at the very end of code generation. This strategy does not work in general for all FFI calls, because the case where the FFI call invokes a Factor callback needs additional runtime support. However, neither sin nor cos does this. After implementing this, I noticed that my struct-arrays benchmark was calling sin() twice on the same input, and value numbering was folding the second call away. Neat optimization to get "for free".&lt;br /&gt;&lt;br /&gt;This code generation change improves performance of both the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/struct-arrays/struct-arrays.factor;hb=HEAD"&gt;struct-arrays&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/partial-sums/partial-sums.factor;hb=HEAD"&gt;partial-sums&lt;/a&gt; benchmarks:&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Before&lt;/th&gt;&lt;th&gt;After&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;struct-arrays&lt;/td&gt;&lt;td&gt;1483ms&lt;/td&gt;&lt;td&gt;749ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;partial-sums&lt;/td&gt;&lt;td&gt;1233ms&lt;/td&gt;&lt;td&gt;938ms&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;Here is the &lt;a href="http://paste.factorcode.org/paste?id=848"&gt;disassembly for benchmark.struct-arrays&lt;/a&gt; with this optimization performed, and &lt;a 
href="http://github.com/slavapestov/factor/commit/f6a836d1e98b3ed7f9564f5fc2d59c028ba4c8d6"&gt;this is the patch&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;For what it's worth, Java HotSpot runs the partial-sums benchmark in 1293ms. Hopefully HotSpot's trig function call performance will receive some attention at some point.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3402492776986795161?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3402492776986795161/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3402492776986795161' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3402492776986795161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3402492776986795161'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/08/struct-arrays-benchmark-revisited.html' title='Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-5777691968367622335</id><published>2009-08-28T05:21:00.008-04:00</published><updated>2009-08-28T16:04:00.939-04:00</updated><title type='text'>Performance comparison between Factor and Java on a contrived benchmark</title><content type='html'>Recently, &lt;a href="http://duriansoftware.com/joe"&gt;Joe 
Groff&lt;/a&gt; has been working on &lt;a href="http://docs.factorcode.org/content/article-classes.struct.html"&gt;struct classes&lt;/a&gt;, with the aim of completely replacing Factor's existing support for &lt;a href="http://docs.factorcode.org/content/article-c-structs.html"&gt;C structure types.&lt;/a&gt; The interesting thing about Joe's struct classes is that unlike the clunky old FFI struct support which operated on byte arrays, struct classes really are Factor classes; instances look and feel like Factor objects, there's literal syntax and prettyprinter support, and structure fields are accessed using accessor words that look exactly like Factor tuple slot accessors.&lt;br /&gt;&lt;br /&gt;So whereas the old C structure support didn't have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.&lt;br /&gt;&lt;br /&gt;What makes struct classes even more useful in performance-sensitive code is the related &lt;a href="http://docs.factorcode.org/content/article-struct-arrays.html"&gt;struct-arrays&lt;/a&gt; vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a &lt;a href="http://docs.factorcode.org/content/article-sequence-protocol.html"&gt;Factor sequence&lt;/a&gt;. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, ...) 
to be used on scalar data with a very efficient in-memory representation.&lt;br /&gt;&lt;br /&gt;It is interesting that Factor's core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.&lt;br /&gt;&lt;br /&gt;The performance difference between, say, an array of pointers to boxed floats, and a raw array of floats, cannot be overstated. Further levels of structure make the difference even more dramatic: an array of pointers to objects where each one has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter case, all data can be stored in one contiguous memory block, that appears as a simple byte array to the garbage collector.&lt;br /&gt;&lt;br /&gt;Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation of being very fast as far as managed, GC'd languages go, suffers here. While Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object), it does not offer arrays of objects where the objects are stored directly in the array. This negatively impacts performance, as you will see.&lt;br /&gt;&lt;br /&gt;There is a buzzword that describes this approach of dealing with data: value semantics. 
If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.&lt;br /&gt;&lt;br /&gt;I decided to whip up a simple benchmark. Here is what it does:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It works on points, which are triplets of single-precision floats, (x,y,z)&lt;/li&gt;&lt;li&gt;First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).&lt;/li&gt;&lt;li&gt;Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).&lt;/li&gt;&lt;li&gt;Finally, the maximum x, y, and z components are found, for all points, and this is printed out.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Note that in-place operations, and re-using temporary objects, are allowed.&lt;br /&gt;&lt;br /&gt;Here is the code:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://paste.factorcode.org/paste?id=837"&gt;Factor&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://paste.factorcode.org/paste?id=838"&gt;Java&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The Factor version is both shorter, and has more blank lines.&lt;br /&gt;&lt;br /&gt;Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the &lt;code&gt;time&lt;/code&gt; word, as follows:&lt;br /&gt;&lt;pre&gt;[ "benchmark.struct-arrays" run ] time&lt;/pre&gt;&lt;br /&gt;Run it a few times so that the data heap can grow to a stable size.&lt;br /&gt;&lt;br /&gt;The Java code is self-contained; run it 
with&lt;br /&gt;&lt;pre&gt;java -Xms512m -server silly&lt;/pre&gt;&lt;br /&gt;The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to 'warm up'.&lt;br /&gt;&lt;br /&gt;The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) runs the benchmark in 4.6 seconds, and the Factor version ran in 2.2 seconds using a Factor build from earlier this evening. I made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Added intrinsics for &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; so that when they are applied to values known to be floating point at compile-time, the SSE2 instructions MINSD and MAXSD are used.&lt;/li&gt;&lt;li&gt;Added some optimizations to unbox intermediate &lt;code&gt;&amp;lt;displaced-alien&gt;&lt;/code&gt; values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored 'in the middle' of a byte array.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. 
It takes the following optimizations to get good performance out of it:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;All words declared &lt;code&gt;inline&lt;/code&gt; are inlined, all higher-order functions are inlined.&lt;/li&gt;&lt;li&gt;Literal quotations (written in square brackets) are inlined at their call sites.&lt;/li&gt;&lt;li&gt;Macros, such as &lt;code&gt;sum-outputs&lt;/code&gt;, are expanded.&lt;/li&gt;&lt;li&gt;Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases, these operations, such as adding floats, map directly to machine instructions&lt;/li&gt;&lt;li&gt;The incrementing counter for initializing points is converted into a fixed-precision integer because interval and def-use analysis determine it is safe to do so&lt;/li&gt;&lt;li&gt;Escape analysis determines that various temporary objects can be stack-allocated:&lt;ul&gt;&lt;li&gt;closures created by various combinators, such as &lt;code&gt;tri-curry&lt;/code&gt;, &lt;code&gt;each&lt;/code&gt;, and so on&lt;/li&gt;&lt;li&gt;the struct array object created with &lt;code&gt;&amp;lt;struct-array&gt;&lt;/code&gt;&lt;/li&gt;&lt;li&gt;the struct class instances created inside the many calls to &lt;code&gt;each&lt;/code&gt;&lt;/li&gt;&lt;li&gt;the struct created at the end to store the maximum value&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all&lt;/li&gt;&lt;li&gt;Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers&lt;/li&gt;&lt;li&gt;Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating point registers, except for those that live across the sin/cos calls&lt;/li&gt;&lt;li&gt;While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 
instruction&lt;/li&gt;&lt;li&gt;Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values&lt;/li&gt;&lt;li&gt;GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made, and no registers need to be saved and restored&lt;/li&gt;&lt;li&gt;The coalescing pass eliminates unnecessary register to register copy instructions that arise from SSA form, as well as ensuring that in most cases, the result of an arithmetic instruction is the same as the first operand; this helps x86 instruction selection&lt;/li&gt;&lt;li&gt;The register allocator assigns virtual registers to real CPU registers; in this case there is no spilling at all on x86-64&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So what does this say about language expressiveness? Any dynamic language that a) offers the same set of metaprogramming tools as Factor b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. 
I invite you to implement this benchmark in your favorite language and share your timings.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-5777691968367622335?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/5777691968367622335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=5777691968367622335' title='34 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5777691968367622335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/5777691968367622335'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/08/performance-comparison-between-factor.html' title='Performance comparison between Factor and Java on a contrived benchmark'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>34</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3898226712409523579</id><published>2009-08-24T21:59:00.006-04:00</published><updated>2009-08-24T22:43:07.820-04:00</updated><title type='text'>New tool for locating external resource leaks</title><content type='html'>While having garbage collection solves the problem of manually deallocating a block of memory when its lifetime ends, it doesn't help with external resources, such as sockets and file handles, at all. And even in a GC'd language, memory has to be managed manually sometimes, for example when passing data to and from a C library using FFI. 
So Factor has had a &lt;a href="http://factor-language.blogspot.com/2008/01/generic-resource-disposal.html"&gt;generic disposal&lt;/a&gt; protocol, as well as &lt;a href="http://code-factor.blogspot.com/2007/09/destructors-in-factor.html"&gt;destructors&lt;/a&gt;, for quite some time. What was missing was tooling support.&lt;br /&gt;&lt;br /&gt;I wanted to build a tool that would help me debug code that was leaking external resources by forgetting to dispose them. Thankfully, this doesn't come up often; as C++ programmers using RAII like to note, scoped destructors solve resource management in 90% of all cases, and Factor's resource management combinators are even more flexible. However, sometimes an external resource can have a complex lifetime, because of caching, pooling, and other advanced idioms. In these cases, having tools to help track leaks down can really help.&lt;br /&gt;&lt;br /&gt; I took inspiration from Doug Coleman's &lt;a href="http://code-factor.blogspot.com/2007/08/managed-malloc-and-free.html"&gt;managed malloc and free&lt;/a&gt; implementation. Whereas some languages use finalizers to ensure that an external resource gets cleaned up if you forget to dispose of it explicitly, I take the opposite approach; all active disposable objects are stored in a central registry, explicitly preventing the GC from cleaning them up.&lt;br /&gt;&lt;br /&gt;The new &lt;code&gt;tools.destructors&lt;/code&gt; vocabulary introduces two words, &lt;code&gt;disposables.&lt;/code&gt; and &lt;code&gt;leaks&lt;/code&gt;. 
The first word prints a tally of all active resources:&lt;br /&gt;&lt;img src="http://factorcode.org/disposables-list.png"&gt;&lt;br /&gt;Clicking 'list instances' next to a disposable class opens an inspector window with all active instances of this resource; for example, here we can list all file descriptors that Factor has open right now:&lt;br /&gt;&lt;img src="http://factorcode.org/disposable-instances-list.png"&gt;&lt;br /&gt;You can even dispose of the resource right there, as if you had called the &lt;code&gt;dispose&lt;/code&gt; word on it:&lt;br /&gt;&lt;img src="http://factorcode.org/disposable-menu.png"&gt;&lt;br /&gt;The &lt;code&gt;leaks&lt;/code&gt; combinator compares active resources before and after a quotation runs, and lists any that were not disposed of. For example, in the following, I construct a &lt;code&gt;file-reader&lt;/code&gt;, read a line, and never dispose of it:&lt;br /&gt;&lt;img src="http://factorcode.org/leaks-bad.png"&gt;&lt;br /&gt;Notice how a file descriptor, together with a few associated resources, was leaked as a result of this.&lt;br /&gt;&lt;br /&gt;If I wanted to, I could click on 'show instances' and dispose of the &lt;code&gt;input-port&lt;/code&gt; in the inspector; this would dispose of the other two associated objects as well. Also, opening the inspector on a disposable resource that was allocated inside a &lt;code&gt;leaks&lt;/code&gt; call will display a stack trace, showing where in the code the resource was allocated. This can help pinpoint the offending code.&lt;br /&gt;&lt;br /&gt;Of course, the correct way to read a line from a file in Factor is to use a combinator which cleans up the stream for you. 
In the following example, the resource leak has been fixed:&lt;br /&gt;&lt;img src="http://factorcode.org/leaks-good.png"&gt;&lt;br /&gt;To define a new disposable resource, simply create a tuple class that subclasses &lt;code&gt;disposable&lt;/code&gt;, make sure to construct it with &lt;code&gt;new-disposable&lt;/code&gt;, and override the &lt;code&gt;dispose*&lt;/code&gt; method.&lt;br /&gt;&lt;br /&gt;Here is an example that manages a limited pool of "X"s, of which there are 100 total:&lt;br /&gt;&lt;pre&gt;SYMBOL: xs&lt;br /&gt;&lt;br /&gt;100 &gt;vector xs set-global&lt;br /&gt;&lt;br /&gt;: get-x ( -- x ) xs get pop ;&lt;br /&gt;: return-x ( x -- ) xs get push ;&lt;br /&gt;&lt;br /&gt;TUPLE: my-disposable-resource &lt; disposable x ;&lt;br /&gt;&lt;br /&gt;: &amp;lt;my-disposable-resource&gt; ( -- disposable )&lt;br /&gt;    my-disposable-resource new-disposable get-x &gt;&gt;x ;&lt;br /&gt;&lt;br /&gt;M: my-disposable-resource dispose* x&gt;&gt; return-x ;&lt;/pre&gt;&lt;br /&gt;The disposable resource machinery is built with very little code, and it is used throughout Factor already.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3898226712409523579?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3898226712409523579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3898226712409523579' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3898226712409523579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3898226712409523579'/><link rel='alternate' type='text/html' 
href='http://factor-language.blogspot.com/2009/08/new-tool-for-locating-external-resource.html' title='New tool for locating external resource leaks'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3120900726542440892</id><published>2009-08-24T18:40:00.005-04:00</published><updated>2009-08-24T19:04:49.207-04:00</updated><title type='text'>Status of the PowerPC port, and Snow Leopard's imminent release</title><content type='html'>Factor has had a PowerPC port since some time in 2004, when I got it running on my old G3 iMac, and I continued working on it after I switched to a Power Mac G5 as my primary workstation. However, since then, Apple decided to go with Intel chips, and I sold my G5, so I no longer own any PowerPC gear. As a result, the Factor PowerPC port has been neglected for a while now. Right now, the code in the master branch of the git repository bootstraps on PowerPC and basic things seem to work, but there are stability issues (it doesn't manage to finish a full continuous integration run). While I'll probably fix this particular issue and ensure that up to date binaries get uploaded, I can't help but wonder, now that OS X 10.6 is around the corner (without PowerPC support), is it worth my time to keep the PowerPC port going?&lt;br /&gt;&lt;br /&gt;In addition to stability issues, the PowerPC port lags behind a bit as far as code quality goes. It doesn't do instruction scheduling, nor does it make use of some unique PowerPC features that would help improve performance (multiple condition registers, etc). 
AltiVec isn't used at all.&lt;br /&gt;&lt;br /&gt;I have a few questions for the community:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Does anyone still care about PowerPC at all? I might be out of the loop here; is it about as relevant nowadays as DEC Alpha or PA-RISC? Other than the game consoles and IBM's big iron gear, who uses it?&lt;/li&gt;&lt;li&gt;Is there any interest at all in a Linux/PowerPC port of Factor?&lt;br /&gt;I guess over time, people who still own older PowerPC-based Macs are going to switch to Linux, as Apple gradually stops issuing security updates and so on for PowerPC systems.&lt;/li&gt;&lt;li&gt;Is anyone interested in taking over development of Factor's PowerPC support? It's only 1600 lines of code in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/ppc;hb=HEAD"&gt;basis/cpu/ppc&lt;/a&gt;, the assembler is simple to use and implemented declaratively, and the compiler's backend interface is well abstracted and consists of a few dozen small assembly templates. A lot of contributors have more than enough Factor experience to take this one on, I think. If you know some assembly language already, compiler backends are really not that complex in Factor; it just takes time to test them thoroughly.&lt;/li&gt;&lt;li&gt;Should I forget about PowerPC altogether, and spend the time that I would spend on it on reviving the Factor/ARM port instead? 
Last time I messed around with ARM (a few years ago), contemporary devices were pretty short on RAM; nowadays 256 MB of RAM is starting to become common, and Factor would run much better.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Any comments on this matter are appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3120900726542440892?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3120900726542440892/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3120900726542440892' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3120900726542440892'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3120900726542440892'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/08/status-of-powerpc-port-and-snow.html' title='Status of the PowerPC port, and Snow Leopard&apos;s imminent release'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-1867720023512508933</id><published>2009-08-10T17:21:00.009-04:00</published><updated>2009-08-10T21:23:28.087-04:00</updated><title type='text'>Global float unboxing and some other optimizations</title><content type='html'>In compiler implementation, "global" optimizations are those that apply to an entire procedure at a time, in contrast to "local" optimizations, which operate on individual basic 
blocks only. After some work, float unboxing is the latest optimization to be done globally in Factor.&lt;br /&gt;&lt;br /&gt;Factor's float unboxing optimization used to be part of &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;local value numbering&lt;/a&gt;. Originally I planned on implementing global value numbering and trying to do float unboxing there, but since then I discovered a better way. I also improved the SSA coalescing pass I talked about &lt;a href="http://factor-language.blogspot.com/2009/07/dataflow-analysis-computing-dominance.html"&gt;last time&lt;/a&gt;, and implemented some optimizations which help compile the code in the &lt;a href="http://docs.factorcode.org/content/vocab-math.vectors.html"&gt;math.vectors&lt;/a&gt; vocabulary more efficiently.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Global float unboxing&lt;/h3&gt;&lt;br /&gt;Instead of modeling float unboxing as an algebraic simplification problem (where &lt;code&gt;unbox(box(x))&lt;/code&gt; can be rewritten as &lt;code&gt;x&lt;/code&gt;) I now model it as a representation problem, i.e., the optimizer makes a global decision, for every virtual register, whether to store it as a tagged pointer or an unboxed float.&lt;br /&gt;&lt;br /&gt;When the low-level IR is built, no boxing or unboxing operations are inserted, and the compiler assumes all floating point values will end up unboxed. The new &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/representations;hb=HEAD"&gt;compiler.cfg.representations&lt;/a&gt; pass is responsible for inserting just enough boxing and unboxing operations to make the resulting code valid. This pass consists of several steps.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Representations and register classes&lt;/h4&gt;&lt;br /&gt;A representation is like a low-level data type associated with a virtual register. 
At this stage, possible representations are tagged pointers, single-precision floats, and double-precision floats. Note that fixnums (which are stored directly inside a pointer, rather than being heap allocated) are considered tagged pointers, and that single-precision floats are only used by the FFI. In the future, I will also have a representation type for untagged integers (so that Factor code can perform arithmetic on 32- and 64-bit numbers without boxing them) and various SSE vector types.&lt;br /&gt;&lt;br /&gt;A register class is a set of machine registers that instructions can operate on. Every representation has a register class, and values in this representation must be stored in registers of that class. Multiple representations can share a register class; for example, single- and double-precision floats both go into floating point registers. The register allocator deals with register classes.&lt;br /&gt;&lt;br /&gt;When the low-level IR is built, virtual registers do not have representations assigned to them. Instead, this mapping is established later, by the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/representations;hb=HEAD"&gt;compiler.cfg.representations&lt;/a&gt; pass.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;First step: computing possible representations for each virtual register&lt;/h4&gt;&lt;br /&gt;The first step in deciding what representation to use for a virtual register is to take stock of the possibilities. 
The &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/representations/preferred;hb=HEAD"&gt;compiler.cfg.representations.preferred&lt;/a&gt; vocabulary defines words which output, for each instruction, the preferred representation of the input values, and the preferred representation of the output value, if any.&lt;br /&gt;&lt;br /&gt;For example, the &lt;code&gt;##slot&lt;/code&gt; instruction, which accesses an object's slot, prefers that its inputs and outputs be tagged pointers. On the other hand, &lt;code&gt;##add-float&lt;/code&gt; prefers that its inputs and outputs be unboxed floats (for otherwise, boxing is required), and &lt;code&gt;##integer&gt;float&lt;/code&gt; takes an integer but outputs an unboxed float.&lt;br /&gt; &lt;br /&gt;This information is collected into a single mapping, from virtual registers to sequences of representations.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Second step: computing costs for different representations&lt;/h4&gt;&lt;br /&gt;The next step is to compute the cost of keeping each virtual register in every possible representation for that register. The compiler uses a simple cost heuristic. The cost of keeping a register in a given representation &lt;code&gt;R&lt;/code&gt; is obtained by iterating over all instructions that use that register but expect to receive the value in a representation other than &lt;code&gt;R&lt;/code&gt;, and summing the loop nesting depths of those instructions. Finally, if the representation &lt;code&gt;R&lt;/code&gt; differs from the output representation of the instruction that computed the register, the loop nesting depth of that instruction is added to the tally. This gives an approximation of the overhead of boxing and unboxing the value.&lt;br /&gt;&lt;br /&gt;To calculate the loop nesting depth of an instruction, I use the standard natural loop detection algorithm. 
This is implemented in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/loop-detection;hb=HEAD"&gt;compiler.cfg.loop-detection&lt;/a&gt;. The details can be found in any moderately advanced compiler textbook; for each basic block, it identifies a set of loops this basic block belongs to (this set may be empty). A loop consists of a header, a set of one or more end blocks, and a set of blocks making up this loop. Multiple loops can share a header, but each loop can only have a single header because all control flow graphs that the low-level optimizer constructs are reducible (at least I think so).&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Third step: representation selection&lt;/h4&gt;&lt;br /&gt;Once costs have been computed, it is easy to assign to each virtual register the representation that has the least cost. There is one subtlety here though; if an instruction outputs a tagged pointer, no other representation may be used for that virtual register. For example, in this code, it would not be sound to pick an unboxed float representation for the input parameter, since in one branch, it is used as a fixnum, and trying to perform a float unboxing on a fixnum will just crash:&lt;br /&gt;&lt;pre&gt;: silly-word ( x -- y )&lt;br /&gt;    {&lt;br /&gt;        { [ dup fixnum? ] [ 1 + ] }&lt;br /&gt;        { [ dup float? ] [ 3 * ] }&lt;br /&gt;    } cond ;&lt;/pre&gt;&lt;br /&gt;The problem here is that in low-level IR, the &lt;code&gt;##peek&lt;/code&gt; instruction which loads the value from the data stack outputs a tagged value, and &lt;i&gt;a priori&lt;/i&gt; the compiler cannot assume it's going to be a boxed float that can be safely unboxed -- only the output of a float-producing instruction can be unboxed. A more complicated unboxing algorithm based on control-dependence was suggested to me by &lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg&lt;/a&gt;. 
However, because the low-level IR is single-assignment, the potential overhead from the current unboxing scheme is not so great, and so I'm not worried about it right now.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Fourth step: inserting boxing and unboxing instructions&lt;/h4&gt;&lt;br /&gt;Once the compiler has determined the representation to use for each virtual register, it must now iterate over all instructions again, this time, inserting conversions whenever a virtual register is used with a representation other than the one that was chosen for it. This pass introduces new temporaries, and renames virtual registers used and defined in instructions, in order to preserve single assignment. Other than that, it is relatively straightforward.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Live-range coalescing revisited&lt;/h3&gt;&lt;br /&gt;Last time I mentioned that I had implemented a rather complex algorithm for converting out of SSA form and doing copy coalescing at the same time. After having some problems getting the algorithm working in certain corner cases, I decided to scrap it and implement something simpler instead. My new SSA coalescing algorithm uses the classical interference graph approach, except it doesn't actually have to build the interference graph, since interference testing on SSA form can be done in O(1) time.&lt;br /&gt;&lt;br /&gt;The register allocator used to do its own coalescing, to eliminate copies inserted by the two-operand conversion pass. The two-operand pass converts&lt;br /&gt;&lt;pre&gt;x = y + z&lt;/pre&gt;&lt;br /&gt;into&lt;br /&gt;&lt;pre&gt;x = y; x = x + z&lt;/pre&gt;&lt;br /&gt;(and similarly for all other arithmetic operations). This makes arithmetic operations fit the x86 instruction encoding scheme, where the destination register is also the first register. Implementing copy coalescing twice, in two different ways, seemed inelegant to me. 
The big breakthrough came when I got my SSA coalescing algorithm working on a relaxed SSA form, where values can be assigned more than once, but only within the basic block where they are defined. This means that the two-operand conversion pass can run before SSA coalescing, and the register allocator does not need to perform coalescing at all.&lt;br /&gt;&lt;br /&gt;The new coalescing algorithm is not noticeably slower than the old one (it still benefits from not having to construct an interference graph), the code is much simpler, and the register allocator's copy coalescing code is gone. The compiler feels cleaner as a result.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improved tuple unboxing&lt;/h3&gt;&lt;br /&gt;I implemented &lt;a href="http://factor-language.blogspot.com/2008/08/algorithm-for-escape-analysis.html"&gt;escape analysis for tuple unboxing&lt;/a&gt; last year and recently came up with a way to improve it. Previously, only freshly-allocated tuples could be unboxed. I realized that if an input parameter to a word is known to be an instance of an immutable tuple class, and the input parameter does not escape (i.e., it is not passed to a subroutine call or returned), then it can be unboxed, too. Whereas unboxing a freshly-allocated tuple involves deleting the allocation call, and replacing slot access with references to the inputs to the allocation call, unboxing an input parameter is a bit trickier. Actual code has to be inserted at the beginning of the word, which unpacks the tuple. This optimization is only a win if the same slot is accessed more than once within a word; in particular, it's a win if the slot access occurs inside a loop. This situation came up with &lt;a href="http://docs.factorcode.org/content/article-specialized-arrays.html"&gt;specialized arrays&lt;/a&gt;; a specialized array is a tuple with two slots: a byte array and a length. 
Accessing the nth element of a specialized array involves an indirection, because the byte array has to be retrieved. By unboxing the specialized array before a loop, a slot access is eliminated from the loop. This gives the high-level optimizer a form of loop-invariant code motion (real loop-invariant code motion might appear in the low-level optimizer at some point, too).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Vector operations on specialized arrays&lt;/h3&gt;&lt;br /&gt;I used to use &lt;a href="http://docs.factorcode.org/content/article-hints.html"&gt;hints&lt;/a&gt; to make the &lt;a href="http://docs.factorcode.org/content/vocab-math.vectors.html"&gt;math.vectors&lt;/a&gt; vocabulary faster. While vector words could be used on any sequence type, hints would automatically compile a specialized version of each vector word for the &lt;code&gt;double-array&lt;/code&gt; type. This gave a nice speedup on benchmarks which performed vector operations on arrays of double-precision floats, but it didn't help at all for vector operations on other specialized arrays, and there was still at least one runtime dispatch per vector operation, to determine if the generic or hinted version should be used.&lt;br /&gt;&lt;br /&gt;While hints have their place, I don't use them for vector words anymore. Instead, I cooked up some special compiler extensions for them (easier than it sounds; the compiler is pretty extensible, at least the pass that I needed to extend for this). Now, every specialized array type compiles a specialized set of vector words, and if the input types of a vector word are known at compile time, the right specialized version is used instead. 
Of course, there's no actual code duplication here; the code is generated at runtime.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Benchmark results&lt;/h3&gt;&lt;br /&gt;The above improvements, as well as the changes I outlined in my &lt;a href="http://factor-language.blogspot.com/2009/07/dataflow-analysis-computing-dominance.html"&gt;previous post&lt;/a&gt;, give a nice speedup since &lt;a href="http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html"&gt;the last time I did some benchmarks&lt;/a&gt;.&lt;br /&gt;&lt;table border="1"&gt; &lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;May 31&lt;/th&gt;&lt;th&gt;July 17&lt;/th&gt;&lt;th&gt;Aug 10&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.backtrack          &lt;/td&gt;&lt;td&gt;1.767561         &lt;/td&gt;&lt;td&gt;1.330641           &lt;/td&gt;&lt;td&gt;1.111011               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.base64             &lt;/td&gt;&lt;td&gt;1.997951         &lt;/td&gt;&lt;td&gt;1.738677           &lt;/td&gt;&lt;td&gt;1.536447               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.beust1             &lt;/td&gt;&lt;td&gt;2.765257         &lt;/td&gt;&lt;td&gt;2.461088           &lt;/td&gt;&lt;td&gt;2.360084               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.beust2             &lt;/td&gt;&lt;td&gt;3.584958         &lt;/td&gt;&lt;td&gt;1.694427           &lt;/td&gt;&lt;td&gt;1.691499               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.binary-search      &lt;/td&gt;&lt;td&gt;1.55002          &lt;/td&gt;&lt;td&gt;1.574595           &lt;/td&gt;&lt;td&gt;1.585475               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.binary-trees       &lt;/td&gt;&lt;td&gt;1.845798         &lt;/td&gt;&lt;td&gt;1.733355           &lt;/td&gt;&lt;td&gt;1.771849               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.bootstrap1         &lt;/td&gt;&lt;td&gt;10.860492        &lt;/td&gt;&lt;td&gt;11.447687          
&lt;/td&gt;&lt;td&gt;9.962971               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dawes              &lt;/td&gt;&lt;td&gt;0.229999         &lt;/td&gt;&lt;td&gt;0.161726           &lt;/td&gt;&lt;td&gt;0.104693               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch1          &lt;/td&gt;&lt;td&gt;2.015653         &lt;/td&gt;&lt;td&gt;2.119268           &lt;/td&gt;&lt;td&gt;2.252401               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch2          &lt;/td&gt;&lt;td&gt;1.817941         &lt;/td&gt;&lt;td&gt;1.216618           &lt;/td&gt;&lt;td&gt;1.629185               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch3          &lt;/td&gt;&lt;td&gt;2.568987         &lt;/td&gt;&lt;td&gt;1.899128           &lt;/td&gt;&lt;td&gt;2.692071               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch4          &lt;/td&gt;&lt;td&gt;2.319587         &lt;/td&gt;&lt;td&gt;2.032182           &lt;/td&gt;&lt;td&gt;2.073563               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch5          &lt;/td&gt;&lt;td&gt;2.346744         &lt;/td&gt;&lt;td&gt;1.614045           &lt;/td&gt;&lt;td&gt;1.369721               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-0       &lt;/td&gt;&lt;td&gt;0.146716         &lt;/td&gt;&lt;td&gt;0.12589            &lt;/td&gt;&lt;td&gt;0.12608                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-1       &lt;/td&gt;&lt;td&gt;0.430314         &lt;/td&gt;&lt;td&gt;0.342426           &lt;/td&gt;&lt;td&gt;0.351241               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-2       &lt;/td&gt;&lt;td&gt;0.429012         &lt;/td&gt;&lt;td&gt;0.342097           &lt;/td&gt;&lt;td&gt;0.350182               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.euler150           &lt;/td&gt;&lt;td&gt;16.901451        &lt;/td&gt;&lt;td&gt;15.288867          &lt;/td&gt;&lt;td&gt;13.046828          
    &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.euler186           &lt;/td&gt;&lt;td&gt;8.805435         &lt;/td&gt;&lt;td&gt;7.920478           &lt;/td&gt;&lt;td&gt;7.997483               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.fannkuch           &lt;/td&gt;&lt;td&gt;3.202698         &lt;/td&gt;&lt;td&gt;2.964037           &lt;/td&gt;&lt;td&gt;2.859117               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.fasta              &lt;/td&gt;&lt;td&gt;5.52608          &lt;/td&gt;&lt;td&gt;4.934112           &lt;/td&gt;&lt;td&gt;5.135706               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc0                &lt;/td&gt;&lt;td&gt;2.15066          &lt;/td&gt;&lt;td&gt;1.993158           &lt;/td&gt;&lt;td&gt;2.03638                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc1                &lt;/td&gt;&lt;td&gt;4.984841         &lt;/td&gt;&lt;td&gt;4.961272           &lt;/td&gt;&lt;td&gt;5.075487               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc2                &lt;/td&gt;&lt;td&gt;3.327706         &lt;/td&gt;&lt;td&gt;3.265462           &lt;/td&gt;&lt;td&gt;3.376618               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.iteration          &lt;/td&gt;&lt;td&gt;3.736756         &lt;/td&gt;&lt;td&gt;3.30438            &lt;/td&gt;&lt;td&gt;3.22603                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.javascript         &lt;/td&gt;&lt;td&gt;9.79904          &lt;/td&gt;&lt;td&gt;9.164517           &lt;/td&gt;&lt;td&gt;9.000165               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.knucleotide        &lt;/td&gt;&lt;td&gt;0.282296         &lt;/td&gt;&lt;td&gt;0.251879           &lt;/td&gt;&lt;td&gt;0.231547               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.mandel             &lt;/td&gt;&lt;td&gt;0.125304         &lt;/td&gt;&lt;td&gt;0.123945           &lt;/td&gt;&lt;td&gt;0.123321               &lt;/td&gt;  &lt;/tr&gt; 
&lt;tr&gt;&lt;td&gt;benchmark.md5                &lt;/td&gt;&lt;td&gt;0.946516         &lt;/td&gt;&lt;td&gt;0.85062            &lt;/td&gt;&lt;td&gt;0.84516                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nbody              &lt;/td&gt;&lt;td&gt;3.982774         &lt;/td&gt;&lt;td&gt;3.349595           &lt;/td&gt;&lt;td&gt;2.204085               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nested-empty-loop-1&lt;/td&gt;&lt;td&gt;0.116351         &lt;/td&gt;&lt;td&gt;0.135936           &lt;/td&gt;&lt;td&gt;0.053609               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nested-empty-loop-2&lt;/td&gt;&lt;td&gt;0.692668         &lt;/td&gt;&lt;td&gt;0.438932           &lt;/td&gt;&lt;td&gt;0.390032               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve             &lt;/td&gt;&lt;td&gt;0.714772         &lt;/td&gt;&lt;td&gt;0.698262           &lt;/td&gt;&lt;td&gt;0.694443               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve-bits        &lt;/td&gt;&lt;td&gt;1.451828         &lt;/td&gt;&lt;td&gt;0.907247           &lt;/td&gt;&lt;td&gt;0.727394               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve-bytes       &lt;/td&gt;&lt;td&gt;0.312481         &lt;/td&gt;&lt;td&gt;0.300053           &lt;/td&gt;&lt;td&gt;0.225218               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.partial-sums       &lt;/td&gt;&lt;td&gt;1.205072         &lt;/td&gt;&lt;td&gt;1.221245           &lt;/td&gt;&lt;td&gt;1.152942               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.random             &lt;/td&gt;&lt;td&gt;2.574773         &lt;/td&gt;&lt;td&gt;2.706893           &lt;/td&gt;&lt;td&gt;2.403702               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.raytracer          &lt;/td&gt;&lt;td&gt;3.481714         &lt;/td&gt;&lt;td&gt;2.914116           &lt;/td&gt;&lt;td&gt;2.428384               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.recursive 
         &lt;/td&gt;&lt;td&gt;5.964279         &lt;/td&gt;&lt;td&gt;3.215277           &lt;/td&gt;&lt;td&gt;3.178855               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.regex-dna          &lt;/td&gt;&lt;td&gt;0.132406         &lt;/td&gt;&lt;td&gt;0.093095           &lt;/td&gt;&lt;td&gt;0.086194               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.reverse-complement &lt;/td&gt;&lt;td&gt;3.811822         &lt;/td&gt;&lt;td&gt;3.257535           &lt;/td&gt;&lt;td&gt;3.003156               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.ring               &lt;/td&gt;&lt;td&gt;1.756481         &lt;/td&gt;&lt;td&gt;1.79823            &lt;/td&gt;&lt;td&gt;1.696271               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sha1               &lt;/td&gt;&lt;td&gt;2.267648         &lt;/td&gt;&lt;td&gt;1.473887           &lt;/td&gt;&lt;td&gt;1.43003                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sockets            &lt;/td&gt;&lt;td&gt;8.79428          &lt;/td&gt;&lt;td&gt;8.783398           &lt;/td&gt;&lt;td&gt;8.643011               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sort               &lt;/td&gt;&lt;td&gt;0.421628         &lt;/td&gt;&lt;td&gt;0.363383           &lt;/td&gt;&lt;td&gt;0.367602               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.spectral-norm      &lt;/td&gt;&lt;td&gt;3.830249         &lt;/td&gt;&lt;td&gt;4.036353           &lt;/td&gt;&lt;td&gt;3.277558               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.stack              &lt;/td&gt;&lt;td&gt;2.086594         &lt;/td&gt;&lt;td&gt;1.014408           &lt;/td&gt;&lt;td&gt;0.91959                &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sum-file           &lt;/td&gt;&lt;td&gt;0.528061         &lt;/td&gt;&lt;td&gt;0.422194           &lt;/td&gt;&lt;td&gt;0.383183               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.tuple-arrays       &lt;/td&gt;&lt;td&gt;0.127335  
       &lt;/td&gt;&lt;td&gt;0.103421           &lt;/td&gt;&lt;td&gt;0.089082               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck1         &lt;/td&gt;&lt;td&gt;0.876559         &lt;/td&gt;&lt;td&gt;0.672344           &lt;/td&gt;&lt;td&gt;0.744205               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck2         &lt;/td&gt;&lt;td&gt;0.878561         &lt;/td&gt;&lt;td&gt;0.671624           &lt;/td&gt;&lt;td&gt;0.770056               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck3         &lt;/td&gt;&lt;td&gt;0.86596          &lt;/td&gt;&lt;td&gt;0.670099           &lt;/td&gt;&lt;td&gt;0.699071               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.ui-panes           &lt;/td&gt;&lt;td&gt;0.426701         &lt;/td&gt;&lt;td&gt;0.372301           &lt;/td&gt;&lt;td&gt;0.369152               &lt;/td&gt;  &lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.xml                &lt;/td&gt;&lt;td&gt;2.351934         &lt;/td&gt;&lt;td&gt;2.187999           &lt;/td&gt;&lt;td&gt;1.630145               &lt;/td&gt;  &lt;/tr&gt; &lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-1867720023512508933?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/1867720023512508933/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=1867720023512508933' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1867720023512508933'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/1867720023512508933'/><link rel='alternate' type='text/html' 
href='http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html' title='Global float unboxing and some other optimizations'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-7198889069604652714</id><published>2009-07-30T03:24:00.007-04:00</published><updated>2009-07-30T06:14:48.079-04:00</updated><title type='text'>Dataflow analysis, computing dominance, and converting to and from SSA form</title><content type='html'>As I mentioned in my &lt;a href="http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html"&gt;previous post&lt;/a&gt;, I'm working on having the compiler backend take advantage of our &lt;a href="http://factor-language.blogspot.com/2009/07/improvements-to-factors-register.html"&gt;fancy new register allocator&lt;/a&gt; by making better use of registers than we do now.&lt;br /&gt;&lt;br /&gt;As with the previous post, there are lots of links to Factor code here. Also, since much of what I implemented comes from academic papers, I've linked to the relevant literature also.&lt;br /&gt;&lt;br /&gt;Over the course of the last year, Factor went from a rudimentary scheme where some intermediate values are placed in registers, to local register allocation for all temporaries in a basic block, to the current system where registers can remain live between basic blocks and values are only stored on the data stack when absolutely necessary; subroutine calls, returns, and points in the procedure from where the value will not be used again.&lt;br /&gt;&lt;br /&gt;Before I dive in to the technical details, here is a taste of what the new code generator can do. 
The following Factor word,&lt;br /&gt;&lt;pre&gt;: counted-loop-test ( -- ) 1000000000 [ ] times ;&lt;/pre&gt;&lt;br /&gt;compiles to the following x86-64 machine code:&lt;br /&gt;&lt;pre&gt;000000010c73b2a0: 48bd0050d6dc01000000  mov rbp, 0x1dcd65000&lt;br /&gt;000000010c73b2aa: 4831c0                xor rax, rax&lt;br /&gt;000000010c73b2ad: 4983c610              add r14, 0x10&lt;br /&gt;000000010c73b2b1: e904000000            jmp 0x10c73b2ba&lt;br /&gt;000000010c73b2b6: 4883c008              add rax, 0x8&lt;br /&gt;000000010c73b2ba: 4839e8                cmp rax, rbp&lt;br /&gt;000000010c73b2bd: 0f8cf3ffffff          jl dword 0x10c73b2b6&lt;br /&gt;000000010c73b2c3: 4983ee10              sub r14, 0x10&lt;br /&gt;000000010c73b2c7: c3                    ret &lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;General overview&lt;/h3&gt;&lt;br /&gt;The general structure of the low-level optimizer still remains, along with the optimizations it performs -- alias analysis, value numbering, dead code elimination, and so on. However, what has changed in a major way is how the low-level IR is constructed, and how it gets converted out of SSA. There are also two new abstractions used by the compiler: a dataflow analysis framework, and a meta-programming utility for renaming words.&lt;br /&gt;&lt;br /&gt;In the low-level IR, "peek" and "replace" instructions are used to read and write stack locations, making stack traffic explicit. Peek instructions output a virtual register, and replace instructions take a register as input. 
All other instructions operate on virtual registers, not stack locations.&lt;br /&gt;&lt;br /&gt;When building the &lt;a href="http://en.wikipedia.org/wiki/Control_flow_graph"&gt;control flow graph&lt;/a&gt;, the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/builder;hb=HEAD"&gt;compiler.cfg.builder&lt;/a&gt; vocabulary used to insert "peek" and "replace" instructions at every use of a stack location, and subsequent redundancy elimination passes would get rid of the ones that are deemed redundant -- where either the relevant stack location was in a register already (peek after peek or peek after replace), or because it would be overwritten before being read again (replace after replace).&lt;br /&gt;&lt;br /&gt;Now, the CFG builder doesn't insert peeks and replaces at all, and simply associates with each basic block a set of stack locations that it reads, and a set of stack locations that it writes. For each stack location, there is a globally unique virtual register which is used for it; instructions which need a stack location simply refer to that fixed virtual register (or assign to it). The last step of the CFG builder runs a dataflow analysis and actually inserts peek and replace instructions on the right edges in the CFG, mostly to ensure the invariant that values are saved to the stack between subroutine calls, that all values that are needed from the stack get loaded at some point, and that everything is eventually saved to the stack before the procedure returns. The inserted peeks and replaces reference the stack location's global fixed virtual register.&lt;br /&gt;&lt;br /&gt;One consequence of the above construction is that the output of the CFG builder is no longer in SSA form, as it used to be. 
However, the problem of converting applicative languages into SSA form is well-known, and so now I have an explicit SSA construction pass which runs after the CFG builder, before any other optimizations which operate on values. After optimizations, SSA form is eliminated by converting phi instructions into copies, at which point the result is passed on to the machine register allocator.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Dataflow analysis&lt;/h3&gt;&lt;br /&gt;The wikipedia page on &lt;a href="http://en.wikipedia.org/wiki/Data-flow_analysis"&gt;dataflow analysis&lt;/a&gt; gives a good overview of the topic. Simple dataflow analyses with a single direction of flow all look quite similar, and there are many ways to abstract out the duplication. Since my code to convert stack-based code into low-level IR requires four dataflow analysis passes, and register allocation and SSA construction perform liveness analysis, I needed to eliminate the 5x code duplication that would result from naive implementation.&lt;br /&gt;&lt;br /&gt;I went with an object-oriented approach, where a set of generic words are defined, together with a top-level driver word which takes an instance supporting these generic words. The generic words compute local sets, implement the join operation, and determine the direction of flow. The code is defined in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/dataflow-analysis;hb=HEAD"&gt;compiler.cfg.dataflow-analysis&lt;/a&gt; vocabulary. In addition to the generic words, I also define a couple of parsing words which create a new instance of the dataflow analysis class, together with some supporting boilerplate for looking up basic block's in and out sets after analysis has run. 
The actual content of the sets propagated depends on the analysis itself; liveness deals with virtual registers, and stack analysis deals with stack locations.&lt;br /&gt;&lt;br /&gt;Because of this framework, adding new analyses is very easy. &lt;a href="http://en.wikipedia.org/wiki/Liveness_analysis"&gt;Liveness analysis&lt;/a&gt;, implemented in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/liveness;hb=HEAD"&gt;compiler.cfg.liveness&lt;/a&gt; vocabulary, is only a few lines of code.&lt;br /&gt;&lt;br /&gt;The dataflow analysis framework does not take &lt;a href="http://en.wikipedia.org/wiki/Static_single_assignment_form"&gt;SSA phi instructions&lt;/a&gt; into account, because stack analysis operates on stack locations (which are not single assignment) and liveness analysis is performed in SSA construction, when no phi instructions have been added yet, as well as register allocation, after SSA destruction.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Renaming values&lt;/h3&gt;&lt;br /&gt;Multiple compiler passes -- SSA construction, copy propagation, SSA destruction, and register allocation -- need to traverse over instructions and rename values. The first three replace uses of virtual registers with virtual registers, and the latter replaces virtual registers with physical registers prior to machine code generation. To avoid duplicating code, I wrote a utility which takes a hashtable of value mappings, and renames all values used in an instruction. There was a generic word with a method for each instruction. However this approach was insufficient for SSA construction, where the renaming set cannot be precomputed and instead changes at every instruction. So a better abstraction would be one that takes a quotation to apply to each value. However, dynamically calling a quotation for every slot of every instruction would be expensive, and inlining the entire renaming combinator at every use would be impractical. 
Instead, I used Factor's "functors" feature to write a parsing word which generates renaming logic customized to a particular use. This is defined in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/renaming/functor;hb=HEAD"&gt;compiler.cfg.renaming.functor&lt;/a&gt; vocabulary, and one usage among many is in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/renaming;hb=HEAD"&gt;compiler.cfg.renaming&lt;/a&gt;, which implements the old behavior whereby a single hashtable of renamings was input. Here is how this utility is used in SSA construction, for instance:&lt;br /&gt;&lt;pre&gt;RENAMING: ssa-rename [ gen-name ] [ top-name ] [ ]&lt;/pre&gt;&lt;br /&gt;This defines words &lt;code&gt;ssa-rename-defs&lt;/code&gt;, &lt;code&gt;ssa-rename-uses&lt;/code&gt; and &lt;code&gt;ssa-rename-temps&lt;/code&gt; (the latter being a no-op). The first two apply &lt;code&gt;gen-name&lt;/code&gt; and &lt;code&gt;top-name&lt;/code&gt; to each value to compute the new value.&lt;br /&gt;&lt;br /&gt;The register allocator does something similar in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/linear-scan/assignment;hb=HEAD"&gt;compiler.cfg.linear-scan.assignment&lt;/a&gt;:&lt;br /&gt;&lt;pre&gt;RENAMING: assign [ vreg&gt;reg ] [ vreg&gt;reg ] [ vreg&gt;reg ]&lt;/pre&gt;&lt;br /&gt;Here, words named &lt;code&gt;assign-defs&lt;/code&gt;, &lt;code&gt;assign-uses&lt;/code&gt; and &lt;code&gt;assign-temps&lt;/code&gt; are defined, and they all perform the same operation on each instruction's defined values, used values and temporary values, respectively, by looking up the physical register assigned to each virtual register. 
This and other optimizations to linear scan significantly improved compile time, to the point where it has almost gone down to what it was before I started adding these fancy new optimizations.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Computing the dominator tree&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://en.wikipedia.org/wiki/Dominator_(graph_theory)"&gt;dominator tree&lt;/a&gt; is a useful concept for optimizations which need to take global control flow into account. At this point, I'm using it for SSA construction and destruction, but there are many other optimizations which depend on dominance information. The classical approach for computing dominance uses iterative dataflow analysis, but a better approach is given in the paper &lt;a href="http://www.cs.rice.edu/~keith/EMBED/dom.pdf"&gt;A Simple, Fast Dominance Algorithm&lt;/a&gt; by Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. I implemented the dominator tree computation described by the paper in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/dominance;hb=HEAD"&gt;compiler.cfg.dominance&lt;/a&gt; vocabulary. The dominator tree is a tree with basic blocks as nodes, with the entry block of the CFG as its root.&lt;br /&gt;&lt;br /&gt;There are three operations that need to be performed, and this influences the tree representation:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Getting a node's parent in the tree, often referred to as the &lt;i&gt;immediate dominator&lt;/i&gt; (sometimes, a convention is used where the immediate dominator of the root node is the root node itself).&lt;/li&gt;&lt;li&gt;Getting a node's children.&lt;/li&gt;&lt;li&gt;Testing if one node is an ancestor of another node (in this case we say that the first node &lt;i&gt;dominates&lt;/i&gt; the second).&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;The algorithm in the paper computes a mapping from basic blocks to basic blocks, which gives the immediate dominator of every basic block. 
This lets us solve #1 in constant time. Since many algorithms need to look up children, I invert the immediate dominator mapping and turn it into a multimap (which in Factor, we just represent as an assoc from keys to sequences of values; the &lt;a href="http://docs.factorcode.org/content/word-push-at%2cassocs.html"&gt;push-at&lt;/a&gt; utility word is helpful for constructing these). This gives us #2. For #3, the obvious approach is to walk up the tree from the second node, checking if the node at each step is the first node. However, this takes time proportional to the height of the tree, and the worst case is straight-line code with little control flow, where the dominator tree becomes a linked list. Another approach is to compute a hashed set of dominators for each node, which gives us constant time dominance checks, but the space usage is quadratic in the worst case so that's not ideal either.&lt;br /&gt;&lt;br /&gt;A much better trick can be found in the paper on the SSA destruction algorithm which I will talk about below. I believe this technique goes back to the 1970s and it is widely known, but I did not learn about it until now. First, you perform a depth-first traversal of the dominator tree, incrementing a counter at each step. The first time you visit a node (on the way down), you assign the current counter value to the node's preorder value. The second time you visit a node (on the way up), you assign the current counter value to the node's maxpreorder value. What this does is number the nodes in preorder, and the maxpreorder of each node is the maximum of the preorder numbers of its descendants. 
Once these numbers have been precomputed, dominance checking can be done in constant time using the identity:&lt;br /&gt;&lt;pre&gt;A dominates B iff preorder(B) &gt;= preorder(A) &amp;amp; preorder(B) &lt;= maxpreorder(A)&lt;/pre&gt;&lt;br /&gt;Of course, this works for any tree, and if you plan on doing many repeated descendant-of tests, it is worth precomputing the pre/maxpre numbers for each node in this manner. This addresses #3 above.&lt;br /&gt;&lt;br /&gt;Here is a control flow graph, with the numbers denoting &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/rpo;hb=HEAD"&gt;reverse post order&lt;/a&gt; on basic blocks:&lt;br /&gt;&lt;img src="http://factorcode.org/array-any-cfg.png"&gt;&lt;br /&gt;and here is the corresponding dominator tree:&lt;br /&gt;&lt;img src="http://factorcode.org/array-any-dom.png"&gt;&lt;br /&gt;(These diagrams were generated using the &lt;a href="http://www.graphviz.org/"&gt;Graphviz&lt;/a&gt; tool together with the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/compiler/cfg/graphviz;hb=HEAD"&gt;compiler.cfg.graphviz&lt;/a&gt; vocabulary.)&lt;br /&gt;&lt;br /&gt;The paper also gives an efficient algorithm for computing dominance frontiers, but I do not need those, for reasons given in the next section.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SSA construction&lt;/h3&gt;&lt;br /&gt;The classical approach for constructing &lt;a href="http://en.wikipedia.org/wiki/Static_single_assignment_form"&gt;SSA form&lt;/a&gt;, which yields what is known as &lt;i&gt;minimal SSA&lt;/i&gt;, involves three steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Computing dominance frontiers for each basic block in the control flow graph&lt;/li&gt;&lt;li&gt;For every value, take the set of basic blocks which have a definition of the value, compute the &lt;i&gt;iterated&lt;/i&gt; dominance frontier of this set, and insert a phi instruction (with dummy inputs) in each member of the set&lt;/li&gt;&lt;li&gt;Walk 
the dominator tree, renaming definitions and usages in order to enforce static single assignment, updating phi instruction inputs along the way.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;This approach has three problems:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Computing iterated dominance frontiers is expensive, and this is done for every value defined in more than one block&lt;/li&gt;&lt;li&gt;Too many phi instructions are inserted, and most end up being eliminated as dead code later&lt;/li&gt;&lt;li&gt;The renaming algorithm, as originally specified, requires too much work to be done on the walk back up the dominator tree, with each block's instructions being traversed both on the way down and on the way up&lt;/li&gt;&lt;/ol&gt;Nevertheless, the algorithm is simple and easy to understand. It is explained in the original paper on SSA form, &lt;a href="http://citeseer.ist.psu.edu/cytron91efficiently.html"&gt;Efficiently computing static single assignment form and the control dependence graph&lt;/a&gt;; straightforward pseudocode can be found in these &lt;a href="http://llvm.cs.uiuc.edu/~vadve/CS526/public_html/Notes/4ssa.4up.pdf"&gt;lecture notes&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The so-called "pruned SSA form" addresses the issue of too many phi instructions being inserted. It is a minor modification of minimal SSA construction. Prior to inserting phi instructions, liveness information is computed for each block. Then, a phi instruction is only inserted for a value if the value is live into the block. 
Computing liveness is somewhat expensive, and the so-called "semi-pruned SSA form" uses a simple heuristic to approximate liveness: phi nodes are only inserted for values which are used in blocks other than those they are defined in.&lt;br /&gt;&lt;br /&gt;An algorithm for computing iterated dominance frontiers which does not require dominance frontiers to be computed first was described in the paper &lt;a href="http://portal.acm.org/citation.cfm?id=1065887.1065890"&gt;A Practical and Fast Iterative Algorithm for Phi-Function Computation Using DJ Graphs&lt;/a&gt; by Dibyendu Das and U. Ramakrishna.&lt;br /&gt;&lt;br /&gt;Finally, the paper introducing semi-pruned SSA form, titled &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.9683"&gt;Practical Improvements to the Construction and Destruction of Static Single Assignment Form&lt;/a&gt;, proposes a slightly more efficient renaming algorithm.&lt;br /&gt;&lt;br /&gt;So the SSA construction algorithm I implemented in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/ssa/construction;hb=HEAD"&gt;compiler.cfg.ssa.construction&lt;/a&gt; vocabulary is a combination of these three papers. First, I compute merge sets using the DJ-Graph algorithm, then, I use liveness information for placing phi instructions, and finally, I use the improved renaming algorithm.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SSA destruction&lt;/h3&gt;&lt;br /&gt;To get out of SSA form and back to imperative code which can be executed by a machine, phi instructions must be eliminated. The approach originally described by Cytron et al. gives incorrect results in many cases; the correct algorithm is well known now, but it introduces many additional copy instructions. 
The classical approach for eliminating copies ("copy coalescing") is to do it as part of register allocation; if two values are related by a copy but do not otherwise interfere, they can be assigned to the same physical register, and the copy can be deleted from the program. This works for a graph-coloring approach, but with linear scan, you're limited in how much global knowledge you can have while performing register allocation, and accurate interference information is difficult to obtain.&lt;br /&gt;&lt;br /&gt;Factor's register allocation performs some basic coalescing, mostly to eliminate copies arising from conversion to two-operand form on x86. However, phi nodes introduce copies with more complex interferences and my approach didn't work there, so even though stack analysis eliminated many memory to register and register to memory moves, the generated code had a large number of register to register moves, which is a bottleneck for instruction decoding, not to mention wastes valuable cache space.&lt;br /&gt;&lt;br /&gt;Instead of attempting to coalesce copies arising from phi instructions in the register allocator, a more modern approach is to do this as part of SSA destruction -- instead of converting phi instructions to copies, the goal is to avoid inserting as many copies as possible in the first place.&lt;br /&gt;&lt;br /&gt;A nice algorithm for SSA destruction with coalescing is detailed in the paper &lt;a href="http://portal.acm.org/citation.cfm?id=512529.512534"&gt;Fast copy coalescing and live-range identification&lt;/a&gt;, by Zoran Budimlic et al. 
The algorithm is very clever -- the two key results are a constant-time interference test between SSA values using dominance and live range information, and a linear-time interference test in a set of variables using "dominance forests", which are easy to construct and enable you to rule out most pairs of values for interference tests altogether.&lt;br /&gt;&lt;br /&gt;This algorithm is implemented in &lt;a href="http://llvm.org"&gt;LLVM&lt;/a&gt; (&lt;code&gt;lib/CodeGen/StrongPHIElimination.cpp&lt;/code&gt;) and I essentially ported the C++ code to Factor. You can find the code in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/ssa/destruction;hb=HEAD"&gt;compiler.cfg.ssa.destruction&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;br /&gt;I tweaked the algorithm a little -- instead of inserting sequential copies, I insert parallel copies, as detailed in the paper &lt;a href="http://hal.archives-ouvertes.fr/docs/00/34/99/25/PDF/OutSSA-RR.pdf"&gt;Revisiting Out-of-SSA Translation for Correctness, Code Quality and Efficiency&lt;/a&gt; by Benoit Boissinot et al. The parallel copy algorithm is implemented in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/parallel-copy;hb=HEAD"&gt;compiler.cfg.parallel-copy&lt;/a&gt;. I used it not only in SSA destruction, but also to clean up some hackish approximations of the same problem in global stack analysis and the linear scan register allocator's &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/linear-scan/resolve;hb=HEAD"&gt;resolve pass&lt;/a&gt;. 
I didn't implement the rest of the paper, because I found it hard to follow; it claims to provide a better algorithm than Budimlic et al., but the latter is good enough for now, and having a working implementation in the form of the LLVM pass was very valuable in implementing it in Factor.&lt;br /&gt;&lt;br /&gt;This pass was very effective in eliminating copy instructions; it generates 75% fewer copies than the old phi elimination pass, which simply used the naive algorithm.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Loop head rotation&lt;/h3&gt;&lt;br /&gt;The last optimization I added eliminates an unconditional branch in some loops. Consider the following CFG:&lt;br /&gt;&lt;img src="http://factorcode.org/fixnum-fast-times-cfg.png"&gt;&lt;br /&gt;A valid way to linearize the basic blocks is in reverse post order, that is 0, 1, 2, 3, 4, 5. However, with this linearization, block 3 has an unconditional branch back to block 2, and block 2 has a conditional branch which either falls through to 3 or jumps to 4. So on every loop iteration, two jumps are executed (the conditional jump at 2 and the unconditional jump at 3). If, instead, the CFG was linearized as 0, 1, 3, 2, 4, 5, then while 1 would have an unconditional jump to 2, 2 would have a conditional jump back to 3 and 3 would fall through to 2. So upon entry to the loop, an extra unconditional jump (from 1 to 2) executes, but on each iteration, there is just the single conditional backward jump at 2. 
This improves performance slightly and is easy to implement; the code is in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/linearization/order;hb=HEAD"&gt;compiler.cfg.linearization.order&lt;/a&gt; vocabulary, and I borrowed the algorithm from &lt;a href="http://sbcl.org"&gt;SBCL&lt;/a&gt;'s &lt;a href="http://sbcl.cvs.sourceforge.net/viewvc/sbcl/sbcl/src/compiler/control.lisp?revision=1.11&amp;view=markup"&gt;src/compiler/control.lisp&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;There are some performance regressions I need to work out, because global stack analysis introduces too many partial redundancies for some types of code, and inline GC checks are currently disabled because of an unrelated issue I need to fix. It will take a few more days of tweaking to sort things out, and then I will post some benchmarks. Early results are already very promising on benchmarks with loops; not just the trivial counted loop example above.&lt;br /&gt;&lt;br /&gt;Next steps are global float unboxing, unboxed 32-bit and 64-bit integer arithmetic, and SSE intrinsics. To help with the latter, &lt;a href="http://duriansoftware.com"&gt;Joe Groff&lt;/a&gt; was kind enough to add support for all SSE1/2/3/4 instructions to &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/x86/assembler;hb=HEAD"&gt;Factor's x86 assembler&lt;/a&gt;. 
Finally, thanks to Cameron Zwarich for pointing me at some of the papers I linked to above.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-7198889069604652714?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/7198889069604652714/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=7198889069604652714' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7198889069604652714'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7198889069604652714'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/07/dataflow-analysis-computing-dominance.html' title='Dataflow analysis, computing dominance, and converting to and from SSA form'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3775441976278425459</id><published>2009-07-17T02:19:00.005-04:00</published><updated>2009-07-17T05:56:35.282-04:00</updated><title type='text'>Improved value numbering, branch splitting, and faster overflow checks</title><content type='html'>Hot on the heels of the &lt;a href="http://factor-language.blogspot.com/2009/07/improvements-to-factors-register.html"&gt;improved register allocator&lt;/a&gt;, I've added some new optimizations to the &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;low-level 
optimizer&lt;/a&gt; and made existing optimizations stronger.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improved value numbering&lt;/h3&gt;&lt;br /&gt;Local value numbering now detects more congruences among values computed by &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD"&gt;low level IR instructions&lt;/a&gt;. All the changes were either in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/value-numbering/rewrite/rewrite.factor;hb=HEAD"&gt;rewrite&lt;/a&gt; or &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/value-numbering/simplify/simplify.factor;hb=HEAD"&gt;simplify&lt;/a&gt; steps.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Some additional arithmetic identities. For instance, the result of adding 0 to a value, or subtracting 0 from it, is congruent to the value itself. The high-level optimizer implements the same optimization on Factor's generic arithmetic words, but sometimes low-level optimizations introduce additional redundancies.&lt;/li&gt;&lt;li&gt;A &lt;code&gt;##add&lt;/code&gt; instruction where one of the two operands is the result of a &lt;code&gt;##load-immediate&lt;/code&gt; is now converted into a &lt;code&gt;##add-imm&lt;/code&gt;, saving a register. A similar identity holds for other commutative operations, such as &lt;code&gt;##mul&lt;/code&gt;, &lt;code&gt;##and&lt;/code&gt;, &lt;code&gt;##or&lt;/code&gt;, &lt;code&gt;##xor&lt;/code&gt;, and &lt;code&gt;##compare&lt;/code&gt; with a condition code of &lt;code&gt;cc=&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Constant folding. Arithmetic instructions where both operands are constants are replaced by a &lt;code&gt;##load-immediate&lt;/code&gt; containing the result. A special case is a subtraction where both operands are congruent; this is replaced by a load of 0. 
Again, the high-level optimizer performs constant folding also, but low-level optimizations sometimes introduce redundancies or expose opportunities for optimization which were not apparent in high-level IR.&lt;/li&gt;&lt;li&gt;Reassociation. If an &lt;code&gt;##add-imm&lt;/code&gt; instruction's first input is the result of another &lt;code&gt;##add-imm&lt;/code&gt;, then a new &lt;code&gt;##add-imm&lt;/code&gt; is made which computes the same result with a single addition. Algebraically, this corresponds to the identity &lt;code&gt;(x + a) + b = x + (a + b)&lt;/code&gt;, where &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are known at compile-time. Similar transformations are made for other associative operations, such as &lt;code&gt;##sub&lt;/code&gt;, &lt;code&gt;##mul&lt;/code&gt;, &lt;code&gt;##and&lt;/code&gt;, &lt;code&gt;##or&lt;/code&gt; and &lt;code&gt;##xor&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Constant branch folding. If both inputs to a &lt;code&gt;##compare&lt;/code&gt; or &lt;code&gt;##compare-branch&lt;/code&gt; are constant, or if both are congruent to the same value, then the result of the comparison can be computed at compile-time, and one of the two successor blocks can be deleted.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;While the high-level optimizer performs constant folding and unreachable code elimination as part of the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/tree/propagation/propagation.factor;hb=HEAD"&gt;SCCP&lt;/a&gt; pass, low-level optimizations which are done before value numbering can expose optimization opportunities which were invisible in high-level IR. 
For example, consider the following code:&lt;br /&gt;&lt;code&gt;[ { vector } declare [ length ] [ length ] bi &amp;lt; ]&lt;/code&gt;&lt;br /&gt;Using the &lt;code&gt;optimized.&lt;/code&gt; word, we can see what the high-level optimizer transforms this into:&lt;br /&gt;&lt;pre&gt;[ dup &gt;R 3 slot R&gt; 3 slot fixnum&amp;lt; ]&lt;/pre&gt;&lt;br /&gt;The high-level optimizer only attempts to reason about immutable slots, but a vector's length is mutable since a vector may have elements added to it, and so the high-level optimizer cannot detect that the two inputs to &lt;code&gt;fixnum&amp;lt;&lt;/code&gt; are equal. After conversion into low-level IR, the code becomes the following sequence of &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD"&gt;SSA instructions&lt;/a&gt;:&lt;br /&gt;&lt;pre&gt;_label 0 f &lt;br /&gt;##prologue f &lt;br /&gt;_label 1 f &lt;br /&gt;##peek V int-regs 1 D 0 f &lt;br /&gt;##copy V int-regs 4 V int-regs 1 f &lt;br /&gt;##slot-imm V int-regs 5 V int-regs 4 3 7 f &lt;br /&gt;##copy V int-regs 8 V int-regs 1 f &lt;br /&gt;##slot-imm V int-regs 9 V int-regs 8 3 7 f &lt;br /&gt;##copy V int-regs 10 V int-regs 5 f &lt;br /&gt;##copy V int-regs 11 V int-regs 9 f &lt;br /&gt;##compare V int-regs 12 V int-regs 10 V int-regs 11 cc&amp;lt; V int-regs 13 f &lt;br /&gt;##replace V int-regs 12 D 0 f &lt;br /&gt;##epilogue f &lt;br /&gt;##return f &lt;br /&gt;_spill-counts f f&lt;/pre&gt;&lt;br /&gt;Now, the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/alias-analysis/alias-analysis.factor;hb=HEAD"&gt;low-level alias analysis optimization&lt;/a&gt; is able to detect that the second &lt;code&gt;##slot-imm&lt;/code&gt; is redundant, and it transforms the code into the following:&lt;br /&gt;&lt;pre&gt;_label 0 f &lt;br /&gt;##prologue f &lt;br /&gt;_label 1 f &lt;br /&gt;##peek V int-regs 1 D 0 f &lt;br /&gt;##copy V 
int-regs 4 V int-regs 1 f &lt;br /&gt;##slot-imm V int-regs 5 V int-regs 4 3 7 f &lt;br /&gt;##copy V int-regs 9 V int-regs 5 f &lt;br /&gt;##copy V int-regs 10 V int-regs 5 f &lt;br /&gt;##copy V int-regs 11 V int-regs 9 f &lt;br /&gt;##compare V int-regs 12 V int-regs 10 V int-regs 11 cc&amp;lt; V int-regs 13 f &lt;br /&gt;##replace V int-regs 12 D 0 f &lt;br /&gt;##epilogue f &lt;br /&gt;##return f &lt;br /&gt;_spill-counts f f&lt;/pre&gt;&lt;br /&gt;Now, notice that both inputs to the &lt;code&gt;##compare&lt;/code&gt; instruction (vreg #10 and vreg #11) are copies of the same value, vreg #5. Value numbering detects this congruence since copy propagation is just a special case of value numbering; it then simplifies the comparison down to a constant load of &lt;code&gt;f&lt;/code&gt;, since for any integer &lt;code&gt;x&lt;/code&gt;, we have &lt;code&gt;x &amp;lt; x =&gt; false&lt;/code&gt;. After dead code elimination, the result is the following:&lt;br /&gt;&lt;pre&gt;_label 0 f &lt;br /&gt;##prologue f &lt;br /&gt;_label 1 f &lt;br /&gt;##load-immediate V int-regs 12 5 f &lt;br /&gt;##replace V int-regs 12 D 0 f &lt;br /&gt;##epilogue f &lt;br /&gt;##return f &lt;br /&gt;_spill-counts f f&lt;/pre&gt;&lt;br /&gt;While no programmer would explicitly write such redundant code, it comes up after inlining. For example, the &lt;a href="http://docs.factorcode.org/content/word-push%2csequences.html"&gt;push&lt;/a&gt; word, which adds an element at the end of a sequence, is implemented as follows:&lt;br /&gt;&lt;pre&gt;: push ( elt seq -- ) [ length ] [ set-nth ] bi ;&lt;br /&gt;&lt;br /&gt;HINTS: push { vector } { sbuf } ;&lt;/pre&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/article-hints.html"&gt;hints&lt;/a&gt; tell the compiler to compile specialized versions of this word for vectors and string buffers. 
While the generic versions work on any sequence, the hinted versions are used when the input types match, and the result can be better performance if generic words are inlined (in this case, &lt;a href="http://docs.factorcode.org/content/word-length%2csequences.html"&gt;length&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/word-set-nth%2csequences.html"&gt;set-nth&lt;/a&gt;). Now, after a bunch of code is inlined, a piece of code comes up which compares the insertion index against the vector's length; however, the insertion index &lt;i&gt;is&lt;/i&gt; the length in this case, and so value numbering is now able to optimize out a conditional which could not be optimized out before. Of course I could have just written a specialized version of &lt;code&gt;push&lt;/code&gt; for vectors (or even built it into the VM) but I'd rather implement general optimizations and write my code in a high-level style.&lt;br /&gt;&lt;br /&gt;I'd like to thank &lt;a href="http://docs.factorcode.org/content/author-Doug%20Coleman.html"&gt;Doug Coleman&lt;/a&gt; for implementing some of the above improvements.&lt;br /&gt;&lt;h3&gt;Branch splitting&lt;/h3&gt;&lt;br /&gt;Roughly speaking, branch splitting is the following transform:&lt;br /&gt;&lt;pre&gt;[ A [ B ] [ C ] if D ] =&gt; [ A [ B D ] [ C D ] if ]&lt;/pre&gt;&lt;br /&gt;It is only run if D is a small amount of code. Branch splitting runs before value numbering, so value numbering's constant branch folding can help simplify the resulting control flow. 
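&lt;br /&gt;&lt;br /&gt;To make the shape of the transform concrete, here is a rough sketch in Python rather than Factor. It is not the code from the Factor compiler: the dictionary-based graph, the &lt;code&gt;split_branches&lt;/code&gt; name and the &lt;code&gt;max_size&lt;/code&gt; threshold are all made up for illustration, and the real pass certainly handles more cases.&lt;br /&gt;
&lt;pre&gt;
```python
# Toy control flow graph: block name -> instructions and successor names.
# The layout mirrors [ A [ B ] [ C ] if D ]; this representation is hypothetical.

def predecessors(cfg, name):
    """Names of all blocks that list `name` as a successor."""
    return [b for b, blk in cfg.items() if name in blk["succ"]]


def split_branches(cfg, max_size=2):
    """Give each predecessor its own copy of every small join block."""
    result = {name: {"code": list(blk["code"]), "succ": list(blk["succ"])}
              for name, blk in cfg.items()}
    for name in list(result):
        blk = result[name]
        preds = predecessors(result, name)
        # Only split blocks with several predecessors and little code.
        if len(preds) >= 2 and max_size >= len(blk["code"]):
            for i, pred in enumerate(preds):
                clone = f"{name}.{i}"
                result[clone] = {"code": list(blk["code"]),
                                 "succ": list(blk["succ"])}
                succ = result[pred]["succ"]
                succ[succ.index(name)] = clone  # retarget the edge
            del result[name]
    return result


cfg = {
    "A": {"code": ["a"], "succ": ["B", "C"]},
    "B": {"code": ["b"], "succ": ["D"]},
    "C": {"code": ["c"], "succ": ["D"]},
    "D": {"code": ["d"], "succ": []},
}
after = split_branches(cfg)
```
&lt;/pre&gt;
In this toy graph, D is replaced by two copies, one reached only from B and one reached only from C, which is the graph-level counterpart of rewriting the quotation into &lt;code&gt;[ A [ B D ] [ C D ] if ]&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;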
Branch splitting is implemented in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/branch-splitting/branch-splitting.factor;hb=HEAD"&gt;compiler.cfg.branch-splitting&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;br /&gt;For example, suppose we have the following code:&lt;br /&gt;&lt;pre&gt;[ [ t ] [ f ] if [ 1 ] [ 2 ] if ]&lt;/pre&gt;&lt;br /&gt;The high-level optimizer is unable to simplify this further, since it does not work on a control flow graph. The low level optimizer builds a control flow graph with the following shape:&lt;br /&gt;&lt;img src="http://factorcode.org/cfg-before-splitting.png"&gt;&lt;br /&gt;The basic block in the middle is a candidate for splitting since it has two predecessors, and it is short. After splitting, the result looks like so:&lt;br /&gt;&lt;img src="http://factorcode.org/cfg-after-splitting.png"&gt;&lt;br /&gt;Now, at this point, the only thing that this has achieved is eliminating one unconditional jump at the expense of code size. However, after stack analysis and value numbering run, some redundant work is eliminated:&lt;br /&gt;&lt;img src="http://factorcode.org/cfg-after-optimization.png"&gt;&lt;br /&gt;One example of real code that benefits from branch splitting includes code that calls the &lt;code&gt;=&lt;/code&gt; word followed by a conditional. The &lt;code&gt;=&lt;/code&gt; word is inlined, and it has a conditional inside of it; some branches of the conditional push constants &lt;code&gt;t&lt;/code&gt; and &lt;code&gt;f&lt;/code&gt;, and others call non-inline words. 
The result is that the branches which push a constant boolean can directly jump to the destination block without the overhead of storing and testing a boolean value.&lt;br /&gt;&lt;br /&gt;Once again thanks to &lt;a href="http://docs.factorcode.org/content/author-Doug%20Coleman.html"&gt;Doug Coleman&lt;/a&gt; for collaborating with me on the implementation of branch splitting.&lt;br /&gt;&lt;h3&gt;Block joining&lt;/h3&gt;&lt;br /&gt;Block joining is a simple optimization which merges basic blocks connected by a single control flow edge. Various low-level optimizations leave the control flow graph with a large number of empty or very small blocks with no conditional control flow between them. Block joining helps improve effectiveness of local optimizations by creating larger basic blocks.&lt;br /&gt;&lt;br /&gt;In the above example for branch splitting, block joining will eliminate the four empty basic blocks that remain after optimization.&lt;br /&gt;&lt;br /&gt;Block joining is implemented in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/block-joining/block-joining.factor;hb=HEAD"&gt;compiler.cfg.block-joining&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;h3&gt;Faster overflowing integer arithmetic intrinsics&lt;/h3&gt;&lt;br /&gt;In the lucky cases where the compiler can eliminate overflow checks, the various math operations become single instructions in the low-level IR which eventually map directly to machine instructions. In the general case, however, an overflow check has to be generated. 
In the event of overflow, a bignum is allocated, which may in turn involve running the garbage collector, so it is quite an expensive operation compared to a single machine arithmetic operation.&lt;br /&gt;&lt;br /&gt;Previously, the overflowing fixnum operations were represented in low-level IR as single, indivisible units, and just like subroutine calls and returns, they were "sync points" which meant that all values in registers had to be saved to the data and retain stacks before, and reloaded after. This is because of how these operations were implemented; they would perform the arithmetic, do an overflow check, and in the event of overflow, they would invoke a subroutine which handled the overflow. Keeping registers live across this operation was not sound in the event of overflow, since the subroutine call would clobber them.&lt;br /&gt;&lt;br /&gt;No longer. Now, an overflowing fixnum operation breaks down into several nodes in the control flow graph, and the arithmetic part, the overflow check and the bignum allocation are represented separately. In particular, the code to save registers to the stack and reload them after is only generated in the overflow case, so in the event of no overflow, which happens much more frequently, execution can "fall through" and continue using the same registers as before.&lt;br /&gt;&lt;br /&gt;The code that generates low-level instructions and control flow graph nodes for overflowing fixnum operations is defined in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/intrinsics/fixnum/fixnum.factor;hb=HEAD"&gt;compiler.cfg.intrinsics.fixnum&lt;/a&gt; vocabulary.&lt;br /&gt;&lt;h3&gt;Faster integer shifts&lt;/h3&gt;&lt;br /&gt;This last optimization is a very minor one, but it made a difference on benchmarks. Previously shifts with a constant shift count and no overflow check would compile as a machine instruction, and all other forms of shifts would invoke subroutine calls. 
Now, machine instructions are generated in the case where the shift count is unknown, but the result is still known to be small enough not to require an overflow check.&lt;br /&gt;&lt;h3&gt;Benchmark results&lt;/h3&gt;&lt;br /&gt;Here are the results of running &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/benchmark;hb=HEAD"&gt;Factor's benchmark suite&lt;/a&gt; on a build from 31st May 2009, before I started working on the current set of low-level optimizer improvements (which includes the register allocator I blogged about previously), and from today.&lt;br /&gt;&lt;table&gt; &lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Before&lt;/th&gt;&lt;th&gt;After&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.backtrack          &lt;/td&gt;&lt;td&gt;1.767561         &lt;/td&gt;&lt;td&gt;1.330641           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.base64             &lt;/td&gt;&lt;td&gt;1.997951         &lt;/td&gt;&lt;td&gt;1.738677           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.beust1             &lt;/td&gt;&lt;td&gt;2.765257         &lt;/td&gt;&lt;td&gt;2.461088           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.beust2             &lt;/td&gt;&lt;td&gt;3.584958         &lt;/td&gt;&lt;td&gt;1.694427           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.binary-search      &lt;/td&gt;&lt;td&gt;1.55002          &lt;/td&gt;&lt;td&gt;1.574595           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.binary-trees       &lt;/td&gt;&lt;td&gt;1.845798         &lt;/td&gt;&lt;td&gt;1.733355           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.bootstrap1         &lt;/td&gt;&lt;td&gt;10.860492        &lt;/td&gt;&lt;td&gt;11.447687          &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dawes              &lt;/td&gt;&lt;td&gt;0.229999         &lt;/td&gt;&lt;td&gt;0.161726           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch1          &lt;/td&gt;&lt;td&gt;2.015653         
&lt;/td&gt;&lt;td&gt;2.119268           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch2          &lt;/td&gt;&lt;td&gt;1.817941         &lt;/td&gt;&lt;td&gt;1.216618           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch3          &lt;/td&gt;&lt;td&gt;2.568987         &lt;/td&gt;&lt;td&gt;1.899128           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch4          &lt;/td&gt;&lt;td&gt;2.319587         &lt;/td&gt;&lt;td&gt;2.032182           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.dispatch5          &lt;/td&gt;&lt;td&gt;2.346744         &lt;/td&gt;&lt;td&gt;1.614045           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-0       &lt;/td&gt;&lt;td&gt;0.146716         &lt;/td&gt;&lt;td&gt;0.12589            &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-1       &lt;/td&gt;&lt;td&gt;0.430314         &lt;/td&gt;&lt;td&gt;0.342426           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.empty-loop-2       &lt;/td&gt;&lt;td&gt;0.429012         &lt;/td&gt;&lt;td&gt;0.342097           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.euler150           &lt;/td&gt;&lt;td&gt;16.901451        &lt;/td&gt;&lt;td&gt;15.288867          &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.euler186           &lt;/td&gt;&lt;td&gt;8.805434999999999&lt;/td&gt;&lt;td&gt;7.920478           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.fannkuch           &lt;/td&gt;&lt;td&gt;3.202698         &lt;/td&gt;&lt;td&gt;2.964037           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.fasta              &lt;/td&gt;&lt;td&gt;5.52608          &lt;/td&gt;&lt;td&gt;4.934112           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc0                &lt;/td&gt;&lt;td&gt;2.15066          &lt;/td&gt;&lt;td&gt;1.993158           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc1                &lt;/td&gt;&lt;td&gt;4.984841         &lt;/td&gt;&lt;td&gt;4.961272           
&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.gc2                &lt;/td&gt;&lt;td&gt;3.327706         &lt;/td&gt;&lt;td&gt;3.265462           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.iteration          &lt;/td&gt;&lt;td&gt;3.736756         &lt;/td&gt;&lt;td&gt;3.30438            &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.javascript         &lt;/td&gt;&lt;td&gt;9.79904          &lt;/td&gt;&lt;td&gt;9.164517           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.knucleotide        &lt;/td&gt;&lt;td&gt;0.282296         &lt;/td&gt;&lt;td&gt;0.251879           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.mandel             &lt;/td&gt;&lt;td&gt;0.125304         &lt;/td&gt;&lt;td&gt;0.123945           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.md5                &lt;/td&gt;&lt;td&gt;0.946516         &lt;/td&gt;&lt;td&gt;0.85062            &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nbody              &lt;/td&gt;&lt;td&gt;3.982774         &lt;/td&gt;&lt;td&gt;3.349595           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nested-empty-loop-1&lt;/td&gt;&lt;td&gt;0.116351         &lt;/td&gt;&lt;td&gt;0.135936           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nested-empty-loop-2&lt;/td&gt;&lt;td&gt;0.692668         &lt;/td&gt;&lt;td&gt;0.438932           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve             &lt;/td&gt;&lt;td&gt;0.714772         &lt;/td&gt;&lt;td&gt;0.698262           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve-bits        &lt;/td&gt;&lt;td&gt;1.451828         &lt;/td&gt;&lt;td&gt;0.907247           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.nsieve-bytes       &lt;/td&gt;&lt;td&gt;0.312481         &lt;/td&gt;&lt;td&gt;0.300053           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.partial-sums       &lt;/td&gt;&lt;td&gt;1.205072         &lt;/td&gt;&lt;td&gt;1.221245           &lt;/td&gt;&lt;/tr&gt; 
&lt;tr&gt;&lt;td&gt;benchmark.pidigits           &lt;/td&gt;&lt;td&gt;16.088346        &lt;/td&gt;&lt;td&gt;16.159176          &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.random             &lt;/td&gt;&lt;td&gt;2.574773         &lt;/td&gt;&lt;td&gt;2.706893           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.raytracer          &lt;/td&gt;&lt;td&gt;3.481714         &lt;/td&gt;&lt;td&gt;2.914116           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.recursive          &lt;/td&gt;&lt;td&gt;5.964279         &lt;/td&gt;&lt;td&gt;3.215277           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.regex-dna          &lt;/td&gt;&lt;td&gt;0.132406         &lt;/td&gt;&lt;td&gt;0.093095           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.reverse-complement &lt;/td&gt;&lt;td&gt;3.811822         &lt;/td&gt;&lt;td&gt;3.257535           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.ring               &lt;/td&gt;&lt;td&gt;1.756481         &lt;/td&gt;&lt;td&gt;1.79823            &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sha1               &lt;/td&gt;&lt;td&gt;2.267648         &lt;/td&gt;&lt;td&gt;1.473887           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sockets            &lt;/td&gt;&lt;td&gt;8.794280000000001&lt;/td&gt;&lt;td&gt;8.783398           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sort               &lt;/td&gt;&lt;td&gt;0.421628         &lt;/td&gt;&lt;td&gt;0.363383           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.spectral-norm      &lt;/td&gt;&lt;td&gt;3.830249         &lt;/td&gt;&lt;td&gt;4.036353           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.stack              &lt;/td&gt;&lt;td&gt;2.086594         &lt;/td&gt;&lt;td&gt;1.014408           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.sum-file           &lt;/td&gt;&lt;td&gt;0.528061         &lt;/td&gt;&lt;td&gt;0.422194           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.tuple-arrays       
&lt;/td&gt;&lt;td&gt;0.127335         &lt;/td&gt;&lt;td&gt;0.103421           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck1         &lt;/td&gt;&lt;td&gt;0.876559         &lt;/td&gt;&lt;td&gt;0.6723440000000001 &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck2         &lt;/td&gt;&lt;td&gt;0.878561         &lt;/td&gt;&lt;td&gt;0.671624           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.typecheck3         &lt;/td&gt;&lt;td&gt;0.86596          &lt;/td&gt;&lt;td&gt;0.670099           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.ui-panes           &lt;/td&gt;&lt;td&gt;0.426701         &lt;/td&gt;&lt;td&gt;0.372301           &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;benchmark.xml                &lt;/td&gt;&lt;td&gt;2.351934         &lt;/td&gt;&lt;td&gt;2.187999           &lt;/td&gt;&lt;/tr&gt; &lt;/table&gt;&lt;br /&gt;There are a couple of regressions I need to look into, but for the most part the results look pretty good. I also expect further gains to come with additional optimizations that I plan on implementing.&lt;br /&gt;&lt;h3&gt;Next steps&lt;/h3&gt;&lt;br /&gt;I'm going to keep working on the code generator for another little while. First of all, compile time has increased, so that needs to improve. Next, I'm going to implement better global optimization. At this point, values are stored in registers between basic blocks, unlike with the old code generator. However, loop variables are still stored on the stack between iterations because the analysis does not handle back edges yet. Fixing this, and making float unboxing work globally, is my next step. After that, I plan on adding support for unboxed word-size integers (32 or 64 bits, depending on platform) and some intrinsics for SSE2 vector operations on Intel CPUs. Next, the register allocator needs improved coalescing logic, and it also needs to support fixed register assignments as found on some x86 instructions which take implicit operands. 
Finally, I have a few optimizations I want to add to the &lt;a href="http://factor-language.blogspot.com/2008/08/new-optimizer.html"&gt;high level optimizer&lt;/a&gt;.</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3775441976278425459/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3775441976278425459' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3775441976278425459'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3775441976278425459'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html' title='Improved value numbering, branch splitting, and faster overflow checks'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-6903509954500347908</id><published>2009-07-09T19:49:00.007-04:00</published><updated>2009-07-09T21:52:13.907-04:00</updated><title type='text'>Improvements to Factor's register allocator</title><content type='html'>For the last couple of months I've been working on some improvements to Factor's &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;low-level optimizer and code generator&lt;/a&gt;. 
The major new change is that the control flow graph IR is now allowed to define registers that are used in more than one basic block. Previously, all values would be loaded from the data stack at the start of a basic block and stored to the stack at the end. Register allocation in this case is called local register allocation. Now the only time when values have to be saved to the stack is for subroutine calls. The part of the compiler most affected by this is of course the register allocator, because now it needs to do global register allocation.&lt;br /&gt;&lt;br /&gt;I will describe the new optimizations in a future blog post and talk about global register allocation here. To help people who are learning Factor or are simply curious about what real Factor code looks like, I've inserted lots of links from this blog post to our &lt;a href="http://gitweb.factorcode.org"&gt;online code repository browser&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Instead of rewriting the register allocator from scratch, I took a more incremental approach, gradually refactoring it to operate on the entire &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/cfg.factor;hb=HEAD"&gt;control flow graph&lt;/a&gt; instead of a basic block at a time. Generalizing the linear scan algorithm to do this is relatively straightforward, but there are some subtleties. 
Once again, I used &lt;a href="http://www.ssw.uni-linz.ac.at/Research/Papers/Wimmer04Master/"&gt;Christian Wimmer&lt;/a&gt;'s master's thesis as a guide; however, this time around I implemented almost the entire algorithm described there, instead of the simplification to the local case.&lt;br /&gt;&lt;br /&gt;First, recall some terminology: &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/registers/registers.factor;hb=HEAD"&gt;virtual registers&lt;/a&gt; are an abstraction of an ideal CPU's register file: there can be arbitrarily many of them, and each one is only ever assigned to once. The register allocator's job is to rewrite the program to use physical registers instead, possibly mapping a number of virtual registers to the same physical register, and inserting spills and reloads if there are insufficient physical registers to store all temporary values used by the program.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Linearization&lt;/h3&gt;&lt;br /&gt;A local register allocator operates on a basic block, which is a linear list of instructions. For the global case, linear scan still requires a linear list of instructions as input, so the first step is to traverse the CFG in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/rpo/rpo.factor;hb=HEAD"&gt;reverse post order&lt;/a&gt; and &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/numbering/numbering.factor;hb=HEAD"&gt;number the instructions&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Building live intervals&lt;/h3&gt;&lt;br /&gt;Once a linear ordering among all instructions has been established, the second step in linear scan register allocation is &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/live-intervals/live-intervals.factor;hb=HEAD"&gt;building live intervals&lt;/a&gt;. 
In the local case, a virtual register is defined in the same basic block as all of its usages. Furthermore, if we assume single assignment, then there will be only a single definition, and it will precede all uses. So the set of instructions where the register is live is a single interval, which begins at the definition and ends at the last use.&lt;br /&gt;&lt;br /&gt;In the global case, the situation is more complicated. First of all, because the linearization chosen is essentially arbitrary, and does not reflect the actual control flow in reality, instructions are not necessarily executed in order, so if a register is defined at position A and used at position B, it does not mean that the register is live at every position between A and B. Indeed, in the global case, the live "interval" of a register may consist of a number of disjoint ranges of instructions.&lt;br /&gt;&lt;br /&gt;As an example, consider the following C program:&lt;br /&gt;&lt;pre&gt;1: int x = ...;&lt;br /&gt;2: if(...) {&lt;br /&gt;3:     ... a bunch of code that does not use x&lt;br /&gt;4: } else {&lt;br /&gt;5:     int y = x + 1;&lt;br /&gt;6:     ... code that doesn't use x&lt;br /&gt;7: }&lt;br /&gt;8: ... more code that doesn't use x&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The value &lt;code&gt;x&lt;/code&gt; is live at position 1, and also at position 5. The last usage of &lt;code&gt;x&lt;/code&gt; is at line 5, so it is not live at any position after 5, but also, it is not live at position 3 either, since if the true branch is taken, then &lt;code&gt;x&lt;/code&gt; is never used at all. We say that &lt;code&gt;x&lt;/code&gt; has a &lt;i&gt;lifetime hole&lt;/i&gt; at position 3.&lt;br /&gt;&lt;br /&gt;Another case where lifetime holes come up is virtual registers with multiple definitions. 
The very last step in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/optimizer/optimizer.factor;hb=HEAD"&gt;the low level optimizer&lt;/a&gt;, right before register allocation, is &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/phi-elimination/phi-elimination.factor;hb=HEAD"&gt;phi node elimination&lt;/a&gt;. Essentially, phi node elimination turns this,&lt;br /&gt;&lt;pre&gt;if(informal) {&lt;br /&gt;    x1 = "hi";&lt;br /&gt;} else {&lt;br /&gt;    x2 = "hello";&lt;br /&gt;}&lt;br /&gt;x = phi(x1,x2);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;into the following:&lt;br /&gt;&lt;pre&gt;if(informal) {&lt;br /&gt;    x1 = "hi";&lt;br /&gt;    x = x1;&lt;br /&gt;} else {&lt;br /&gt;    x2 = "hello";&lt;br /&gt;    x = x2;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, the value &lt;code&gt;x&lt;/code&gt; is not live at the line &lt;code&gt;x2 = "hello";&lt;/code&gt;, because it hasn't been defined yet, and the previous definition is &lt;a href="http://en.wikipedia.org/wiki/Reaching_definition"&gt;not available&lt;/a&gt; either. So the lifetime interval for &lt;code&gt;x&lt;/code&gt; has a hole at this location.&lt;br /&gt;&lt;br /&gt;To represent such complex live intervals, the &lt;code&gt;live-interval&lt;/code&gt; data type now contains a list of live ranges. Since basic blocks cannot contain control flow or multiple definitions, there can be at most one live range per basic block per live interval. (The only time multiple definitions come up is as a result of Phi node elimination, and since the input is in SSA form this cannot introduce multiple assignments in the same block).&lt;br /&gt;&lt;br /&gt;Information from &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/liveness/liveness.factor;hb=HEAD"&gt;liveness analysis&lt;/a&gt; is used to construct live ranges. 
A basic block's live range contains the first definition of the register as well as the last use. If the register is in the "live in" set of the basic block, then the range begins at the block's first instruction, and if the register is in the "live out" set then the range ends at the block's last instruction. Note that even if a basic block has no usages of a register, it may appear in both the live in and live out sets; for example, the register might be defined in a predecessor block, and used in a successor block.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Allocating physical registers to live intervals&lt;/h3&gt;&lt;br /&gt;Once live intervals have been built, the next step is to assign physical registers to them. During this process, live intervals may be split, with register to memory, memory to register and register to register moves inserted in between; a single virtual register may be in different physical registers and memory locations during its lifetime.&lt;br /&gt;&lt;br /&gt;In the local case, the allocation algorithm maintains two pieces of state; the "unhandled set", which is a &lt;a href="http://docs.factorcode.org/content/article-heaps.html"&gt;heap&lt;/a&gt; of intervals sorted by start position, and an "active set", which is a list of intervals currently assigned to physical registers. The main algorithm removes the smallest element from the unhandled set, and then removes anything from the active set which ends before the new interval begins. The next step depends on whether or not any physical registers are free. If all physical registers are taken up in the active set, then a decision is made to spill either the new interval, or a member of the active set. Spilling will free up at least one physical register. At this point, a physical register is now free and can be assigned to the new interval, which is then added to the active set. 
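&lt;br /&gt;&lt;br /&gt;The local allocation loop described so far can be sketched in a few lines of Python. This is a simplified illustration and not Factor's allocator: the name &lt;code&gt;linear_scan&lt;/code&gt; and the interval representation are hypothetical, and where the real allocator splits intervals, this sketch simply spills a whole interval (whichever one ends last).&lt;br /&gt;
&lt;pre&gt;
```python
import heapq


def linear_scan(intervals, registers):
    """Map each vreg in {vreg: (start, end)} to a register or to 'spill'."""
    # Unhandled set: a heap of intervals ordered by start position.
    unhandled = [(start, end, vreg) for vreg, (start, end) in intervals.items()]
    heapq.heapify(unhandled)
    active = []              # (end, vreg) pairs currently holding a register
    free = list(registers)
    assignment = {}
    while unhandled:
        start, end, vreg = heapq.heappop(unhandled)
        # Expire active intervals which end before the new interval begins.
        for old_end, old in list(active):
            if start > old_end:
                active.remove((old_end, old))
                free.append(assignment[old])
        if not free:
            # Simplification: spill a whole interval instead of splitting it.
            last_end, victim = max(active)
            if last_end > end:
                active.remove((last_end, victim))
                free.append(assignment[victim])
                assignment[victim] = "spill"
            else:
                assignment[vreg] = "spill"
                continue
        assignment[vreg] = free.pop(0)
        active.append((end, vreg))
    return assignment
```
&lt;/pre&gt;
For instance, with registers R1 and R2 and intervals a = (0, 10), b = (1, 3) and c = (4, 6), the sketch places a in R1 and lets b and c share R2, since b has expired by the time c begins.&lt;br /&gt;&lt;br /&gt;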
Spilling may split intervals into smaller pieces, and add new elements to the unhandled set, but since the new split intervals are strictly smaller than the original interval, the process eventually terminates. Once the unhandled set is empty, allocation is complete.&lt;br /&gt;&lt;br /&gt;The global case is more complicated, primarily due to the presence of lifetime holes in live intervals. A third piece of &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/allocation/state/state.factor;hb=HEAD"&gt;state&lt;/a&gt; is maintained by the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/allocation/allocation.factor;hb=HEAD"&gt;allocation pass&lt;/a&gt;. This new set, the "inactive set", contains live intervals which have not ended yet, but are currently in a lifetime hole. To illustrate, suppose we have two physical registers, R1 and R2, and three virtual registers, V1, V2, and V3, with the following live intervals:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    [Program start ....... Program end]&lt;br /&gt;V1: &amp;lt;======&gt;                &amp;lt;=========&gt;&lt;br /&gt;V2:         &amp;lt;==============&gt;&lt;br /&gt;V3:              &amp;lt;================&gt;&lt;br /&gt;                 ^&lt;br /&gt;                 |&lt;br /&gt;           Current position&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Immediately before V3's live interval begins, the active set includes V2, and the inactive set contains V1, since V1 is not finished yet, but the current position is in a lifetime hole. Suppose that V1 was assigned to physical register R1, and V2 was assigned to physical register R2. In the local case, since the active set only includes one element, V2, we could conclude that R1 was free, and that V3 could be assigned to R1. However, this is invalid, since at a later point, V1 becomes active again, and V1 uses R1. 
Two overlapping live intervals cannot use the same physical register. So global linear scan also has to examine the inactive set in order to see if the new live interval can fit in an inactive interval's lifetime hole. In this case, it cannot, so we have to split V3, and insert a copy instruction in the middle:&lt;br /&gt;&lt;pre&gt;      [Program start ....... Program end]&lt;br /&gt;V1:   &amp;lt;======&gt;                &amp;lt;=========&gt;&lt;br /&gt;V2:           &amp;lt;==============&gt;&lt;br /&gt;V3_1:              &amp;lt;=========&gt;&lt;br /&gt;V3_2:                         &amp;lt;====&gt;&lt;br /&gt;                   ^&lt;br /&gt;                   |&lt;br /&gt;             Current position&lt;/pre&gt;&lt;br /&gt;With V3 split into V3_1 and V3_2, a valid register assignment is possible, with V3_1 stored in R1 and V3_2 stored in R2.&lt;br /&gt;&lt;br /&gt;In this case, a register assignment without any spilling is possible. However, sometimes, virtual registers have to be stored to memory. This is done by the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/allocation/spilling/spilling.factor;hb=HEAD"&gt;spilling code&lt;/a&gt; which uses a similar algorithm to the local case. The main complication is that in order to free up a physical register for the entire lifetime of the new interval, more than one interval may need to be split; zero or one active intervals, and zero or more inactive intervals.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Assigning physical registers to instructions&lt;/h3&gt;&lt;br /&gt;After live intervals are split, and physical registers are assigned to live intervals, &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/assignment/assignment.factor;hb=HEAD"&gt;the assignment pass&lt;/a&gt; iterates over all instructions, storing a mapping from virtual registers to physical registers in each instruction. 
Recall that this cannot be a global mapping, since a virtual register may reside in several different physical registers at different locations in the program.&lt;br /&gt;&lt;br /&gt;The algorithm here is essentially unchanged from the local case. Two additions are that at the start and end of each basic block, the assignment pass records which registers are live. This information is used by GC checks. GC checks are performed on basic block boundaries of blocks which allocate memory. Since a register may be defined in one basic block, and used in another, with a GC check in between, GC checks need to store registers which contain pointers to a special location in the call stack and pass a pointer to this location to the GC. &lt;a href="http://sbcl.org"&gt;Some language implementations&lt;/a&gt; use a conservative GC to get around having to do this, but I think recording accurate pointer root information is a superior approach. This liveness information is also used by the resolve pass, coming up next.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The resolve pass&lt;/h3&gt;&lt;br /&gt;Since linear scan does not take control flow into account, interval splitting will give incorrect results if implemented naively. Consider the following example program:&lt;br /&gt;&lt;pre&gt;x = ...;&lt;br /&gt;&lt;br /&gt;if(...) {&lt;br /&gt;    ...&lt;br /&gt;    y = x + 1;&lt;br /&gt;    ... some code with high register pressure here&lt;br /&gt;} else {&lt;br /&gt;    z = x + 2;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If the live interval of &lt;code&gt;x&lt;/code&gt; was spilled immediately after the line &lt;code&gt;y = x + 1;&lt;/code&gt; then the register allocator would rewrite it as follows -- it would be spilled after the last usage before the spill point, and reloaded before the first usage after the spill point:&lt;br /&gt;&lt;pre&gt;x = ...;&lt;br /&gt;&lt;br /&gt;if(...) {&lt;br /&gt;    ...&lt;br /&gt;    y = x + 1;&lt;br /&gt;    spill(x);&lt;br /&gt;    ... 
some code with high register pressure here&lt;br /&gt;} else {&lt;br /&gt;    reload(x);&lt;br /&gt;    z = x + 2;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In this case, however, the spill and reload locations are on different control flow paths, so the reload in the &lt;code&gt;else&lt;/code&gt; branch would read a spill slot that was never written on that path. Linear scan does not take control flow into account during the main algorithm; instead, this is fixed up by a subsequent &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/resolve/resolve.factor;hb=HEAD"&gt;resolve pass&lt;/a&gt;. The resolve pass looks at every edge in the control flow graph, and compares the physical register and spill slot assignments of all virtual registers which are live across the edge. If they differ, moves, spills and reloads are inserted on the edge. Conceptually, this pass is simple, but it feels hackish -- it seems that a more elegant register allocator would incorporate this into the previous stages of allocation. However, all formulations of linear scan I've seen so far work this way. The only tricky part about the resolve pass is that the moves have to be performed "in parallel"; for example, suppose we have two virtual registers V1 and V2, and two physical registers R1 and R2. Block A and block B are joined by an edge. At the end of block A, V1 is in R1, and V2 is in R2. At the start of block B, V1 is in R2 and V2 is in R1. Doing the moves one after the other, in either order, would give the wrong result; for correct behavior, the cycle has to be broken with a third temporary location. 
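&lt;br /&gt;&lt;br /&gt;To make the cycle-breaking idea concrete, here is a minimal Python sketch. It is not Factor's mapping code: the function name &lt;code&gt;sequentialize&lt;/code&gt; and the temporary location &lt;code&gt;TMP&lt;/code&gt; are made up for illustration, and a real allocator would also have to deal with spill slots and register classes.&lt;br /&gt;&lt;pre&gt;
```python
def sequentialize(moves, temp="TMP"):
    """Order a set of parallel moves, given as a dict {destination: source}.

    Returns a list of (destination, source) pairs that can be executed one
    after the other. `temp` is a hypothetical scratch location that is
    assumed to be free on the edge.
    """
    # Moves from a location to itself need no code at all.
    pending = {dst: src for dst, src in moves.items() if dst != src}
    ordered = []
    while pending:
        # A destination that no pending move still reads can be overwritten.
        ready = [dst for dst in pending if dst not in pending.values()]
        if ready:
            for dst in ready:
                ordered.append((dst, pending.pop(dst)))
        else:
            # Only cycles remain. Save one destination's old value in the
            # temporary, then redirect every move that wanted to read it.
            dst = next(iter(pending))
            ordered.append((temp, dst))
            for d, s in pending.items():
                if s == dst:
                    pending[d] = temp
    return ordered
```
&lt;/pre&gt;For the two-register swap described above, &lt;code&gt;sequentialize({"R1": "R2", "R2": "R1"})&lt;/code&gt; yields three moves: save R1 in the temporary, copy R2 into R1, then copy the temporary into R2.&lt;br /&gt;&lt;br /&gt;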
The logic for doing this in the general case is a bit complex; &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt; helped me implement the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/mapping/mapping.factor;hb=HEAD"&gt;mapping code&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;General observations&lt;/h3&gt;&lt;br /&gt;While implementing the register allocator improvements, I relied heavily on unit testing as well as an informal "design by contract"; basically, making extensive use of runtime assertions in the code itself. The &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler/cfg/linear-scan;hb=HEAD"&gt;compiler.cfg.linear-scan&lt;/a&gt; vocabulary weighs in at 3828 lines of code; &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/linear-scan/linear-scan-tests.factor;hb=HEAD"&gt;2484 lines of this are unit tests.&lt;/a&gt; This is certainly a much higher test:code ratio than most other Factor code I've written, and some of the tests are redundant or not very useful. However, they represent a log of the features I've implemented and the bugs I've fixed, and at the very least, minimize regressions.&lt;br /&gt;&lt;br /&gt;When testing the register allocator, I first got it working on x86-64. Here, there are enough registers that spilling almost never occurs, so I could focus on getting the primary logic correct. Then, I hacked the x86-64 backend to only use 4 physical registers, and debugged spilling.&lt;br /&gt;&lt;br /&gt;The whole project took a bit longer than I had hoped, but I managed to start implementing a number of additional optimizations in the meantime. 
When they are more complete I will blog about them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-6903509954500347908?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/6903509954500347908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=6903509954500347908' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/6903509954500347908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/6903509954500347908'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/07/improvements-to-factors-register.html' title='Improvements to Factor&apos;s register allocator'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-7353105402885558994</id><published>2009-05-26T21:19:00.005-04:00</published><updated>2009-05-26T22:36:44.954-04:00</updated><title type='text'>Factor's implementation of polymorphic inline caching</title><content type='html'>Polymorphic inline caching is a technique to speed up method dispatch by generating custom code at method call sites that checks the receiver against only those classes that have come up at that call site. 
If the receiver's class does not appear in the polymorphic inline cache (this case is known as a "cache miss"), then a new call stub is generated, which checks the receiver's class in addition to the classes already in the stub. This proceeds until the stub reaches a maximum size, after which point further cache misses rewrite the caller to call a general "megamorphic" stub which performs a full method dispatch. A full description can be found in the &lt;a href="http://research.sun.com/self/papers/ecoop91.ps.gz"&gt;original paper on the topic&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I implemented polymorphic inline caching and megamorphic caching to replace &lt;a href="http://factor-language.blogspot.com/2008/11/further-performance-gains.html"&gt;Factor's previous method dispatch implementation&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Call sites&lt;/h3&gt;&lt;br /&gt;A generic word's call site is in one of three states:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Cold state&lt;/b&gt; - the call instruction points to a stub which looks at the class of the object being dispatched upon, generates a new PIC stub containing a single entry, updates the call site, and jumps to the right method.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Inline cache state&lt;/b&gt; - the call instruction points to a generated PIC stub which has one or more class checks, followed by a jump to a method if the checks succeed. If none of the checks succeed, it jumps to the inline cache miss routine. 
The miss routine does one of two things:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If the PIC stub already has the maximum number of entries, it sets the call site to point to the generic word's megamorphic stub, shared by all megamorphic call sites.&lt;/li&gt;&lt;li&gt;If the PIC stub has fewer than the maximum number of entries, a new PIC stub is generated, with the same set of entries as the last one, together with a new entry for the class being dispatched on.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Megamorphic state&lt;/b&gt; - the call instruction points to the generic word's megamorphic stub. Further dispatches do not modify the call instruction.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;When code is loaded into the image, or a full garbage collection occurs, all call sites revert to the cold state. Loading code might define new methods or change class relationships, so caches have to be invalidated in that case to preserve correct language semantics. On a full GC, caches are reverted so that serially monomorphic code is optimized better; if a generic word is called in a long loop with one type, then in a long loop with another type, and so on, it will eventually become megamorphic. Resetting call sites once in a while (a full GC is a relatively rare event) ensures that inline caches better reflect what is going on with the code right now. After implementing PICs, I learned that the V8 JavaScript VM uses the same strategy to invalidate call sites once in a while, so it sounds like I did the right thing.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Direct class checks versus subclass checks&lt;/h3&gt;&lt;br /&gt;Factor supports various forms of subtyping. For example, tuple classes can inherit from other tuple classes, and also, a union class can be defined whose members are an arbitrary set of classes. When a method is defined on a class, it will apply to all subclasses as well, unless explicitly overridden. 
Checking if an object is an instance of a class X is a potentially expensive operation, because it is not enough to compare direct classes; superclasses and union members have to be inspected as well. Polymorphic inline caching only deals with direct classes, because direct class checks are much quicker (for tuples, just compare the tuple layouts for pointer equality; for instances of built-in classes, compare tag numbers). So for example if a generic word has a single method for the &lt;a href="http://docs.factorcode.org/content/word-sequence%2csequences.html"&gt;sequence&lt;/a&gt; class, and a call site calls the generic with arrays and strings, then the inline cache generated there will have two checks, one for strings, and one for arrays, and both checks will point at the same method.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Inline cache stubs&lt;/h3&gt;&lt;br /&gt;Inline cache stubs are generated by the VM in the file &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/inline_cache.cpp;hb=HEAD"&gt;inline_cache.cpp&lt;/a&gt;. This generates machine code, but I'll use C-like pseudocode for examples instead. 
An inline cache stub looks like this, where &lt;code&gt;index&lt;/code&gt; is the parameter number being dispatched on (generic words defined with &lt;a href="http://docs.factorcode.org/content/word-GENERIC__colon__%2csyntax.html"&gt;GENERIC:&lt;/a&gt; use 0, generic words defined with &lt;a href="http://docs.factorcode.org/content/word-GENERIC%23%2csyntax.html"&gt;GENERIC#&lt;/a&gt; can dispatch on any parameter):&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;void *obj = datastack[index];&lt;br /&gt;void *class = class_of(obj);&lt;br /&gt;if(class == cached_class_1) cached_method_1();&lt;br /&gt;else if(class == cached_class_2) cached_method_2();&lt;br /&gt;...&lt;br /&gt;else cache_miss(call_site_address,index,class,generic);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Getting the direct class of an object (what I call &lt;code&gt;class_of()&lt;/code&gt; in the pseudocode) is a relatively cheap operation. I briefly discussed pointer tagging, type headers and tuple layout headers in my &lt;a href="http://factor-language.blogspot.com/2009/05/factor-vm-ported-to-c.html"&gt;C++ VM post&lt;/a&gt;. In the most general case, getting the direct class looks like this, again in vague C-like pseudocode:&lt;br /&gt;&lt;pre&gt;if(tag(obj) == hi_tag) return obj.header;&lt;br /&gt;else if(tag(obj) == tuple_layout) return obj.layout;&lt;br /&gt;else return tag(obj);&lt;/pre&gt;&lt;br /&gt;However, if the cached classes consist only of tag classes (the 7 built-in types whose instances can be differentiated based on tag alone) then the generated stub doesn't need to do the other checks; it just compares the pointer tag against the cached classes. Similarly, if tag and tuple classes are in the cache but no hi-tag classes, one branch can be avoided. 
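&lt;br /&gt;&lt;br /&gt;The life cycle of a call site can be modeled in a few lines of Python. This is only a sketch under stated assumptions, not the VM's code: &lt;code&gt;CallSite&lt;/code&gt;, &lt;code&gt;full_dispatch&lt;/code&gt; and &lt;code&gt;MAX_PIC_ENTRIES&lt;/code&gt; are hypothetical names, Python's &lt;code&gt;type()&lt;/code&gt; stands in for &lt;code&gt;class_of()&lt;/code&gt;, and the megamorphic state simply falls back to a full lookup instead of using a per-word cache.&lt;br /&gt;&lt;pre&gt;
```python
MAX_PIC_ENTRIES = 3  # the default maximum PIC size given later in this post


class CallSite:
    """Toy model of one generic word call site (all names are hypothetical)."""

    def __init__(self, full_dispatch):
        # full_dispatch maps a direct class to a method; it models the slow,
        # subclass-aware method lookup.
        self.full_dispatch = full_dispatch
        self.entries = []      # cached (direct class, method) pairs
        self.state = "cold"

    def call(self, obj):
        cls = type(obj)        # direct class only, like class_of()
        if self.state != "megamorphic":
            for cached_cls, method in self.entries:
                if cls is cached_cls:
                    return method(obj)       # inline cache hit
            self.miss(cls)
        return self.full_dispatch(cls)(obj)

    def miss(self, cls):
        if len(self.entries) == MAX_PIC_ENTRIES:
            # The stub is full: stop patching this call site.
            self.state = "megamorphic"
            self.entries = []
        else:
            # Model generating a new stub with one more entry.
            self.entries.append((cls, self.full_dispatch(cls)))
            self.state = "inline cache"
```
&lt;/pre&gt;With a single method shared by several classes, for example &lt;code&gt;CallSite(lambda cls: len)&lt;/code&gt;, calling the site with a list and then a string leaves two entries that point at the same method, as in the sequence example earlier. Calls with a fourth distinct class move the site to the megamorphic state.&lt;br /&gt;&lt;br /&gt;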
In total, the PIC generation code uses one of four stubs for getting the object's class, and the right stub to use is determined by looking at the cached classes (see the function &lt;code&gt;determine_inline_cache_type()&lt;/code&gt; in &lt;code&gt;inline_cache.cpp&lt;/code&gt;).&lt;br /&gt;&lt;br /&gt;Once the stub has been generated, it is installed at the call site and invoked. This way, a PIC miss is transparent to the caller.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Patching call sites&lt;/h3&gt;&lt;br /&gt;The code for getting and setting jump targets from jump and call instructions is of course CPU-specific; see the &lt;code&gt;get_call_target()&lt;/code&gt; and &lt;code&gt;set_call_target()&lt;/code&gt; functions, and note that on PowerPC, the instruction cache needs to be flushed manually:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.hpp;hb=HEAD"&gt;cpu-x86.hpp&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.hpp;hb=HEAD"&gt;cpu-ppc.hpp&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Recall that there are two situations in which the PIC miss code can be invoked:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For a cold call site, to generate the initial stub&lt;/li&gt;&lt;li&gt;For a polymorphic call site, to add a new entry to the PIC&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;If the call site is a non-tail call, then the return address will be pushed on the stack (x86) or stored in a link register (PowerPC). If the call site is a tail call, then Factor's code generator now moves the current instruction pointer into a register before performing the tail call. This code is generated even when tail-calling words that are not generic; it is a no-op in this case. 
This is only a single integer load, so other than increasing code size it really has no effect on performance, since on a modern out-of-order CPU the load and jump will execute concurrently. For example, here is how the code snippet &lt;code&gt;beer get length&lt;/code&gt; compiles on x86:&lt;br /&gt;&lt;pre&gt;0x000000010b8628c0: mov    $0x10b8628c0,%r8&lt;br /&gt;0x000000010b8628ca: pushq  $0x20&lt;br /&gt;0x000000010b8628cf: push   %r8&lt;br /&gt;0x000000010b8628d1: sub    $0x8,%rsp&lt;br /&gt;0x000000010b8628d5: add    $0x10,%r14&lt;br /&gt;0x000000010b8628d9: mov    $0x10a6c8186,%rax&lt;br /&gt;0x000000010b8628e3: mov    %rax,-0x8(%r14)&lt;br /&gt;0x000000010b8628e7: mov    $0x100026c60,%rax&lt;br /&gt;0x000000010b8628f1: mov    (%rax),%rcx&lt;br /&gt;0x000000010b8628f4: mov    %rcx,(%r14)&lt;br /&gt;0x000000010b8628f7: callq  0x10b1e0de0&lt;br /&gt;0x000000010b8628fc: add    $0x18,%rsp&lt;br /&gt;0x000000010b862900: mov    $0x10b86290f,%rbx&lt;br /&gt;0x000000010b86290a: jmpq   0x10b70f4a0&lt;/pre&gt;&lt;br /&gt;The first few instructions set up a call stack frame, then the &lt;code&gt;beer&lt;/code&gt; symbol is pushed on the stack, and the (inlined) definition of &lt;code&gt;get&lt;/code&gt; follows. Finally, the address of the jump instruction is loaded into the &lt;code&gt;rbx&lt;/code&gt; register and a jump is performed. Initially, the jump points at the PIC miss stub, so the first time this machine code executes, the jump instruction's target will be changed to a newly-generated PIC.&lt;br /&gt;&lt;br /&gt;However, the &lt;code&gt;inline_cache_miss()&lt;/code&gt; function in the VM takes the call site address as an input parameter, and after all, it is written in C++. Where does this input parameter actually come from? The answer is that there are short assembly stubs that sit between the actual call site and the PIC miss code in the VM. The stubs get the return address in a CPU-specific manner, and then call the VM. 
This code is in the &lt;code&gt;primitive_inline_cache_miss&lt;/code&gt; procedure, which is again defined in several places:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.32.S;hb=HEAD"&gt;cpu-x86.32.S&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.64.S;hb=HEAD"&gt;cpu-x86.64.S&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S;hb=HEAD"&gt;cpu-ppc.S&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Note that the functions have two entry points. The first entry point is taken for non-tail call sites, the second one is taken for tail call sites.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;An optimization&lt;/h3&gt;&lt;br /&gt;Since allocation of PICs is something that happens rather frequently, it is good to optimize this operation. I implemented segregated free lists for the code heap. Allocating lots of small code heap blocks requires less work now. Also, since every PIC is referenced from at most one call site, when a call site's PIC is regenerated with a new set of cached classes, the old PIC can be returned to the free list immediately. This reduces the frequency of full garbage collections (when the code heap fills up, a full GC must be performed; there is no generational tenuring for code blocks).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Megamorphic caching&lt;/h3&gt;&lt;br /&gt;If a call site is called with more than a few distinct classes of instances (the default maximum PIC size is 3 entries) then it is patched to call the megamorphic dispatch stub for that generic word. Megamorphic stubs are again generated in the VM, in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/dispatch.cpp;hb=HEAD"&gt;dispatch.cpp&lt;/a&gt; source file. 
The machine code that is generated, were it to be expressed in pseudo-C, looks like this:&lt;br /&gt;&lt;pre&gt;void *obj = datastack[index];&lt;br /&gt;void *class = class_of(obj);&lt;br /&gt;if(cache[class] == class) goto cache[class + 1];&lt;br /&gt;else megamorphic_cache_miss(class,generic,cache);&lt;/pre&gt;&lt;br /&gt;Every generic word has a fixed-size megamorphic cache of 32 entries (this can be changed by bootstrapping a new image, but I'm not sure there's any point). It works like a hashtable, except if there is a collision, the old entry is simply evicted; there is no bucket chaining or probing strategy here. While less efficient than a polymorphic inline cache hit (which only entails a direct jump), megamorphic cache hits are still rather quick (some arithmetic and an indirect jump). There are no calls into the VM; it is all inline assembly. A megamorphic inline cache miss calls the &lt;code&gt;mega_cache_miss&lt;/code&gt; primitive which computes the correct method and updates the cache.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Performance&lt;/h3&gt;&lt;br /&gt;I added some instrumentation code to the VM which counts the number of cache hits and misses. The majority of call sites are monomorphic or polymorphic, and megamorphic call sites are very rare. Megamorphic cache misses are rarer still.&lt;br /&gt;&lt;br /&gt;The performance gain for benchmark runtimes was as I expected; a 5-10% gain. The old method dispatch system was already pretty efficient, and modern CPUs have branch prediction which helped a lot with it. For compile time, the gain was a lot more drastic, however, and definitely made the effort I put into implementing PICs pay off; bootstrap is now a full minute faster than before, clocking in at around 3 minutes, and the &lt;code&gt;load-all&lt;/code&gt; word, which loads and compiles all the libraries in the distribution, used to take 25 minutes and now takes 11 minutes. 
All timings are from my MacBook Pro.&lt;br /&gt;&lt;br /&gt;The compile-time gain is not a direct result of the inline caching, but rather stems from the fact that the compiler has to compile less code. With the old method dispatch implementation, every generic word was an ordinary word under the hood, with a huge, auto-generated body containing conditionals and jump tables. When a method was added or removed, the dispatcher code had to be re-generated and re-compiled, which takes a long time. With polymorphic inline caching, what happens is now the VM essentially lazily generates just those parts of the dispatch code which are actually used, and since it uses the VM's JIT code instead of the optimizing compiler, it can just glue bits of machine code together instead of building an IR and optimizing it; which yielded no benefits for the type of code that appeared in a generic word body anyway.&lt;br /&gt;&lt;br /&gt;Improving compile time is one of my major goals for Factor 1.0, since it improves user experience when doing interactive development, and I'm very happy with the progress so far.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-7353105402885558994?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/7353105402885558994/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=7353105402885558994' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7353105402885558994'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/7353105402885558994'/><link rel='alternate' type='text/html' 
href='http://factor-language.blogspot.com/2009/05/factors-implementation-of-polymorphic.html' title='Factor&apos;s implementation of polymorphic inline caching'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-122066567209030778</id><published>2009-05-22T19:38:00.002-04:00</published><updated>2009-05-22T19:59:19.734-04:00</updated><title type='text'>Live status display for Factor's build farm, and other improvements</title><content type='html'>Lately I've made a few additional improvements to the build infrastructure. First, build reports sent to the &lt;a href="https://lists.sourceforge.net/lists/listinfo/factor-builds"&gt;Factor-builds mailing list&lt;/a&gt; are now formatted as HTML, which makes them a bit more readable. Second, there is a Twitter feed of binary uploads: &lt;a href="http://twitter.com/FactorBuilds"&gt;@FactorBuilds&lt;/a&gt;. Finally, we now have a live status display for build machines. Previously if I wanted to know what a build machine was doing, I'd have to log in with ssh and poke around -- total waste of time. Now, just clicking on the OS/CPU combination on &lt;a href="http://factorcode.org"&gt;factorcode.org&lt;/a&gt; takes you to a status page. For example, here is the status page for &lt;a href="http://builds.factorcode.org/download?os=macosx&amp;cpu=x86.32"&gt;Mac OS X/x86&lt;/a&gt;. At the time of this blog post, I see that this machine is running tests for a certain GIT revision. Each GIT revision is a link to the &lt;a href="http://github.com/slavapestov/factor"&gt;GitHub mirror of the Factor repository&lt;/a&gt; so you can see exactly what's going on. 
Finally, the latest build report can be viewed too; this has the build failures, if any, as well as benchmark timings.&lt;br /&gt;&lt;br /&gt;Our continuous build system is called &lt;a href="http://docs.factorcode.org/content/vocab-mason.html"&gt;mason&lt;/a&gt; and adds up to a total of 1196 lines of code. Eduardo Cavazos wrote the first iteration, called &lt;code&gt;builder&lt;/code&gt;. I forked it and added additional features.&lt;br /&gt;&lt;br /&gt;It's been over a year since we started using continuous integration in the Factor project, and I can say it's been an overwhelming success. The first iteration of the build system would load all libraries and run all unit tests, and send an email with the results. If everything succeeded, it would upload a binary package. Over the last year, &lt;a href="http://downloads.factorcode.org"&gt;more than 5000 binary packages were uploaded&lt;/a&gt;. Over time, we added more checks to the build farm. It now runs &lt;a href="http://docs.factorcode.org/content/article-help.lint.html"&gt;help lint&lt;/a&gt; checks which ensure that examples in documentation work and that there are no broken links, and checks for &lt;a href="http://docs.factorcode.org/content/article-compiler-errors.html"&gt;compiler errors&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I think the code quality has definitely gone up over the last year; having a dozen machines running tests all the time does a good job of triggering obscure bugs, and the automatic upload of binaries when tests pass saves us from the hassle of making manual releases.&lt;br /&gt;&lt;br /&gt;I think pretty soon, we're going to start having releases again, in addition to continuous builds. I intend on making the release process semi-automatic; when I decide to do a release, I want to send some type of command to all the build machines to have them build a given GIT ID and upload the binaries to a special directory. 
The goal is to have regular releases until 1.0, starting from a few weeks from now. I'm not going to commit to a schedule for 1.0 yet, but having regular releases and published change logs is the first step of the process.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-122066567209030778?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/122066567209030778/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=122066567209030778' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/122066567209030778'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/122066567209030778'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/05/live-status-display-for-factors-build.html' title='Live status display for Factor&apos;s build farm, and other improvements'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-2925295840157439666</id><published>2009-05-21T20:25:00.006-04:00</published><updated>2009-05-22T17:12:17.463-04:00</updated><title type='text'>Unboxed tuple arrays in Factor</title><content type='html'>One difference between high level languages and systems languages is that high level languages typically hide the notion of a pointer, and by extension, value versus reference semantics, from the programmer. 
Systems languages make this explicit, and the programmer can make a choice whether or not to pass an object by value or by reference.&lt;br /&gt;&lt;br /&gt;In particular, this level of control can lead to significant space savings for large arrays of homogeneous data. If the elements of an array are themselves not shared by other structures, then storing values directly in the array will, at the very least, save on the space usage of a pointer, together with an object header.&lt;br /&gt;&lt;br /&gt;I don't believe there is any reason for high-level languages to not offer more fine-grained control over heap storage. Factor is dynamically typed but it offers both &lt;a href="http://factor-language.blogspot.com/2008/12/arrays-of-unboxed-primitive-values-and.html"&gt;arrays of unboxed numeric values&lt;/a&gt;, and arrays of tuples, which I will describe below. In some cases, using these special-purpose data types can offer significant performance and space gains over polymorphic arrays.&lt;br /&gt;&lt;br /&gt;Consider a Factor tuple definition:&lt;br /&gt;&lt;pre&gt;TUPLE: point x y ;&lt;/pre&gt;&lt;br /&gt;We can find out how much space points use in memory -- this is on a x86-64 machine:&lt;br /&gt;&lt;pre&gt;( scratchpad ) point new size .&lt;br /&gt;32&lt;/pre&gt;&lt;br /&gt;The x and y slots are as wide as a machine pointer, so together they use 16 bytes. This means that there is an additional 16-byte overhead for every instance. Indeed, Factor tuples store a header and a pointer to a layout object (shared by all instances of that class) in addition to the slot data itself. 
While this compares favorably to other dynamic languages (&lt;a href="http://eigenclass.org/R2/writings/object-size-ruby-ocaml"&gt;see this blog post for instance&lt;/a&gt;; Ruby in particular has really terrible object representation), it still adds up to a significant overhead when you have a large collection of points.&lt;br /&gt;&lt;br /&gt;Suppose you have an &lt;a href="http://docs.factorcode.org/content/article-arrays.html"&gt;array&lt;/a&gt; of N points. Each array element is a pointer (8 bytes) to a point object (32 bytes). Assuming the points are not referenced from any other collection, that's 40*N bytes for the entire collection. However, there is only 16*N bytes of real data, namely the x and y slots of every element, so we're wasting more than 50% of heap usage here, not to mention increasing GC overhead by allocating tons of small objects.&lt;br /&gt;&lt;br /&gt;This overhead can be avoided, however, using Factor's &lt;a href="http://docs.factorcode.org/content/vocab-tuple-arrays.html"&gt;tuple-arrays&lt;/a&gt; library. Tuple arrays implement the same &lt;a href="http://docs.factorcode.org/content/article-sequence-protocol.html"&gt;sequence protocol&lt;/a&gt; as arrays, but the semantics differ. Instead of storing references to objects, they store object slots directly within the array. There are two main restrictions that result from this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Tuple arrays are homogeneous; all elements are of the exact same type. This rules out setting some elements to &lt;code&gt;f&lt;/code&gt; to represent missing data, or having elements which are instances of different subclasses of the same base class.&lt;/li&gt;&lt;li&gt;Tuple arrays implement value semantics. If you get an element out, and mutate it, you have to store it back into the array to update the original value in the array.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So how do tuple arrays work? 
The original implementation, by &lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg&lt;/a&gt; defined a single &lt;code&gt;tuple-array&lt;/code&gt; class where the constructor took a class word as a parameter, and reflection APIs were used to implement getting and setting elements. While this had all the space savings described above, constructing and deconstructing tuples via reflection is not the fastest thing in the world; even the most efficient possible implementation would require some indirection and type checking in order to work.&lt;br /&gt;&lt;br /&gt;A better approach is to generate a new sequence type for each element type. This is what the current implementation does. Because all of Factor's sequence operations are written in terms of a sequence protocol, these generated classes are for the most part transparent to the programmer, and there is no added inconvenience over the previous approach. To use it, you invoke a parsing word with the element type class you want to use:&lt;br /&gt;&lt;pre&gt;TUPLE-ARRAY: point&lt;/pre&gt;&lt;br /&gt;Assuming a class named &lt;code&gt;point&lt;/code&gt; was previously defined, this generates a &lt;code&gt;point-array&lt;/code&gt; class with associated methods.&lt;br /&gt;&lt;br /&gt;Briefly, the implementation works as follows. The underlying storage of a tuple array is an ordinary array. If the tuple class has N slots, then every group of N elements in the underlying array is a single element in the tuple array. The &lt;code&gt;nth&lt;/code&gt; and &lt;code&gt;set-nth&lt;/code&gt; methods, which are generated once for every tuple array, read and write a group of N elements and package them into a tuple.&lt;br /&gt;&lt;br /&gt;Now let's dive into the guts of the implementation. The parsing word is defined as follows:&lt;br /&gt;&lt;pre&gt;SYNTAX: TUPLE-ARRAY: scan-word define-tuple-array ;&lt;/pre&gt;&lt;br /&gt;So it delegates all the hard work to a &lt;code&gt;define-tuple-array&lt;/code&gt; word. 
This word is a functor, so tuple arrays work the same as specialized numeric arrays. Functors are similar to C++ templates in spirit, but really they are implemented entirely as a library, and it's all just syntax sugar for parsing word machinery (which is a lot more powerful than C++ templates). They're named "functors" to annoy category theory fanboys and language purists. A functor definition begins with &lt;code&gt;FUNCTOR:&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;FUNCTOR: define-tuple-array ( CLASS -- )&lt;/pre&gt;&lt;br /&gt;A functor is essentially the same as a word defined with &lt;code&gt;::&lt;/code&gt;, but with additional syntax sugar. So &lt;code&gt;CLASS&lt;/code&gt; here is a local variable available in the body of the functor. I use the convention of uppercase names for functor parameters. What follows &lt;code&gt;FUNCTOR:&lt;/code&gt; is a list of clauses which create new words at parse time. The left-hand side of each clause is a new local variable name that the generated word will be bound to, and the right-hand side is an &lt;a href="http://docs.factorcode.org/content/vocab-interpolate.html"&gt;interpolate&lt;/a&gt; form describing what the new word should be named. The middle operator specifies whether the word should already exist (&lt;code&gt;IS&lt;/code&gt;) or should be defined (&lt;code&gt;DEFINES&lt;/code&gt;, or &lt;code&gt;DEFINES-CLASS&lt;/code&gt; for a class word). So the definition of &lt;code&gt;define-tuple-array&lt;/code&gt; proceeds as follows:&lt;br /&gt;&lt;pre&gt;CLASS IS ${CLASS}&lt;br /&gt;&lt;br /&gt;CLASS-array DEFINES-CLASS ${CLASS}-array&lt;br /&gt;CLASS-array? 
IS ${CLASS-array}?&lt;br /&gt;&lt;br /&gt;&amp;lt;CLASS-array&gt; DEFINES &amp;lt;${CLASS}-array&gt;&lt;br /&gt;&gt;CLASS-array DEFINES &gt;${CLASS}-array&lt;/pre&gt;&lt;br /&gt;Next, a functor definition calls &lt;code&gt;WHERE&lt;/code&gt;, which switches the parser to reading the functor body.&lt;br /&gt;&lt;br /&gt;After that, what follows is a series of forms which look like word definitions, but really they parse as code which defines words at the time that the functor is executed. Both the word names and bodies may reference local variables defined in the first part of the functor:&lt;br /&gt;&lt;pre&gt;WHERE&lt;br /&gt;&lt;br /&gt;TUPLE: CLASS-array&lt;br /&gt;{ seq array read-only }&lt;br /&gt;{ n array-capacity read-only }&lt;br /&gt;{ length array-capacity read-only } ;&lt;br /&gt;&lt;br /&gt;: &amp;lt;CLASS-array&gt; ( length -- tuple-array )&lt;br /&gt;    [ \ CLASS [ tuple-prototype &amp;lt;repetition&gt; concat ] [ tuple-arity ] bi ] keep&lt;br /&gt;    \ CLASS-array boa ; inline&lt;br /&gt;&lt;br /&gt;M: CLASS-array length length&gt;&gt; ;&lt;br /&gt;&lt;br /&gt;M: CLASS-array nth-unsafe tuple-slice \ CLASS read-tuple ;&lt;br /&gt;&lt;br /&gt;M: CLASS-array set-nth-unsafe tuple-slice \ CLASS write-tuple ;&lt;br /&gt;&lt;br /&gt;M: CLASS-array new-sequence drop &amp;lt;CLASS-array&gt; ;&lt;br /&gt;&lt;br /&gt;: &gt;CLASS-array ( seq -- tuple-array ) 0 &amp;lt;CLASS-array&gt; clone-like ;&lt;br /&gt;&lt;br /&gt;M: CLASS-array like drop dup CLASS-array? 
[ &gt;CLASS-array ] unless ;&lt;br /&gt;&lt;br /&gt;INSTANCE: CLASS-array sequence&lt;/pre&gt;&lt;br /&gt;So every time someone calls &lt;code&gt;define-tuple-array&lt;/code&gt; (most likely with the &lt;code&gt;TUPLE-ARRAY:&lt;/code&gt; parsing word) a new class is defined, together with a constructor word and some methods.&lt;br /&gt;&lt;br /&gt;Finally, we end the functor with:&lt;br /&gt;&lt;pre&gt;;FUNCTOR&lt;/pre&gt;&lt;br /&gt;To illustrate, this means that the following:&lt;br /&gt;&lt;pre&gt;TUPLE-ARRAY: point&lt;/pre&gt;&lt;br /&gt;Is equivalent to the following:&lt;br /&gt;&lt;pre&gt;TUPLE: point-array&lt;br /&gt;{ seq array read-only }&lt;br /&gt;{ n array-capacity read-only }&lt;br /&gt;{ length array-capacity read-only } ;&lt;br /&gt;&lt;br /&gt;: &amp;lt;point-array&gt; ( length -- tuple-array )&lt;br /&gt;    [ \ point [ tuple-prototype &amp;lt;repetition&gt; concat ] [ tuple-arity ] bi ] keep&lt;br /&gt;    \ point-array boa ; inline&lt;br /&gt;&lt;br /&gt;M: point-array length length&gt;&gt; ;&lt;br /&gt;&lt;br /&gt;M: point-array nth-unsafe tuple-slice \ point read-tuple ;&lt;br /&gt;&lt;br /&gt;M: point-array set-nth-unsafe tuple-slice \ point write-tuple ;&lt;br /&gt;&lt;br /&gt;M: point-array new-sequence drop &amp;lt;point-array&gt; ;&lt;br /&gt;&lt;br /&gt;: &gt;point-array ( seq -- tuple-array ) 0 &amp;lt;point-array&gt; clone-like ;&lt;br /&gt;&lt;br /&gt;M: point-array like drop dup point-array? [ &gt;point-array ] unless ;&lt;br /&gt;&lt;br /&gt;INSTANCE: point-array sequence&lt;/pre&gt;&lt;br /&gt;Now imagine writing all of the above code out every time you want a specialized tuple array; that would amount to boilerplate. Of course, at this point, it still looks like runtime reflection is happening, because we're passing the &lt;code&gt;point&lt;/code&gt; class word to the &lt;code&gt;read-tuple&lt;/code&gt; and &lt;code&gt;write-tuple&lt;/code&gt; words. 
However, what actually happens is that these words get inlined, then the fact that the class parameter is a constant triggers a series of macro expansions and constant folding optimizations that boil the code down to an efficient, specialized routine for packing and unpacking points from the array. If you're interested in the gory details, take a look at the &lt;a href="http://github.com/slavapestov/factor/blob/483c936eb335c179aec59ec6d248a7f472584e01/basis/tuple-arrays/tuple-arrays.factor"&gt;tuple arrays source code&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So the space savings are clearly nice to have, but what about performance? Here is a program that uses polymorphic arrays:&lt;br /&gt;&lt;pre&gt;TUPLE: point { x float } { y float } ;&lt;br /&gt;&lt;br /&gt;: polymorphic-array-benchmark ( -- )&lt;br /&gt;    5000 [ point new ] replicate&lt;br /&gt;    1000 [&lt;br /&gt;        [&lt;br /&gt;            [ 1+ ] change-x&lt;br /&gt;            [ 1- ] change-y&lt;br /&gt;        ] map&lt;br /&gt;    ] times drop ;&lt;/pre&gt;&lt;br /&gt;Here is the same program using tuple arrays:&lt;br /&gt;&lt;pre&gt;TUPLE: point { x float } { y float } ;&lt;br /&gt;&lt;br /&gt;TUPLE-ARRAY: point&lt;br /&gt;&lt;br /&gt;: tuple-array-benchmark ( -- )&lt;br /&gt;    5000 &amp;lt;point-array&gt; &lt;br /&gt;    1000 [&lt;br /&gt;        [&lt;br /&gt;            [ 1+ ] change-x&lt;br /&gt;            [ 1- ] change-y&lt;br /&gt;        ] map&lt;br /&gt;    ] times drop ;&lt;/pre&gt;&lt;br /&gt;On my MacBook Pro, the first version using polymorphic arrays runs in 0.39 seconds, the second version using tuple arrays runs in 0.91 seconds. Clearly, there is some overhead from copying the objects in and out of the array.&lt;br /&gt;&lt;br /&gt;Now, let's try a slightly different program. Instead of mutating the elements of the array, let's instead extract the slots, modify them, and make a new tuple. 
Here is a version using polymorphic arrays:&lt;br /&gt;&lt;pre&gt;TUPLE: point { x float read-only } { y float read-only } ;&lt;br /&gt;&lt;br /&gt;: polymorphic-array-benchmark ( -- )&lt;br /&gt;    5000 [ point new ] replicate&lt;br /&gt;    1000 [&lt;br /&gt;        [&lt;br /&gt;            [ x&gt;&gt; 1+ ] [ y&gt;&gt; 1- ] bi &amp;lt;point&gt;&lt;br /&gt;        ] map&lt;br /&gt;    ] times drop ;&lt;/pre&gt;&lt;br /&gt;This version runs in 0.94 seconds; the additional object allocation induces a roughly 2.4x overhead over mutating the points in place (0.39 seconds). Now, let's try the same thing with tuple arrays:&lt;br /&gt;&lt;pre&gt;TUPLE: point { x float read-only } { y float read-only } ;&lt;br /&gt;&lt;br /&gt;TUPLE-ARRAY: point&lt;br /&gt;&lt;br /&gt;: tuple-array-benchmark ( -- )&lt;br /&gt;    5000 &amp;lt;point-array&gt; &lt;br /&gt;    1000 [&lt;br /&gt;        [&lt;br /&gt;            [ x&gt;&gt; 1+ ] [ y&gt;&gt; 1- ] bi &amp;lt;point&gt;&lt;br /&gt;        ] map&lt;br /&gt;    ] times drop ;&lt;/pre&gt;&lt;br /&gt;For this variant I'm getting runtimes of 0.59 seconds, which is faster than the same code with polymorphic arrays. The reason for the speed boost is that the Factor compiler's &lt;a href="http://factor-language.blogspot.com/2008/08/algorithm-for-escape-analysis.html"&gt;escape analysis&lt;/a&gt; pass kicks in, eliminating all tuple allocation from within the loop completely.&lt;br /&gt;&lt;br /&gt;So to summarize, in order from fastest to slowest according to the timings above:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Polymorphic arrays with in-place mutation (0.39 seconds)&lt;/li&gt;&lt;li&gt;Tuple arrays with read-only tuples (0.59 seconds)&lt;/li&gt;&lt;li&gt;Tuple arrays with in-place mutation (0.91 seconds)&lt;/li&gt;&lt;li&gt;Polymorphic arrays with read-only tuples (0.94 seconds)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Finally, a note about benchmarking methodology. 
Because garbage collection can add an element of unpredictability to benchmarks, I force the GC to run first:&lt;br /&gt;&lt;pre&gt;( scratchpad ) gc [ tuple-array-benchmark ] time&lt;br /&gt;== Running time ==&lt;br /&gt;&lt;br /&gt;0.593989 seconds&lt;/pre&gt;&lt;br /&gt;The combination of a high-level language with efficient abstractions that the compiler knows how to optimize is very powerful. Factor occupies a middle ground between the dynamic languages and systems languages. I want Factor to be a language that you can write an entire application in, even the performance-critical parts. Often you hear that "performance doesn't matter", but outside of a few I/O-bound problem domains, this is false (and even there, you want to heavily optimize your I/O and text processing routines). Very often, people cite some variation of the "80/20 performance rule" (20% of your program runs 80% of the time; adjust percentages to suit your strawman). Of course, the idea is that you can write the 80% in a high level language, dropping down to C for the performance-critical 20%. I think this reasoning is flawed, for two reasons:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;As my friend Cameron Zwarich likes to point out, very little justification is given for the 80/20 rule itself. For many programs, the performance profile is rather flat, and every indirection, every unnecessary runtime check, and every unnecessary runtime memory allocation adds up to a non-trivial amount of overhead. The "performance bottleneck" myth originated in a survey of Fortran programs which did numerical calculations. 
For numerics, it is indeed true that most of the runtime is spent in a handful of inner loops; whether or not this is true for ordinary programs is far less clear.&lt;/li&gt;&lt;li&gt;Even if runtime really is dominated by a small number of routines, chances are these routines are algorithmically complex, hard to write, and nice to experiment with interactively; precisely the domain where high-level languages with advanced abstractions shine.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;I think it's about time dynamic language advocates stopped citing the same old myths to justify poor implementation techniques. Their energy would be better spent researching compiler optimizations and garbage collection techniques which they can apply in their favorite language implementations instead. Hats off to the &lt;a href="http://code.google.com/p/v8/"&gt;V8&lt;/a&gt;, &lt;a href="http://trac.webkit.org/wiki/SquirrelFish"&gt;SquirrelFish&lt;/a&gt;, &lt;a href="http://code.google.com/p/unladen-swallow/"&gt;Unladen Swallow&lt;/a&gt; and &lt;a href="http://sbcl.org"&gt;SBCL&lt;/a&gt; projects for taking performance seriously.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2925295840157439666?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2925295840157439666/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2925295840157439666' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2925295840157439666'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2925295840157439666'/><link rel='alternate' type='text/html' 
href='http://factor-language.blogspot.com/2009/05/unboxed-tuple-arrays-in-factor.html' title='Unboxed tuple arrays in Factor'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-504950899931887926</id><published>2009-05-08T14:06:00.008-04:00</published><updated>2009-05-23T13:37:35.317-04:00</updated><title type='text'>Factor VM ported to C++</title><content type='html'>Most of Factor is implemented in Factor (the parser, data structures, optimizing compiler, input/output, the UI toolkit, etc) however a few bits and pieces are not completely self hosting. This includes the garbage collector, non-optimizing compiler (used for bootstrap), and various runtime services. This code comprises what I call the "Factor VM". Even though it is not a Virtual Machine in the classical sense, since there's no interpreter or bytecode format, it serves a similar role to the VM of scripting languages.&lt;br /&gt;&lt;br /&gt;Ever since Factor lost its JVM dependency in mid-2004, the VM has been written in C with some GNU extensions. Over time, I've noticed that a few things in the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=vm;hb=HEAD"&gt;VM source code&lt;/a&gt; seem to be hard to express and error-prone in straight C. I've been thinking about switching to C++ for a long time, and now I've finally decided to bite the bullet and do it. I'm pleased with the result, and other than a longer compile time for the VM, I don't have any complaints so far.&lt;br /&gt;&lt;br /&gt;I started off by first getting the existing C code to compile with the GNU C++ compiler. 
Since C++ is almost, but not quite, a superset of C, I had to make a few changes to the source:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Renaming a few identifiers (I had local variables named &lt;code&gt;class&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt; and &lt;code&gt;new&lt;/code&gt; in some places)&lt;/li&gt;&lt;li&gt;Fixing some type safety issues (C will implicitly cast &lt;code&gt;void*&lt;/code&gt; to any other pointer type, whereas C++ will not)&lt;/li&gt;&lt;li&gt;Moving global variable declarations to source files, and changing the declarations in the headers to be &lt;code&gt;extern&lt;/code&gt;&lt;/li&gt;&lt;li&gt;Making a few functions &lt;code&gt;extern "C"&lt;/code&gt; since they are called from code generated by the Factor compiler&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Once this was done and everything was working, I started the process of refactoring my C pain points using C++ language features and idioms. This process is far from complete; at the end of the blog post I will outline what remains to be done.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Miscellaneous improvements&lt;/h3&gt;&lt;br /&gt;While the Factor VM is pretty small, around 13,000 lines of code right now, being able to use C++ namespaces is still nice since I've had name clashes with system C headers in the past. Now everything is wrapped in a &lt;code&gt;factor&lt;/code&gt; namespace, so this should occur less in the future. Also, I changed a bunch of &lt;code&gt;#define&lt;/code&gt; macros to inline functions and constants.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Tagged pointers&lt;/h3&gt;&lt;br /&gt;Factor is dynamically typed. While the optimizing compiler sometimes infers types and eliminates type checks, it generates machine code directly and mostly sidesteps the VM entirely. As far as the VM is concerned, it must be possible to determine the type of a value from its runtime representation. 
There are many approaches to doing this; the one I've been using from the start is "tagged pointers".&lt;br /&gt;&lt;br /&gt;Here is how it works: in the Factor VM, objects are always allocated on 8-byte boundaries in the data heap. This means every address is a multiple of 8, or equivalently, the least significant 3 bits of a raw address are zero. I take advantage of this to store some type information in these bits. A related trick that the Factor VM does is to reserve a tag for small integers that fit in a pointer; so a 29-bit integer (on 32-bit platforms) or a 61-bit integer (on 64-bit platforms) does not need to be heap-allocated.&lt;br /&gt;&lt;br /&gt;With 3 tag bits, we get a total of 8 unique tags. Factor has more than 8 data types, though, and indeed the user can define their own data types too. So there are two tags used to denote that the data type is actually stored in the object's header. One of these is used for built-in VM types that don't have a unique tag number, another one is used for user-defined &lt;a href="http://docs.factorcode.org/content/article-tuples.html"&gt;tuple class&lt;/a&gt; instances.&lt;br /&gt;&lt;br /&gt;Since C and C++ do not have native support for tagged pointers, the VM needs a mechanism to convert tagged pointers to untagged pointers, and vice versa. When converting a tagged pointer to an untagged pointer, a type check is performed.&lt;br /&gt;&lt;br /&gt;In the C VM, I had code like the following:&lt;br /&gt;&lt;pre&gt;#define TAG_MASK 7&lt;br /&gt;#define UNTAG(tagged) (tagged &amp; ~TAG_MASK)&lt;br /&gt;&lt;br /&gt;typedef struct { ... 
} F_ARRAY;&lt;br /&gt;&lt;br /&gt;#define ARRAY_TYPE 2&lt;br /&gt;&lt;br /&gt;F_ARRAY *untag_array(CELL tagged)&lt;br /&gt;{&lt;br /&gt;    type_check(ARRAY_TYPE,tagged);&lt;br /&gt;    return (F_ARRAY *)UNTAG(tagged);&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;Here, &lt;code&gt;CELL&lt;/code&gt; is typedef'd to an unsigned integer as wide as &lt;code&gt;void*&lt;/code&gt;, and &lt;code&gt;UNTAG&lt;/code&gt; is a macro which untags a pointer by masking off the low 3 bits (7 decimal is 111 binary).&lt;br /&gt;&lt;br /&gt;To convert untagged pointers to tagged form, I used code like the following:&lt;br /&gt;&lt;pre&gt;#define RETAG(untagged,tag) (((CELL)(untagged) &amp; ~TAG_MASK) | (tag))&lt;br /&gt;&lt;br /&gt;CELL tag_array(F_ARRAY *untagged)&lt;br /&gt;{&lt;br /&gt;    return RETAG(untagged,ARRAY_TYPE);&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;For hi-tag types, such as strings, I used a single &lt;code&gt;tag_object()&lt;/code&gt; function:&lt;br /&gt;&lt;pre&gt;CELL tag_object(void *untagged)&lt;br /&gt;{&lt;br /&gt;    return RETAG(untagged,OBJECT_TYPE);&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;The &lt;code&gt;OBJECT_TYPE&lt;/code&gt; constant was misleadingly named, since everything in Factor is an object. However, in the VM it means "an object with type info in its header".&lt;br /&gt;&lt;br /&gt;The old C code for tagging and untagging pointers was mostly adequate, however it suffered from a few issues. First, there was a lot of boilerplate; each tag type has an untag/tag function pair, and each hi-tag type had an untag function. Furthermore, every time I changed a hi-tag type into a tag type, I had to find usages of &lt;code&gt;tag_object()&lt;/code&gt; and change them appropriately. Over time I developed several preprocessor macros to clean up this type of stuff but I was never happy with it.&lt;br /&gt;&lt;br /&gt;In C++, the whole thing came out much cleaner. 
The &lt;code&gt;UNTAG&lt;/code&gt; and &lt;code&gt;RETAG&lt;/code&gt; macros are still around (although I may change them to inline functions at some point). However, for tagging and untagging pointers, I use a single set of template functions. To tag an untagged pointer:&lt;br /&gt;&lt;pre&gt;template &amp;lt;typename T&gt; cell tag(T *value)&lt;br /&gt;{&lt;br /&gt; return RETAG(value,tag_for(T::type_number));&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;That's it; one function for all data types. It is used like so:&lt;br /&gt;&lt;pre&gt;array *a = ...;&lt;br /&gt;cell c = tag&amp;lt;array&gt;(a);&lt;/pre&gt;&lt;br /&gt;How does it work? The &lt;code&gt;F_ARRAY&lt;/code&gt; struct is now the &lt;code&gt;array&lt;/code&gt; class, and it has a static variable &lt;code&gt;type_number&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;struct array : public object&lt;br /&gt;{&lt;br /&gt;    static const cell type_number = 2;&lt;br /&gt;    ...&lt;br /&gt;};&lt;/pre&gt;&lt;br /&gt;How do hi-tag pointers get handled? Well, their type numbers are all greater than or equal to 8. The &lt;code&gt;tag_for()&lt;/code&gt; function checks for this, and returns the object tag number if that is the case. Since it is an inline function, the computation is constant-folded at compile time, and the generated code is as efficient as the old C version, only more type-safe and with less boilerplate.&lt;br /&gt;&lt;br /&gt;For untagging tagged pointers, I use a more complicated scheme. I have a templated "smart pointer" class representing a tagged pointer with an associated data type. 
Here is a simplified version of the class; this omits the type checking logic and a bunch of other utility methods; if you want the gory details, the full source is in &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/tagged.hpp;hb=HEAD"&gt;tagged.hpp&lt;/a&gt;.&lt;br /&gt;&lt;pre&gt;template &amp;lt;typename T&gt;&lt;br /&gt;struct tagged&lt;br /&gt;{&lt;br /&gt;    cell value_;&lt;br /&gt;    T *untagged() const { return (T *)(UNTAG(value_)); }&lt;br /&gt;    T *operator-&gt;() const { return untagged(); }&lt;br /&gt;    explicit tagged(T *untagged) : value_(factor::tag(untagged)) { }&lt;br /&gt;};&lt;/pre&gt;&lt;br /&gt;The overload of &lt;code&gt;operator-&gt;&lt;/code&gt; means that tagged pointers can be used like ordinary pointers in some cases without needing to be untagged first. Otherwise, a pointer stored in a &lt;code&gt;cell&lt;/code&gt; can be untagged like so:&lt;br /&gt;&lt;pre&gt;array *a = tagged&amp;lt;array&gt;(c).untagged()&lt;/pre&gt;&lt;br /&gt;I made a utility function to encapsulate this type of thing, for when I just want to untag a value once and not pass it around as a smart pointer:&lt;br /&gt;&lt;pre&gt;template &amp;lt;typename T&gt; T *untag(cell value)&lt;br /&gt;{&lt;br /&gt; return tagged&amp;lt;T&gt;(value).untagged();&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;Now I can write&lt;br /&gt;&lt;pre&gt;array *a = untag&amp;lt;array&gt;(c)&lt;/pre&gt;&lt;br /&gt;This compiles down to the same machine code as the C version but it is more type-safe and less verbose.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Registration of local variables with the garbage collector&lt;/h3&gt;&lt;br /&gt;Any object allocation can trigger a GC, and the GC must be able to identify all live values at that point. For compiled Factor code, this is pretty easy. The GC needs to scan a few globals, the data and retain stacks, as well as the call stack; call frames for Factor code have a known format. 
The compiler does store objects in registers, but it moves them to the stack before calling into the GC.&lt;br /&gt;&lt;br /&gt;However, the VM code itself uses the garbage collector to allocate its own internal state. This makes the VM code more robust since it doesn't need to worry about when to deallocate memory, but it presents a problem: if a C local variable points into the GC heap, how does the GC know that the value is live, and how does the GC know to update the pointer if the value got moved? (Factor's GC is a generational &lt;a href="http://en.wikipedia.org/wiki/Cheney's_algorithm"&gt;semi-space&lt;/a&gt; collector, so objects move around a lot.) Between function calls, compiled C code saves pointers in the call frame. However, since I have no control over the machine code gcc generates, I cannot force call stack frames into a certain format that the GC can understand, like I do with compiled Factor code.&lt;br /&gt;&lt;br /&gt;One solution is &lt;a href="http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)"&gt;conservative garbage collection&lt;/a&gt;. The garbage collector can just pessimistically assume that &lt;i&gt;any&lt;/i&gt; integer variable on the call stack might point into the GC heap. There are two problems with this approach. The first is that if you have a bit pattern that happens to look like a pointer, but is really just a random number, you might keep a value live longer than it is really needed. From what I've heard, this is mostly just a theoretical concern. A more pressing issue is that since you do not know what is a real pointer and what isn't, you cannot move objects that might be referenced from the call stack. This can lead to fragmentation and more expensive allocation routines.&lt;br /&gt;&lt;br /&gt;The technique I've been using until now in the C VM code was to have a pair of macros, &lt;code&gt;REGISTER_ROOT&lt;/code&gt; and &lt;code&gt;UNREGISTER_ROOT&lt;/code&gt;. 
These push and pop the address of a local variable on a "GC root stack". So by traversing the GC root stack, the garbage collector knows which call stack locations refer to objects in the GC heap, and which are just integers. The disadvantage of this approach is that pairing root registrations was verbose and error-prone, and forgetting to register a root before calling a function which may potentially allocate memory would lead to hard-to-discover bugs that would be triggered rarely and result in memory corruption.&lt;br /&gt;&lt;br /&gt;Using C++ language features, I instead implemented a "smart pointer" &lt;code&gt;gc_root&lt;/code&gt; class. It subclasses the &lt;code&gt;tagged&lt;/code&gt; smart pointer I described above, with the additional functionality in the constructor and destructor to register and unregister a pointer to the wrapped pointer as a GC root. Together with the overloaded &lt;code&gt;operator-&gt;&lt;/code&gt; this makes for much clearer and safer code. By updating existing code to use this new smart pointer I found a few places in the old C code where I had forgotten to register roots. No doubt one or two users saw Factor crash because of these. You can look at the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/local_roots.hpp;h=e074d999e7f221b17066a2ed8d8661d52b7919d4;hb=HEAD"&gt;source code&lt;/a&gt; for the GC root abstraction.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Object-oriented abstractions&lt;/h3&gt;&lt;br /&gt;Most functions in the VM are still top-level C-style functions, however in a few places I've found good reason to use methods and even inheritance. I've avoided virtual methods so far. One place where inheritance has really helped clean up some crufty old C code is in what I call the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/jit.cpp;hb=HEAD"&gt;JIT&lt;/a&gt;. 
This is a simple native code generator that the VM uses to fill in the gaps for the &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/compiler;hb=HEAD"&gt;optimizing compiler&lt;/a&gt; that's written in Factor. The JIT is used to generate machine code in the following situations:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/quotations.cpp;hb=HEAD"&gt;The non-optimizing compiler, used during bootstrap and to evaluate listener input&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/inline_cache.cpp;hb=HEAD"&gt;Polymorphic inline caching&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/profiler.cpp;hb=HEAD"&gt;The profiler&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The old design was built out of C macros and code duplication. Now there is a &lt;code&gt;jit&lt;/code&gt; class with various methods to emit machine code. Subclasses of &lt;code&gt;jit&lt;/code&gt;, such as &lt;code&gt;quotation_jit&lt;/code&gt; and &lt;code&gt;inline_cache_jit&lt;/code&gt;, add additional functionality. The constructor and destructor take care of registering and unregistering the objects used by the JIT with the GC.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Functionality moved out of the VM&lt;/h3&gt;&lt;br /&gt;Since I prefer programming in Factor over low-level languages such as C and C++, I try to keep the VM small. A couple of things were in the VM for historical reasons only. The complex number and rational number data types, for instance, were defined in the VM, even though operations on them were in Factor code. This is because the number tower predated user-defined tuple types by a year or so. Nowadays there is no reason not to use tuples to define these types, so I made the change, simplifying the VM. 
Another piece of functionality that no longer had a good reason to be in the VM was some vestigial code for converting Factor's Unicode string type to and from Latin-1 and UCS-2 encodings. Ever since Daniel Ehrenberg implemented the &lt;a href="http://docs.factorcode.org/content/article-io.encodings.html"&gt;io.encodings&lt;/a&gt; library in Factor, most string encoding and decoding was done in library code, and a lot more encodings were implemented: UTF-8, Shift-JIS, MacRoman, EBCDIC, etc. Only a couple of things still relied on the VM's broken built-in string encoding. They're now using the library just like everything else.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Code heap compaction&lt;/h3&gt;&lt;br /&gt;Factor allocates compiled machine code in its own heap, separate from the data heap. A &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_gc.cpp;hb=HEAD"&gt;mark/sweep/compact garbage collector&lt;/a&gt; manages the code heap. When performing a compaction, the code heap garbage collector needs to maintain an association from a code block's old address to its new address. Previously this would be stored in the code block's header, wasting 4 bytes (or 8 bytes on 64-bit platforms) per code block. Now I use &lt;a href="http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c15303__1/"&gt;std::tr1::unordered_map&lt;/a&gt; to do this. Of course I could've written a hashtable in C, or used an existing implementation, but using STL is more convenient.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Future improvements&lt;/h3&gt;&lt;br /&gt;I don't spend much time working on the Factor VM, since most of the action is in the compiler, library, and UI, which are all written in Factor and not C++. However, at some point I'm going to do more work on the garbage collector, and I will do some more cleanups at that time:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Eliminating usages of GNU C extensions. 
I want to be able to compile the Factor VM with other compilers, such as Microsoft's Visual Studio, and LLVM's Clang.&lt;/li&gt;&lt;li&gt;Eliminating global variables&lt;/li&gt;&lt;li&gt;Using &lt;code&gt;new&lt;/code&gt;/&lt;code&gt;delete&lt;/code&gt; instead of &lt;code&gt;malloc()&lt;/code&gt; and &lt;code&gt;free()&lt;/code&gt; for the small set of data that the VM allocates in unmanaged memory&lt;/li&gt;&lt;li&gt;Various other cleanups, such as making better use of method invocation and templates&lt;/li&gt;&lt;li&gt;There's some code duplication in the garbage collector I hope to address with C++ templates&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;A lot of these cleanups and improvements could have been achieved in C, with clever design and various preprocessor macro tricks. However, I think C++ offers a better set of abstractions for the problem domain that the Factor VM lives in, and since I only use the language features that have no runtime penalty (I'm not using C++ exceptions, virtual methods, or RTTI) there's no performance decrease either. I'm very happy with how the code cleanup turned out. 
Making the VM easier to understand and maintain will help with implementing more advanced garbage collection algorithms, among other things.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-504950899931887926?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/504950899931887926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=504950899931887926' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/504950899931887926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/504950899931887926'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/05/factor-vm-ported-to-c.html' title='Factor VM ported to C++'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-3022312706421649155</id><published>2009-04-18T18:42:00.005-04:00</published><updated>2009-04-19T04:02:47.955-04:00</updated><title type='text'>Recent UI improvements</title><content type='html'>Over the last few months I've done a lot of work on &lt;a href="http://docs.factorcode.org/content/article-ui.html"&gt;Factor's UI toolkit&lt;/a&gt;, and the developer tools built on top. Here is an overview of what's changed.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Code cleanup&lt;/h3&gt;&lt;br /&gt;I've been working on the UI since 2005, and it has accumulated a lot of cruft since then. 
Language features have come and gone and not enough refactoring was done over time. Last year I rewrote the HTTP server and compiler using the latest language features and the result was very clean code. The UI is now written in the same style.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improved font support&lt;/h3&gt;&lt;br /&gt;I've blogged about this for each platform:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/01/rendering-unicode-text-to-opengl.html"&gt;Mac OS X&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/04/rendering-text-on-windows-via-uniscribe.html"&gt;Windows&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://factor-language.blogspot.com/2009/03/rendering-unicode-text-with-pango-and.html"&gt;X11&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;All UI gadgets that render text now use the appropriate platform-specific mechanism for rendering text. The &lt;a href="http://docs.factorcode.org/content/vocab-ui.text.html"&gt;ui.text&lt;/a&gt; vocabulary provides the cross-platform abstraction layer. The &lt;a href="http://docs.factorcode.org/content/vocab-fonts.html"&gt;fonts&lt;/a&gt; vocabulary defines data types for fonts and font metrics. This replaces the ad-hoc, loosely-typed "font specifiers" that the UI formerly used, where a font was just a triple, for instance &lt;code&gt;{ "sans-serif" plain 12 }&lt;/code&gt;. Font metrics are now supported in a cross-platform manner too. All relevant gadgets now do &lt;a href="http://factor-language.blogspot.com/2009/02/font-metrics-and-baseline-alignment.html"&gt;baseline alignment&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Image support&lt;/h3&gt;&lt;br /&gt;The UI now supports an easy API for displaying images. 
The &lt;a href="http://docs.factorcode.org/content/vocab-ui.images.html"&gt;ui.images&lt;/a&gt; vocabulary defines some words which can be called from your gadget's &lt;a href="http://docs.factorcode.org/content/word-draw-gadget__star__%2cui.render.html"&gt;draw-gadget*&lt;/a&gt; method. The &lt;a href="http://docs.factorcode.org/content/vocab-ui.gadgets.icons.html"&gt;ui.gadgets.icons&lt;/a&gt; vocabulary defines a simple gadget that renders an image and nothing else. This is all built on top of the &lt;a href="http://docs.factorcode.org/content/vocab-images.html"&gt;images&lt;/a&gt; vocabulary, which implements BMP and TIFF loaders in pure Factor. Hopefully we'll get PNG and JPEG at some point too.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Editor gadget improvements&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-ui.gadgets.editors.html"&gt;ui.gadgets.editors&lt;/a&gt; vocabulary supports some new features:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Undo and redo (Control-z, Control-Shift-z)&lt;/li&gt;&lt;li&gt;Join lines (Control-j)&lt;/li&gt;&lt;li&gt;Character and word navigation now uses &lt;a href="http://docs.factorcode.org/content/vocab-unicode.breaks.html"&gt;unicode.breaks&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h3&gt;Table gadget replaces list gadget&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-ui.gadgets.tables.html"&gt;ui.gadgets.tables&lt;/a&gt; vocabulary implements a new gadget which is a replacement for the old list gadget. Its functionality is a superset of what lists offered. Tables are used extensively by the new developer tools to display completion popups and such. Unlike lists, which would create a gadget for every row of data, tables do not have any children and render rows directly. This reduces memory consumption. Tables take the list of rows from an underlying model, and place the currently selected row in another model. 
The latter model can then be wrapped in an arrow model to compute a new list of rows, which can then be the model for another table. This way, multiple tables can be chained together for a 'drill down navigation'-style of interaction with very little code.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Scroller gadget improvements&lt;/h3&gt;&lt;br /&gt;Scroller gadgets, defined in the &lt;a href="http://docs.factorcode.org/content/vocab-ui.gadgets.scrollers.html"&gt;ui.gadgets.scrollers&lt;/a&gt; vocabulary, now provide a protocol that their children can implement. This protocol has two generic words:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://docs.factorcode.org/content/word-pref-viewport-dim%2cui.gadgets.scrollers.html"&gt;pref-viewport-dim&lt;/a&gt; - this allows the child to specify the preferred dimensions of the scroller surrounding the gadget. Tables and editors implement this generic word, allowing you to set various slots, such as &lt;code&gt;min-rows&lt;/code&gt; and &lt;code&gt;max-rows&lt;/code&gt;, which control the size of the scroller surrounding the gadget in character units.&lt;/li&gt;&lt;li&gt;&lt;a href="http://docs.factorcode.org/content/word-viewport-column-header%2cui.gadgets.scrollers.html"&gt;viewport-column-header&lt;/a&gt; - constructs a gadget which is displayed at the top of the scroller. This allows the table gadget to display column titles which are always visible, regardless of the vertical scroll position.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Frame layout improvements&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-ui.gadgets.frames.html"&gt;ui.gadgets.frames&lt;/a&gt; vocabulary is more flexible now. 
Instead of limiting frame layouts to a 3x3 grid with the center item stretched to fill available space, the grid can be of any size, and an arbitrary cell can be designated to take up available space.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Models API changes&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://docs.factorcode.org/content/vocab-models.html"&gt;models&lt;/a&gt; vocabulary has undergone some cleanups. To avoid confusion with sequence functionality, filter models are now called arrow models. Compose models are now called product models, since really they represent a Cartesian product of functions, not a composition of anything at all. The new &lt;a href="http://docs.factorcode.org/content/vocab-models.arrow.smart.html"&gt;models.arrow.smart&lt;/a&gt; vocabulary wraps an arrow around a product whose arity is the arity of the arrow's quotation.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Integration between event loop and I/O&lt;/h3&gt;&lt;br /&gt;Until Factor supports real native threads, any FFI call blocks all Factor threads until the call completes. I/O is done with non-blocking APIs on Unix (epoll, kqueue) and Windows (IO completion ports), so socket operations can proceed concurrently. However, until recently, the UI had to poll for events, because a blocking call to wait for events would prevent I/O and other Factor threads from running. This meant that even if you weren't doing anything in the Factor UI, it would use some CPU, anywhere from 5% to 25%. This problem has been fixed. On Mac OS X, I use the &lt;a href="http://developer.apple.com/documentation/CoreFoundation/Reference/CFFileDescriptorRef/index.html"&gt;CFFileDescriptor&lt;/a&gt; API to wait on a &lt;code&gt;kqueue&lt;/code&gt; and GUI events at the same time; you can provide a callback which is invoked when file descriptors are ready for I/O, and call &lt;code&gt;[NSApplication run]&lt;/code&gt; as usual. 
On X11, I use &lt;code&gt;XConnectionNumber()&lt;/code&gt; to get a file descriptor out of a &lt;code&gt;Display&lt;/code&gt; and call &lt;a href="http://docs.factorcode.org/content/word-wait-for-fd%2cio.backend.unix.html"&gt;wait-for-fd&lt;/a&gt; to wait for events to arrive while also allowing other Factor threads to run. This solves the CPU usage problem and allows Factor threads to run concurrently with the event loop waiting for events.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Improvements to tooling&lt;/h3&gt;&lt;br /&gt;The main focus of my recent UI work, however, has been on improving &lt;a href="http://docs.factorcode.org/content/article-ui-tools.html"&gt;UI developer tools&lt;/a&gt;. I plan on making a screencast to highlight the new improvements soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-3022312706421649155?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/3022312706421649155/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=3022312706421649155' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3022312706421649155'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/3022312706421649155'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/04/recent-ui-improvements.html' title='Recent UI improvements'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-8094669145232101621</id><published>2009-04-08T03:21:00.004-04:00</published><updated>2009-04-08T03:52:06.607-04:00</updated><title type='text'>OpenGL textures and the power-of-two size restriction</title><content type='html'>Prior to version 2.0, the OpenGL specification required that texture dimensions be powers of two. This simplifies the implementation of texture mapping, because converting floating point texture co-ordinates (which are in the range 0..1) to texel coordinates is trivial; multiplying a floating point number by a power of two is essentially just adding a number to the exponent. Eventually, more capable drivers and graphics cards came along, and introduced the ability to use non-power-of-two texture dimensions. To signal this capability to GL applications, they report supporting the &lt;a href="http://www.opengl.org/registry/specs/ARB/texture_non_power_of_two.txt"&gt;GL_ARB_texture_non_power_of_two&lt;/a&gt; extension. OpenGL 2.0 implementations are required to support this extension.&lt;br /&gt;&lt;br /&gt;In practice, the only major OpenGL implementations which don't provide this extension are older X11 drivers, and the Microsoft Windows software renderer, which is a very bare-bones OpenGL 1.1 implementation.&lt;br /&gt;&lt;br /&gt;There is a trick for padding textures up to a power of two for implementations which don't support this extension; however, it doesn't seem to work everywhere. Instead of manipulating the bitmap in software before passing it on to a call to &lt;a href="http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/gl/teximage2d.html"&gt;glTexImage2D()&lt;/a&gt;, it is permissible as of OpenGL 1.1 to pass a bitmap pointer of &lt;code&gt;NULL&lt;/code&gt;. This creates a texture with uninitialized content. 
The &lt;a href="http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/gl/texsubimage2d.html"&gt;glTexSubImage2D()&lt;/a&gt; function is then used to fill in portions of the new texture. In particular, &lt;code&gt;glTexSubImage2D()&lt;/code&gt; places no restriction on the width and height, even if &lt;code&gt;GL_ARB_texture_non_power_of_two&lt;/code&gt; is not supported.&lt;br /&gt;&lt;br /&gt;The above trick works with the Windows software renderer. On the other hand, previous-generation MacBooks with Intel graphics suffer from a driver bug which results in artifacts appearing when this feature is used to render scaled textures. However, all OpenGL implementations in recent Mac OS X releases support non-power-of-2 textures, so on this platform, the workaround can be avoided entirely anyway.&lt;br /&gt;&lt;br /&gt;In Factor, the &lt;a href="http://docs.factorcode.org/content/vocab-opengl.capabilities.html"&gt;opengl.capabilities&lt;/a&gt; vocabulary provides some utility words to check for extensions. For example, a common operation is checking for either a specific OpenGL version, or an extension (new versions of the GL spec frequently absorb existing extensions):&lt;br /&gt;&lt;pre&gt;"2.0" { "GL_ARB_texture_non_power_of_two" } has-gl-version-or-extensions?&lt;/pre&gt;&lt;br /&gt;The &lt;code&gt;gl-extensions&lt;/code&gt; word outputs a sequence of all supported extensions. &lt;a href="http://paste.factorcode.org/paste?id=560"&gt;Here is the output from the Mesa software renderer on Linux&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I replaced my old code in &lt;a href="http://docs.factorcode.org/content/vocab-opengl.textures.html"&gt;opengl.textures&lt;/a&gt;, which padded &lt;a href="http://docs.factorcode.org/content/vocab-images.html"&gt;bitmap image objects&lt;/a&gt; out to powers of two using sequence manipulation words, with the new approach using texture sub-images instead. 
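&lt;br /&gt;&lt;br /&gt;As a rough sketch of the decision involved (this is not the actual &lt;code&gt;opengl.textures&lt;/code&gt; code), the dimensions of the texture storage could be chosen as follows. The word names &lt;code&gt;npot-textures?&lt;/code&gt; and &lt;code&gt;texture-storage-dim&lt;/code&gt; are made up for illustration, and I am assuming &lt;code&gt;next-power-of-2&lt;/code&gt; from the &lt;code&gt;math&lt;/code&gt; vocabulary; the capability check needs a current GL context:&lt;br /&gt;&lt;pre&gt;USING: kernel math opengl.capabilities sequences ;

! Hypothetical helper: true if textures need not be padded.
: npot-textures? ( -- ? )
    "2.0" { "GL_ARB_texture_non_power_of_two" }
    has-gl-version-or-extensions? ;

! Hypothetical helper: dim is a { width height } pair.
! Without the extension, round each dimension up to a power of two;
! the image itself is then uploaded as a sub-image of that storage.
: texture-storage-dim ( dim -- dim' )
    npot-textures? [ [ next-power-of-2 ] map ] unless ;&lt;/pre&gt;&lt;br /&gt;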
If the extension is present, no padding is performed, ensuring correct behavior on Mac OS X. This means that any code using &lt;code&gt;opengl.textures&lt;/code&gt;, such as the UI's text rendering and image support, should now spend less CPU time running Factor code.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.factorcode.org/content/vocab-opengl.html"&gt;Factor's OpenGL binding&lt;/a&gt; has been in development for 4 years and has seen contribution from 4 developers. For a demo of what it can do, try &lt;code&gt;"spheres" run&lt;/code&gt; in the UI listener. You will need a video driver that supports OpenGL 2.0 or &lt;code&gt;GL_ARB_shader_objects&lt;/code&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-8094669145232101621?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/8094669145232101621/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=8094669145232101621' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8094669145232101621'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/8094669145232101621'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/04/opengl-textures-and-power-of-two-size.html' title='OpenGL textures and the power-of-two size restriction'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-2898584695088161249</id><published>2009-04-07T00:33:00.003-04:00</published><updated>2009-04-07T01:18:13.897-04:00</updated><title type='text'>Rendering text on Windows via Uniscribe</title><content type='html'>My original goal was to use Pango to render Unicode text in the Factor UI on Windows. This didn't pan out, for a number of reasons:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Pango, Cairo and all of their dependencies add up to around 7Mb of DLLs that we'd need to ship with every Factor binary, as well as every standalone app binary deployed from Factor. That's too much when a 'Hello world' compiles down to a 500kb image.&lt;/li&gt;&lt;li&gt;Pango has various bugs on 64-bit Windows.&lt;/li&gt;&lt;li&gt;Pango doesn't play well with Microsoft's ClearType, and anti-aliased text is clipped for some reason.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;I'm sure these problems will get fixed eventually (except for the first one perhaps) and Pango is open source so I could always send a patch, but I'd rather spend my time working on interesting things instead, so I've decided to bypass Pango on Windows altogether and use Microsoft's native &lt;a href="http://www.microsoft.com/typography/developers/uniscribe/"&gt;Uniscribe API&lt;/a&gt; instead. Uniscribe ships as a standard part of Windows XP (actually it's been around since IE 5.0) and so Factor can depend on it being installed. On X11, I continue to use Pango; most *nix systems have at least part of the GNOME platform installed, so Pango and Cairo will be there, and I haven't seen any rendering issues in Pango with X11 either.&lt;br /&gt;&lt;br /&gt;I'm using the &lt;a href="http://msdn.microsoft.com/en-us/library/dd374124(VS.85).aspx"&gt;Uniscribe script string&lt;/a&gt; API, which is intended to be used for laying out and rendering a piece of text with a single font and color. 
It is directly analogous to &lt;code&gt;CTLine&lt;/code&gt; in Core Text and &lt;code&gt;PangoLayout&lt;/code&gt; in Pango.&lt;br /&gt;&lt;br /&gt;The function to create a script string, &lt;a href="http://msdn.microsoft.com/en-us/library/dd368566(VS.85).aspx"&gt;ScriptStringAnalyse&lt;/a&gt;, takes a device context handle as a parameter. The device context must be provided if the string is going to be rendered or measured.&lt;br /&gt;&lt;br /&gt;Before creating the string, the font and text color are set in the device context. To set the font, I look up a font handle with &lt;a href="http://msdn.microsoft.com/en-us/library/dd183499.aspx"&gt;CreateFont&lt;/a&gt;, and pass it to the &lt;a href="http://msdn.microsoft.com/en-us/library/dd162957(VS.85).aspx"&gt;SelectObject()&lt;/a&gt; function. There's an important caveat with &lt;code&gt;CreateFont()&lt;/code&gt;; if you want your font size to be specified in points (rather than pixels), you must pass a negative size. I noticed that a 12-point font was rendered way too small; changing the 12 to a -12 fixed the problem, so now the Factor UI does this for you on Windows.&lt;br /&gt;&lt;br /&gt;The text color is set by calling &lt;a href="http://msdn.microsoft.com/en-us/library/dd145093(VS.85).aspx"&gt;SetTextColor&lt;/a&gt;, and the background color (which is only rendered if a flag is passed to &lt;code&gt;ScriptStringOut()&lt;/code&gt;; see below) with &lt;a href="http://msdn.microsoft.com/en-us/library/dd162964(VS.85).aspx"&gt;SetBkColor&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;For both fonts and colors, the Uniscribe text rendering uses Factor's cross-platform &lt;a href="http://docs.factorcode.org/content/article-fonts.html"&gt;font&lt;/a&gt; and &lt;a href="http://docs.factorcode.org/content/article-colors.html"&gt;color&lt;/a&gt; types.&lt;br /&gt;&lt;br /&gt;Size measurement is done by calling &lt;a href="http://msdn.microsoft.com/en-us/library/dd368576(VS.85).aspx"&gt;ScriptString_pSize()&lt;/a&gt;. 
Unlike Core Text and Pango, Windows Uniscribe makes no distinction between metric and ink bounds. There is a provision for oversize glyphs, such as those in the Zapfino font (see my blog post on &lt;a href="http://factor-language.blogspot.com/2009/02/metric-bounds-versus-image-bounds.html"&gt;ink and metric bounds&lt;/a&gt;), where each glyph has an associated &lt;a href="http://msdn.microsoft.com/en-us/library/dd144857(VS.85).aspx"&gt;ABC metrics&lt;/a&gt; structure. I'll implement support for this later.&lt;br /&gt;&lt;br /&gt;When rendering to an offscreen context that is intended to be used as an OpenGL texture, as in the Factor UI, a bit of a chicken-and-egg problem occurs, because we need to know the size of the resulting text before we can create a bitmap. Fortunately, Windows GDI separates the process of creating a memory (off-screen) DC and allocating the bitmap storage for it, into two functions, &lt;a href="http://msdn.microsoft.com/en-us/library/aa922550.aspx"&gt;CreateCompatibleDC()&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/en-us/library/aa926298.aspx"&gt;CreateDIBSection()&lt;/a&gt;. The trick is to create a DC, create a script string, get its size, then allocate the bitmap for the DC, and finally render the string.&lt;br /&gt;&lt;br /&gt;Script strings can be rendered to their underlying DC with the &lt;a href="http://msdn.microsoft.com/en-us/library/dd368571(VS.85).aspx"&gt;ScriptStringOut&lt;/a&gt; function. This function takes a number of parameters. For instance, it can render the text selection for you.&lt;br /&gt;&lt;br /&gt;Font metrics -- the ascent, descent, and leading -- can be obtained by calling &lt;a href="http://msdn.microsoft.com/en-us/library/dd144941(VS.85).aspx"&gt;GetTextMetrics()&lt;/a&gt; on the DC.&lt;br /&gt;&lt;br /&gt;Once the text has been rendered into a memory DC, the underlying bitmap needs to be obtained and the graphics object handles freed. 
This is the third time I've implemented a similar-looking &lt;code&gt;make-bitmap-image&lt;/code&gt; combinator, so by now I've developed some utilities that I've abstracted out into the &lt;code&gt;images.memory&lt;/code&gt; vocabulary. The bitmap is cached as a texture in the same way on all platforms; I've discussed &lt;a href="http://factor-language.blogspot.com/2009/02/implementing-opengl-texture-caching-and.html"&gt;OpenGL texture caching&lt;/a&gt; before.&lt;br /&gt;&lt;br /&gt;Converting x co-ordinates to line offsets, and vice versa, can be done with &lt;a href="http://msdn.microsoft.com/en-us/library/dd368573(VS.85).aspx"&gt;ScriptStringXtoCP()&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/en-us/library/dd368567(VS.85).aspx"&gt;ScriptStringCPtoX()&lt;/a&gt;. Watch out for the fact that passing the length of the string to &lt;code&gt;ScriptStringCPtoX()&lt;/code&gt; is invalid; if you want the X-offset of the trailing edge of the last code point, you have to pass &lt;code&gt;length-1&lt;/code&gt; instead, with the &lt;code&gt;fTrailing&lt;/code&gt; parameter &lt;code&gt;TRUE&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Finally, script strings are freed by calling &lt;a href="http://msdn.microsoft.com/en-us/library/dd368568(VS.85).aspx"&gt;ScriptStringFree()&lt;/a&gt;. When binding to the API from another language, watch out for the type of the input parameter; it's a pointer to a &lt;code&gt;SCRIPT_STRING_ANALYSIS&lt;/code&gt; type, which is itself a pointer type. So you're passing a pointer to a pointer to a struct! I got the type wrong in my FFI declaration the first time around and was getting segfaults when deallocating stuff. 
Very annoying.&lt;br /&gt; &lt;br /&gt;The code implementing this is split between four vocabularies:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob_plain;f=basis/windows/offscreen/offscreen.factor;hb=HEAD"&gt;windows.offscreen&lt;/a&gt; - support code for creating GDI memory DCs. Originally written by Joe Groff for another purpose; I've generalized it and put it in a common location so that it can be used by both the Uniscribe code and offscreen gadget rendering.&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob_plain;f=basis/windows/fonts/fonts.factor;hb=HEAD"&gt;windows.fonts&lt;/a&gt; - looking up GDI font handles from Factor font descriptions&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob_plain;f=basis/windows/uniscribe/uniscribe.factor;hb=HEAD"&gt;windows.uniscribe&lt;/a&gt; - the bulk of the code; creating, rendering script strings, converting x co-ordinates to line offsets and vice versa&lt;/li&gt;&lt;li&gt;&lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob_plain;f=basis/ui/text/uniscribe/uniscribe.factor;hb=HEAD"&gt;ui.text.uniscribe&lt;/a&gt; - a backend for the UI's cross-platform text rendering support. 
Just a thin shim over the preceding vocabularies.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Here is what it looks like when all is said and done:&lt;br /&gt;&lt;img src="http://factorcode.org/uniscribe.png"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17087850-2898584695088161249?l=factor-language.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://factor-language.blogspot.com/feeds/2898584695088161249/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17087850&amp;postID=2898584695088161249' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2898584695088161249'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17087850/posts/default/2898584695088161249'/><link rel='alternate' type='text/html' href='http://factor-language.blogspot.com/2009/04/rendering-text-on-windows-via-uniscribe.html' title='Rendering text on Windows via Uniscribe'/><author><name>Slava Pestov</name><uri>http://www.blogger.com/profile/02768382790667979877</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17087850.post-6761050080139200802</id><published>2009-04-01T04:03:00.007-04:00</published><updated>2009-04-12T15:19:25.970-04:00</updated><title type='text'>Sup dawg, we heard you like Smalltalk so we put Smalltalk in your Factor so you can send messages while you roll</title><content type='html'>A week ago, I went to the &lt;a href="http://us.pycon.org/2009/about/summits/vm/"&gt;PyCon VM summit&lt;/a&gt; in Chicago. 
It was great fun and I got to meet various Internet celebrities such as John Rose (Sun JVM), Evan Phoenix (Rubinius), Charles Nutter (JRuby), Avi Bryant (Seaside), and Allison Randal (Parrot).&lt;br /&gt;&lt;br /&gt;Factor's own &lt;a href="http://useless-factor.blogspot.com"&gt;Daniel Ehrenberg&lt;/a&gt; and &lt;a href="http://code-factor.blogspot.com"&gt;Doug Coleman&lt;/a&gt; were also there, and I've heard rumors that the world-famous C++ programmer &lt;a href="http://duriansoftware.com"&gt;Joe Groff&lt;/a&gt; may have tagged along for the ride.&lt;br /&gt;&lt;br /&gt;Every VM implementor gave a short 10 minute talk. Mine was an overview of the optimizations and implementation techniques used in Factor; you can look at the slides in a recent Factor by running &lt;code&gt;"chicago-talk" run&lt;/code&gt; in the listener. A common theme at the summit was the distinction between platforms versus languages. The JVM, Parrot and PyPy people are all vying for language designers to implement their languages on top of these VMs. The idea is that the platform developers take care of the "hard" stuff such as the optimizing compiler, garbage collector, and libraries, whereas the language designer feeds a grammar into a parser generator, and bingo, instant programming language.&lt;br /&gt;&lt;br /&gt;While discussing Smalltalk minutiae with Dan and Avi Bryant, we hit upon the idea of trying to compile Smalltalk down to Factor. After all, Factor already has everything you'd want in a platform for hosting languages. We have a great optimizing compiler, a generational garbage collector, C FFI, a ton of libraries, and so on; in theory we've already done most of the hard work for someone who wants to make a language without dealing with low-level details. So while we've written countless DSLs in Factor since the very beginning, why not try implementing a more complete language? 
&lt;br /&gt;&lt;br /&gt;That's exactly what I did, and the code is in the repository under &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/smalltalk;hb=HEAD"&gt;extra/smalltalk&lt;/a&gt;. Run it by issuing the following command in the Factor listener:&lt;br /&gt;&lt;pre&gt;"smalltalk.listener" run&lt;/pre&gt;&lt;br /&gt;Then try a Smalltalk expression such as &lt;code&gt;1 to: 10 do: [:i|i print].&lt;/code&gt;, or look at &lt;a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob_plain;f=extra/smalltalk/eval/fib.st;hb=HEAD"&gt;extra/smalltalk/eval/fib.st&lt;/a&gt; for a more complete example. Incidentally, this Smalltalk fib runs much slower than a Factor fib, because of Smalltalk return semantics; I discuss the problem below as well as my planned solution to eliminate the overhead.&lt;br /&gt;&lt;br /&gt;In theory, there is a simple mapping from Smalltalk to Factor:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A Smalltalk class can be a Factor tuple class. Since ivars are pre-declared, the performance complications that Ruby and Python implementations must deal with are avoided.&lt;/li&gt;&lt;li&gt;Every Smalltalk selector can map into a Factor generic word; all selectors will be defined in a single vocabulary.&lt;/
