Friday, February 01, 2008

24-bit strings are in

I implemented the new strings, which start off as strings of octets and upgrade transparently to 24-bit strings capable of representing all Unicode 5.0 code points. Dan's Unicode library is almost finished, too.
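To illustrate the idea of starting with octets and upgrading on demand, here is a minimal Python sketch. It is not Factor's actual implementation: the class name `UpgradingString`, the `low`/`high` split, and the choice to store the upper bits in a separate 16-bit array are all assumptions made for illustration.

```python
from array import array

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point; fits in 21 bits


class UpgradingString:
    """Hypothetical string: one octet per element until a wider code point arrives."""

    def __init__(self, length):
        self.low = bytearray(length)  # low 8 bits of every code point
        self.high = None              # upper 16 bits, allocated only when needed

    def __len__(self):
        return len(self.low)

    def is_wide(self):
        """True once the string has been upgraded to 24 bits per element."""
        return self.high is not None

    def __getitem__(self, i):
        cp = self.low[i]
        if self.high is not None:
            cp |= self.high[i] << 8
        return cp

    def __setitem__(self, i, cp):
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError("not a Unicode code point")
        if cp > 0xFF and self.high is None:
            # Transparent upgrade: each element now has 8 + 16 = 24 bits.
            self.high = array("H", [0] * len(self.low))
        self.low[i] = cp & 0xFF
        if self.high is not None:
            self.high[i] = cp >> 8


s = UpgradingString(3)
for i, ch in enumerate("abc"):
    s[i] = ord(ch)
narrow_before = s.is_wide()   # still one octet per element
s[1] = 0x1D11E                # a code point outside the octet range
wide_after = s.is_wide()      # the string has been upgraded
```

Callers only ever see code points through indexing; whether the string is currently narrow or wide is an internal detail, which is what makes the upgrade transparent.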

Many languages shun integration with operating systems and existing standards such as Unicode, HTTP, XML, and so on, because these standards are "dirty" and are beneath them. I think this philosophy is bullshit; it is just an attempt to pass off incompetence as elitism.

Factor is all about "embracing and extending" existing standards: we inter-operate with your existing software but allow you to build something better on top.

4 comments:

Anonymous said...

Many languages shun integration with operating systems and existing standards such as Unicode, HTTP, XML, and so on, because these standards are "dirty" and are beneath them. I think this philosophy is bullshit; it is just an attempt to pass off incompetence as elitism.

Was that a hidden reference to Arc? ;-)

Anyway, I love to see that there is finally somebody who is able to abstract himself from specific encodings and see Unicode strings for what they are: a sequence of Unicode code points.

Anonymous said...

AFAIK Unicode requires 21 bits to store code points. Why did you choose 24 rather than 32 bits for a Unicode string type? I understand memory compactness concerns, but what about unaligned memory access performance? Do you have some benchmarks you could share?

Slava Pestov said...

Anonymous: all access is aligned, please see http://factor-language.blogspot.com/2008/01/some-things-ive-been-working-on.html

I chose this representation over UTF-32 because it compresses down to a string of octets if there are no code points > 255 in the string.

Anonymous said...

Nice story as for me. I'd like to read more concerning this topic.