June 16, 2004

Strings, revisited

So, I finally did the last draft of the bytecode/assembly level string design for Parrot. It was a mixed bag--the per-string language tag is gone (darn!) but national character sets stay (yay!) with a set of "It's all Unicode no matter what you say" string ops thrown into the mix. Like any other engineering task with multiple conflicting requirements and strong proponents of different schemes, it's safe to say that everyone's unhappy with the result, but I think everyone can make do with what we have.

What ultimately resulted, if you don't feel like going and looking up the post in the archives (I'm offline so I don't have access to a URL), is this.

A 'string', for Parrot, is a combination of byte buffer and grapheme buffer. (Graphemes are the smallest representable units of text. They're usually represented by a single integer, but accented characters and some scripts may represent them with more than one integer.) Yes, this is a bad idea, but it's how programs deal with them, so we cope. Anyway, programs may look at these strings byte by byte, integer by integer, or grapheme by grapheme. Each string has an encoding (which is responsible for turning the bytes in the underlying buffer into integer code points) and a character set (which is responsible for giving some meaning to those code points) attached to it. Programs can deal with strings either in their 'native' form or as purely Unicode data, and if a string isn't Unicode, treating it as Unicode will cause Parrot to automatically convert it from whatever form it's in to Unicode. (Which makes the "all Unicode all the time" folks reasonably content)
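In rough C terms -- and this is only a sketch of the shape of the thing, with every name here (parrot_string_t, encoding_t, charset_t) invented for illustration rather than lifted from Parrot's source -- a string looks something like this:

    /* A very rough sketch of the design described above -- not Parrot's
     * actual structs, just the byte-buffer-plus-metadata idea. All the
     * names are made up for this example. */

    #include <stddef.h>
    #include <stdint.h>

    typedef struct encoding_t encoding_t;   /* bytes -> integer code points */
    typedef struct charset_t  charset_t;    /* code points -> meaning       */

    typedef struct parrot_string_t {
        unsigned char    *buffer;    /* the raw bytes, possibly mmapped */
        size_t            byte_len;  /* length of the buffer in bytes   */
        const encoding_t *encoding;  /* how to decode the bytes         */
        const charset_t  *charset;   /* what the code points mean       */
    } parrot_string_t;

    /* The per-encoding operations behind the three views (bytes, integers,
     * graphemes) a program can take of the same buffer. */
    struct encoding_t {
        /* decode the code point at byte offset *pos, advancing *pos past it */
        uint32_t (*codepoint_at)(const parrot_string_t *s, size_t *pos);
        /* number of code points making up the grapheme starting at pos */
        size_t   (*grapheme_len)(const parrot_string_t *s, size_t pos);
    };

Keeping the encoding and the character set as two separate attachments is what lets the same decoded code points carry different meanings depending on which character set is hung off the string.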

This duality provides the benefits of delayed (possibly delayed to never) conversion, saving CPU time; mmappability of the source text (hey, after all, if it's not Unicode on disk but you never convert it, and are only reading it, why not just map the file into memory and pretend you read it the old-fashioned way?); and the ability to natively manipulate non-Unicode text without having to pretend there are files involved. (Because sometimes you do need to use native character sets without files--if you're generating zip files in memory, or talking to a database.) Plus there's the bonus of not burning conversion time to hoist Latin-n text to Unicode if you really do want to treat it as Latin-n text.
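To make the mmap point concrete, here's the same idea in sketch form. It reuses the made-up parrot_string_t from above; string_from_file, latin1_encoding, and latin1_charset are hypothetical names, not real Parrot functions:

    /* Sketch: wrap a file's bytes in a string header without copying or
     * converting anything. The latin1_* tables stand in for whatever the
     * file's real encoding and character set happen to be. */

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    extern const encoding_t latin1_encoding;  /* hypothetical tables */
    extern const charset_t  latin1_charset;

    parrot_string_t string_from_file(const char *path) {
        parrot_string_t s = {0};
        struct stat st;
        int fd = open(path, O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            if (fd >= 0) close(fd);
            return s;                       /* error handling kept minimal */
        }

        /* Map the file read-only; the bytes stay in their on-disk form. */
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
            return s;

        s.buffer   = p;
        s.byte_len = st.st_size;
        s.encoding = &latin1_encoding;
        s.charset  = &latin1_charset;
        return s;
    }

    /* Conversion to Unicode only happens if some op actually demands a
     * Unicode view of s; pure Latin-1 processing never pays that cost. */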

The encoding and character set systems are all pluggable and dynamically loadable as well, so if you don't want to yank in ICU to process your ASCII text, you don't have to. Which is swell for the large number of people who don't want to.
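For a feel of what "pluggable" might look like at the C level -- again just a sketch, with find_encoding, lookup_registered_encoding, and the plugin naming scheme all invented for illustration rather than taken from Parrot's real API -- the lookup path could go something like:

    /* Look the encoding up in a registry; if it isn't compiled in, try
     * loading a shared library named after it. A pure-ASCII program never
     * reaches the dlopen(), so it never drags in ICU or anything else. */

    #include <dlfcn.h>
    #include <stdio.h>

    extern const encoding_t *lookup_registered_encoding(const char *name);

    const encoding_t *find_encoding(const char *name) {
        const encoding_t *enc = lookup_registered_encoding(name);
        if (enc)
            return enc;

        /* Not built in: try a plugin named after the encoding. */
        char libname[128];
        snprintf(libname, sizeof libname, "libparrot_enc_%s.so", name);
        void *handle = dlopen(libname, RTLD_NOW);
        if (!handle)
            return NULL;

        /* The plugin's init hook registers its encoding(s), after which
         * the registry lookup succeeds. */
        void (*init)(void) = (void (*)(void))dlsym(handle, "encoding_init");
        if (init)
            init();
        return lookup_registered_encoding(name);
    }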

The single most difficult part of this job, by the way, isn't the technical issues. It's the politics. But at least I knew that going in. (Though, honestly, knowing and understanding are two very different things)

Posted by Dan at June 16, 2004 10:30 PM