January 20, 2004

The Pain of Text

Yeah, this stuff's all getting cranky. Deal. :)

At the moment, I'm trying to work on specifying text stuff for Parrot. Not simple, of course, because text is such a massive pain. Right now I'm just trying to sort the various functions on characters and strings into the right spot so they can be properly overridden, thumped, assaulted, and generally beaten about.

If you've been following along, you've no doubt seen the rants about text, so I won't reprise them (much) and instead go for the actual useful bits. As far as I can tell (and this is all welded deep into Parrot's string handling), there are three basic parts to this:

  1. Encoding: to turn a stream of octets into integer code points
  2. Character set: to give meaning to the code points
  3. Language: to determine the behaviour of the code points

And yeah, there is some overlap between what the character set and the language do. That's part of the problem I'm facing--this stuff's all been invented a dozen or more times and the decisions that were made were all very reasonable but not entirely compatible.

The encoding bit's the least controversial of the layers. Even with the multibyte non-Unicode encodings that do escape-byte stuff there's a pretty straightforward way to map to a 32-bit integer, so that part's easy. (Note that easy and boring have no relation -- putting together the different byte-to-codepoint mapping tables and corresponding code is going to be terribly dull.)
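
A minimal sketch of that octets-to-integers step, in Python rather than anything Parrot actually uses. The function name `read_codepoints` is made up for illustration. One caveat: Python's codecs fold the encoding and the character set mapping together, so the integers that come out are always Unicode code points, not the native code points of the source character set.

```python
def read_codepoints(octets: bytes, encoding: str) -> list[int]:
    """Decode a run of octets into a list of integer code points."""
    return [ord(ch) for ch in octets.decode(encoding)]


# The same two characters (n, n-with-tilde) in two different encodings.
latin1_octets = bytes([0x6E, 0xF1])          # one octet per character
utf8_octets = bytes([0x6E, 0xC3, 0xB1])      # n-with-tilde takes two octets

# An encoding that uses escape sequences to switch character sets.
iso2022_octets = "\u65e5\u672c".encode("iso2022_jp")

for octets, name in [
    (latin1_octets, "latin-1"),
    (utf8_octets, "utf-8"),
    (iso2022_octets, "iso2022_jp"),
]:
    codepoints = read_codepoints(octets, name)
    # Every result fits comfortably in a 32-bit integer.
    print(name, [hex(cp) for cp in codepoints])
```

Different octet streams, same integers out the far end, which is about all the encoding layer needs to promise.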

The character set's generally non-controversial as long as you don't pay attention to how the individual characters actually look. (If you do, then fights break out and it's not pretty.) Unfortunately Unicode adds in the twist of combining characters[1], so it's not quite enough to look at a single code point to get a single character -- in many common cases for me (most of the languages of western Europe) you've the potential of needing two or more code points to represent a single character. I'm not sure if there are cases where it's reasonable to deal with the parts of a character[2], but people seem to insist. I dunno, I can't see any circumstances where n and ñ (that second character's an n with a tilde over it, which can be represented as two code points in Unicode) could be in any way equivalent, but what the heck do I know?
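
To make the code-point-versus-character gap concrete, here is a rough Python sketch. The helper `read_characters` is hypothetical, and gluing combining marks onto the preceding base character is a simplification: real grapheme cluster rules (Unicode's UAX #29) cover more cases than this.

```python
import unicodedata


def read_characters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it."""
    characters: list[str] = []
    for ch in text:
        if characters and unicodedata.combining(ch) != 0:
            characters[-1] += ch      # attach the mark to the previous base
        else:
            characters.append(ch)     # start a new character
    return characters


composed = "ma\u00f1ana"       # n-with-tilde as one code point
decomposed = "man\u0303ana"    # n followed by U+0303 COMBINING TILDE

print(len(composed), len(read_characters(composed)))      # 6 6
print(len(decomposed), len(read_characters(decomposed)))  # 7 6
```

Both strings hold six characters as a reader would count them, but only one of them holds six code points.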

The language bit... that's where things get interesting. Not fun, mind, but interesting. Language is where I put the transforms and meaning-association stuff. Case-folding is a matter for language, as are character classification, sorting, and more complex things like word-break determination.

Part of the problem with the language bit is in defaulting -- it arguably ought to be pulled in from the character set, since the language code ought to be independent of the character set, but some of the sets are huge (Unicode!) and often you just don't care for characters out of your language. If you've got a string of Chinese text with "llama" thrown in there you're probably going to treat that as a five-character word rather than a four-character one. (Yes, I know, these days in most places even text that's really marked as Spanish treats ll as a two-character sequence rather than one--humor me, I couldn't find an accented character in a non-Roman-based character set in the 20 seconds I took to look.) Then there's the issue of whether some of these characters ought to be classified one way or another. Even if you have Unicode text, should 一 (ichi, one in Japanese) (which should look like a horizontal bar, assuming it pasted in right, and I got the right character, and you can view it... isn't text fun?) be considered a digit if it's in a string tagged as French? (Heck, should it be in there in a string tagged as Japanese? The number/non-number-word distinction's a bit fuzzy there. Or at least I'm fuzzy on it.) Should it even be considered a word character? And yes, you can argue that this is a good reason to put in restrictions on allowable data, but that's not something Parrot can really do, so it needs dealing with.
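
For what it's worth, here is how one existing implementation answers those questions for that character. This is Python's standard library consulting the Unicode character database, with no notion of language tagging at all, so treat it as one data point rather than the right answer.

```python
import re
import unicodedata

ichi = "\u4e00"  # the horizontal-bar character discussed above

facts = {
    "name": unicodedata.name(ichi),             # 'CJK UNIFIED IDEOGRAPH-4E00'
    "category": unicodedata.category(ichi),     # 'Lo' (letter, other)
    "isdigit": ichi.isdigit(),                  # False
    "isnumeric": ichi.isnumeric(),              # True
    "numeric_value": unicodedata.numeric(ichi), # 1.0
    "isalpha": ichi.isalpha(),                  # True
    "matches_backslash_w": re.fullmatch(r"\w", ichi) is not None,  # True
    "matches_backslash_d": re.fullmatch(r"\d", ichi) is not None,  # False
}

for key, value in facts.items():
    print(f"{key}: {value}")
```

So the database's default position is "a letter, and a word character, that happens to carry a numeric value, but not a digit". Whether a string tagged as French or Japanese should agree is exactly the kind of decision that ends up in the language layer.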

Parts of the language handling code are also intimately tied to the character set (you can't upcase "a" to "A" if you don't know that the code point you were handed was an "a") so you almost need to have a multiple-dispatch system set up with per-charset language tables and/or code. Fun enough with Roman-based alphabets but it gets potentially really fun when you start throwing in all the Asian text variants. (I think, ultimately, it'll be relatively simple. Despite the fact that it looks like you could use Shift-JIS (a Japanese character set) to write out a chunk of Chinese text, I'm not sure it'd be considered Chinese, in which case we have a much more restricted set of charset/language pairs. Except for Unicode, which'll sleep with anyone.)
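
One way to picture that dispatch is a table keyed on (charset, language) pairs with a per-charset fallback. The sketch below is Python, every name in it (`register`, `upcase`, the charset and language tags) is invented for illustration, and the Turkish dotted-i rule is just a well-known example of language-specific casing, not something taken from Parrot.

```python
from typing import Callable, Optional

Upcaser = Callable[[list[int]], list[int]]

# (charset, language) -> upcase routine; language None is the charset default.
_UPCASE: dict[tuple[str, Optional[str]], Upcaser] = {}


def register(charset: str, language: Optional[str] = None):
    """Decorator that files an upcase routine under a charset/language pair."""
    def wrap(fn: Upcaser) -> Upcaser:
        _UPCASE[(charset, language)] = fn
        return fn
    return wrap


@register("ascii")
def _ascii_upcase(codepoints: list[int]) -> list[int]:
    # Only a-z have upper-case forms in this character set.
    return [cp - 32 if 0x61 <= cp <= 0x7A else cp for cp in codepoints]


@register("unicode")
def _unicode_upcase(codepoints: list[int]) -> list[int]:
    text = "".join(chr(cp) for cp in codepoints)
    return [ord(ch) for ch in text.upper()]


@register("unicode", "tr")
def _unicode_turkish_upcase(codepoints: list[int]) -> list[int]:
    # Turkish: dotted i upcases to U+0130, capital I with dot above.
    text = "".join(chr(cp) for cp in codepoints).replace("i", "\u0130")
    return [ord(ch) for ch in text.upper()]


def upcase(codepoints: list[int], charset: str,
           language: Optional[str] = None) -> list[int]:
    """Use the language-specific routine if present, else the charset default."""
    fn = _UPCASE.get((charset, language)) or _UPCASE[(charset, None)]
    return fn(codepoints)


word = [ord(ch) for ch in "istanbul"]
print("".join(chr(cp) for cp in upcase(word, "unicode")))        # ISTANBUL
print("".join(chr(cp) for cp in upcase(word, "unicode", "tr")))  # İSTANBUL
```

If the set of sensible charset/language pairs really is small, the table stays small too, and only the Unicode rows multiply.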

Anyway, I think we can do layering stuff enough to hide this. The encoding layer can hand codepoints back and forth, do some transcoding, and that's about it. (Well, that and some metadata -- lengths and such.) Easy enough.

The character set layer can hand you characters, and provide some defaults for the language code to work with. Most of the per-character informational code can live here (is it a letter, is it a digit, is it upper-case, and so on) though the language code potentially ought to get in the way. That'll be simple delegation for the most part.

The language layer is where the transformational and multicharacter fun lives. Case-mangling and word break detection live here (and yes, I know, for some languages word break detection requires an extensive dictionary, complex heuristics, a lunar calendar, and a good random number generator, but...) as do a few other things.

So, for the moment, the list is (and yeah, I'll post it to the internals list):

Encoding
read_codepoint
write_codepoint
to_encoding
from_encoding
knows_encoding

Character Set
read_character
write_character
substring
defaults for language

Language
is_upper
is_lower
upcase
downcase
titlecase
is_alpha
is_number
is_space
charname (maybe)
is_wordbreak (this'll have to take a position)
compare (takes two strings for sorting)
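
As a rough picture of how those three lists could hang together, here is a Python sketch of the layering. The function names come from the lists above, but the signatures, the `default_is_*` names, and the Latin-1 classes are all assumptions made up for this example, and only a few functions per layer are shown.

```python
from abc import ABC, abstractmethod


class Encoding(ABC):
    """Octets <-> integer code points."""

    @abstractmethod
    def read_codepoint(self, octets: bytes, offset: int) -> tuple[int, int]:
        """Return (code point, offset of the next code point)."""

    @abstractmethod
    def write_codepoint(self, codepoint: int) -> bytes:
        """Return the octets that encode one code point."""


class CharacterSet(ABC):
    """Gives code points meaning and supplies defaults for the language."""

    @abstractmethod
    def default_is_alpha(self, codepoint: int) -> bool: ...

    @abstractmethod
    def default_is_number(self, codepoint: int) -> bool: ...


class Language:
    """Delegates to the character set unless a subclass overrides."""

    def __init__(self, charset: CharacterSet) -> None:
        self.charset = charset

    def is_alpha(self, codepoint: int) -> bool:
        return self.charset.default_is_alpha(codepoint)

    def is_number(self, codepoint: int) -> bool:
        return self.charset.default_is_number(codepoint)


class Latin1Encoding(Encoding):
    def read_codepoint(self, octets: bytes, offset: int) -> tuple[int, int]:
        return octets[offset], offset + 1   # one octet per code point

    def write_codepoint(self, codepoint: int) -> bytes:
        return bytes([codepoint])


class Latin1CharacterSet(CharacterSet):
    def default_is_alpha(self, codepoint: int) -> bool:
        # Latin-1 code points coincide with the first 256 Unicode code points.
        return chr(codepoint).isalpha()

    def default_is_number(self, codepoint: int) -> bool:
        return 0x30 <= codepoint <= 0x39    # ASCII digits only


encoding = Latin1Encoding()
language = Language(Latin1CharacterSet())
codepoint, next_offset = encoding.read_codepoint(b"\xf1x", 0)
print(hex(codepoint), next_offset, language.is_alpha(codepoint))
```

A language that disagrees with the character set's defaults would subclass `Language` and override just the functions it cares about; everything else falls through by delegation.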

Ah, damn, then there's the question of whether sorting and equivalence testing should take encoding into account or not. (This is for Unicode mainly where you can have composed and decomposed versions of the same character) Though the answer there is "it depends", I'm sure. Bugger.
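
A small Python sketch of the "it depends" in question. The `compare` here is a stand-in: it orders by raw code point value, which is not real language-aware collation, and the `normalize` flag is an invented parameter that only shows how the composed/decomposed choice changes the answer.

```python
import unicodedata


def compare(a: str, b: str, normalize: bool = True) -> int:
    """Return -1, 0, or 1, optionally normalizing both strings to NFC first."""
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return (a > b) - (a < b)


composed = "\u00f1"       # n-with-tilde as a single code point
decomposed = "n\u0303"    # n followed by U+0303 COMBINING TILDE

print(compare(composed, decomposed))                   # 0
print(compare(composed, decomposed, normalize=False))  # 1
```

Same character on screen, equal or unequal depending on a flag, so whichever default gets picked, someone will want the other one.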

[1] Something I think may be unique to Unicode. Might be wrong about that, though.
[2] As opposed to the individual bytes of the encoded form, which are useful to deal with.

Posted by Dan at January 20, 2004 11:47 AM
Comments

There are combining characters in the ISO-2022 framework.


Here are some useful links--most of which you probably already have, but just in case:



Using the Unicode data files, we can find the answer to Henrik's question on the previous comment page. Searching for CAPITAL in the file "unicode.txt", I see Latin, Greek, various typographical variants thereof, Cyrillic, Georgian, Armenian, Coptic, and Deseret. All appear to be ultimately from the Greek alphabet.

Posted by: Ken Hirsch at January 20, 2004 02:06 PM

The previous comment mentioned it, but I thought I'd throw an example in. Combining characters definitely aren't unique to Unicode. During the survey presented in the first few chapters of Unicode Demystified, the book mentions that while it was feasible to include "precomposed" characters in Latin-1 because there aren't that many combinations, Hebrew in particular was mentioned as a language that winds up with too many combinations because of the way letters and vowels work, so a pre-Unicode character encoding that existed (part of ISO-2022, like Ken said) used combining characters to make it work.

Posted by: Keith at January 20, 2004 02:23 PM

Argh. That dull thudding sound you hear is me beating my head against a wall. While I suppose it doesn't make things any worse, since Unicode already does them, I'm not looking forward to more combining character sets.

This'd all be just so much easier if we had 64-bit characters. I suppose it's a bit late to lobby for that. :(

Posted by: Dan at January 20, 2004 02:30 PM

BTW you can't write meaningful Simplified Chinese (which is currently used in mainland China) text with Japanese kanji characters. Characters in Simplified Chinese are literally simplified ones (from Traditional Chinese characters, still in use in Taiwan) made after WWII, and Japanese kanji are also simplified ones, but differently and with Japanese-original kanji characters, plus purely Japanese characters (kana). They have very little in common.

Posted by: KL at January 21, 2004 09:30 AM

If Simplified Chinese and kanji have no real overlap I'd ask why the Unicode folks unified them all, but then someone'd tell me and I'm pretty sure I'd ultimately regret that. :)

Oddly enough, I'm glad there's no meaningful overlap, as it means few people are likely to do it so I can postpone thinking about it for some time period, possibly forever. That'd be good.

Some days I think I don't read enough languages for this job. But, then, I'm not sure anyone reads enough languages for this job...

Posted by: Dan at January 21, 2004 02:22 PM