January 19, 2004

It's never as easy as it seems

It's time to add case mangling to Parrot. In part because I need it, and in part because, well, it's really well past time to be able to reasonably say "Gimme a (lower|upper|title)-case version of this string".

If you're thinking "Why is this not there yet?", well, I'm tempted to quote Written Language Barbie -- case-mangling is hard. I won't, though, because it's not hard, it's just tedious. And it requires a fair amount of thought to set up the frameworks so you can actually do it properly. (Case identification belongs in the character set, while case transformation is a language-specific operation--if you split language and character set operations out, the functions then need to live in separate spots)
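To make that split concrete, here's a rough sketch in Python rather than anything Parrot-shaped. Every name in it (is_upper, is_lower, upcase, TURKISH_UPPER) is made up for illustration and is not Parrot's API, and the Turkish dotted/dotless i is just the best-known case where the language, not the character set, decides what the uppercase form is.

```python
import unicodedata

# Hypothetical names throughout; none of this is Parrot's actual API.

def is_upper(ch: str) -> bool:
    """Case identification: answered from character data alone."""
    return unicodedata.category(ch) == "Lu"

def is_lower(ch: str) -> bool:
    """Case identification: answered from character data alone."""
    return unicodedata.category(ch) == "Ll"

# Turkish pairs dotted i with dotted capital I (U+0130), and
# dotless i (U+0131) with the plain capital I.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}

def upcase(text: str, language: str = "en") -> str:
    """Case transformation: the result depends on the language."""
    if language == "tr":
        # Apply the Turkish exceptions first, default mapping otherwise.
        return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in text)
    return text.upper()
```

Calling upcase("istanbul") and upcase("istanbul", "tr") on the same input gives two different strings, which is why the transformation can't live in the same spot as the identification.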

Luckily for me, one of the times I was up at O'Reilly's Cambridge office with my car I thought to grab a copy of Ken Lunde's CJKV Information Processing book. (And a car was almost a requirement--at 1100+ pages you could hurt yourself hauling it around. Besides, it may well count as a deadly weapon so you couldn't take it on the T anyway) Not, mind, because it covers everything that we'd ever possibly want to do (it's only Chinese, Japanese, Korean, and Vietnamese) but because it's a good reference for a number of encodings and character sets that aren't Unicode, which is nice. (And that I have a personal interest in, which is also nice -- I have no doubt that Arabic, Hebrew, and Cyrillic are fascinating, but they're not for me) They've also the advantage of being relatively simple, something that Unicode definitely is not. Plus there's the added advantage of having several semi-related character sets handy. (Well, OK, not exactly related as such, but there are well-known transforms amongst at least some of them, and if we can't get the Big5 transforms right it's time to pack it in)

Yes, this means that Parrot will probably get loadable encoding, character set, and language library code in Real Soon Now. With Unicode too, just to be complete, at least if I can get ICU building properly on Debian Linux. (There's something weird going on that makes it fail on a latest-rev Sarge system, though odds are I'll just punt and teach Configure to link against the system ICU if it's installed)

Soon, I'll be able to make a fool of myself in several languages, not just one! Woohoo! :)

Posted by Dan at January 19, 2004 05:17 PM

Seems like quite a task... Just out of curiosity, how many scripts actually have case distinction? Latin, Greek, Cyrillic, and some variants of those come to mind... but does this include handling case mangling of strings written in a script not originally intended for that language (such as romanizations)? Are there, for instance, any rules on how English should be titlecased in the Cyrillic alphabet? Or romanized Chinese in the Latin alphabet... quite a task... :) Oh well, as long as you get lowercasing in too, so I don't have to write that function myself for every web page I make, then I'm happy! ;)

Posted by: Henrik Falck at January 20, 2004 05:04 AM

I honestly don't know how many scripts have case distinctions--this may well be something the Greeks were ultimately responsible for, so it's only in languages with heavy Greek influence (at least culturally). Since that includes my native language, it's worth putting in. :)

There are rules for how to handle at least some of this stuff, though I'm not sure anyone's ever directly addressed the rules for case-mangling romanized <insert language here>. My bet is that romanized rules apply, but... On the other hand, as problems go this is relatively low down on the list o' things, and I expect someone'll have words with me about it after we put it in wrong.

Posted by: Dan at January 20, 2004 08:47 AM