October 14, 2003

Strings, some practical advice

Now that I've gone off a bit about the structure of strings, you're probably wondering: what good is it all, exactly? Knowing how things work doesn't necessarily translate to any sort of utility, at least not immediately. (I think it's essential for real understanding, but real understanding is in short supply on so many things...) But some good advice is in order.

Like so many things I go on about, keep in mind that this is from the standpoint of a VM designer/implementor--I understand and do infrastructure. I have precious little experience implementing applications, unless you consider something like Parrot an application. (Though in this realm I suppose it is, of a sort.) I don't know how much of this will be useful, but even if all you come away with is a sense of bitter cynicism about text handling, I'll consider this worthwhile.

First piece of advice: Lose the conceit that Unicode Is The One True Answer. It isn't. It isn't even One Of Many True Answers. It is, at best, a partial solution to a problem that you probably don't even have. A useful partial solution (unlike others, not that I'd mention XML here... :) to be sure, but that's all it is. That's all it attempts to be, and it generally succeeds, which is fine.

The problem Unicode shoots for is to build a single character set that can encode text in all the world's languages simultaneously. Not make it easy to manipulate, just to represent. That's it. Do you need to do that? (Really? Really?) Odds are the number of people who do is vanishingly small, and almost all of us are living on the other side of the application divide. At best most people and applications need to handle stuff in the 7-bit ASCII range, possibly the full Latin-1 or Latin-9 set (with its accented characters and whatnot), and maybe a single language-specific set of characters. Unicode is, as a practical matter, unnecessary for most folks. Latin 1 or Latin 9 works for Western Europe and North and South America; China's served by GB 2312 (Simplified), Taiwan by Big5 (Traditional), Japan by one of the JIS standards, and Korea by EUC-KR. A good chunk of India may well be served by Latin-9 as well. That's about 4 billion people, give or take a few, who don't need Unicode in day-to-day computing.

Yeah, I know, "But what about data exchange? What about cross-language data?" What about it? Odds are you read English (as you're reading this) and at best one other language. You wouldn't know what the hell to do with Japanese/Chinese/Korean/Hebrew/Arabic/Cyrillic text if it was handed to you, so what good is being able to represent it? I mean, great, you use Unicode and you're in a position to take in a vast number of characters that you have no clue what to do with. Swell. You now have a huge range of text to badly mishandle.

Second piece of advice: Try not to care about how your text is represented. It's likely your program really doesn't have to care.

Really. Think about it. When was the last time you had to actually care that a capital A has an ASCII value of 65? The overwhelming majority of the time you're dealing with character data symbolically or abstractly. You want to know if the string "ABC" is in your string data, or you want to make it uppercase, or you want to sort it. None of that needs to know, or care, that A is 65, and a good chunk of the code that does know is either busted or is (or ought to be) library code rather than application code. The string library code does need to know, but you don't. Well, unless you're writing low-level library code, but I'd bet you aren't.
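
To make that concrete, here's a rough sketch (Python, purely for illustration; the data and names are mine, not anything from a real application) of what symbolic string handling looks like. Nothing in it knows or cares what number any character maps to:

    import locale

    names = ["Møller", "Abe", "García"]

    # Membership: "is this string in my data?" -- no byte values involved.
    print(any("Ab" in name for name in names))    # True

    # Case mapping: the library owns the case tables; the application doesn't.
    print([name.upper() for name in names])       # ['MØLLER', 'ABE', 'GARCÍA']

    # Sorting: the default is raw code-point order; real collation is library
    # work too (locale.strxfrm here, or something heavier like ICU).
    locale.setlocale(locale.LC_COLLATE, "")       # whatever the environment says
    print(sorted(names, key=locale.strxfrm))

The day you catch yourself caring about the actual numbers is the day you've wandered into library territory.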

Yes, I know, there are languages that sort of force this on you if you use the native string (or, in the case of C, "string") types, but even then you usually don't have to know, and often you're better off paying as little attention as possible to the way the base string type is represented, as it likely sucks and is often depressingly broken.

I say this as someone who does have to make strings work right. It's a lot of work, sometimes bizarre and unusual work, filled with more edge cases than a Seurat that someone's played connect-the-dots on. Leave as much of the work to the libraries as you can. It won't be enough--it never is--but it'll make things less bad.

Third piece of advice: Don't deal with string data you don't understand. That means that if you don't understand Japanese, don't let Japanese text get in. No, I'm not saying this because I'm elitist or because I think you should live in linguistic isolation, but be realistic. If your application is designed to handle, say, personnel data for Western European companies, then you're probably fine with Latin 9, and using it restricts your input such that there's far less invalid data that can get in. If you're writing a Japanese text processing system, then one of the JIS sets is the right one for you.

In those (let's be honest, really rare) cases where you think taking this piece of advice will turn out to be restrictive, well, then take piece of advice #2 and use an abstract character type, restricting it to the set you really need to begin with. You can open it up later if you need to.
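
Here's a rough sketch of what that restriction might look like (Python, with a made-up function name, and Latin-9 picked purely as the example repertoire). Let the codec machinery do the checking and keep the abstract string type internally:

    def accept_latin9(text: str) -> str:
        """Reject anything outside the repertoire this application claims to handle."""
        try:
            text.encode("iso8859-15")   # round-trip check only; we keep the str itself
        except UnicodeEncodeError as err:
            bad = text[err.start]
            raise ValueError(f"character {bad!r} is outside the Latin-9 repertoire") from err
        return text

    accept_latin9("Müller, François")       # fine
    try:
        accept_latin9("日本語")              # data you can't handle stays out at the door
    except ValueError as problem:
        print(problem)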

Fourth piece of advice: Make your assumptions explicit! That means that if you're assuming you're getting English (or Spanish, or French, or Vietnamese) text, then don't just assume it--specify it, attach it to your string data, and have checks for it, even if they're just stubs. That way, when you do open things up, you've a reasonable chance of handling things properly. Or at least less wrong.
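
Something along these lines (a Python sketch; the class and field names are hypothetical, not any existing library) is all it takes to write the assumption down and leave a hook for the check:

    from dataclasses import dataclass

    @dataclass
    class TaggedText:
        text: str
        language: str = "en"        # the assumption, stated instead of implied
        repertoire: str = "ascii"   # the character set we claim to support

        def validate(self) -> None:
            # A stub for now: just prove the text fits the declared repertoire.
            # When the application opens up later, real per-language checks
            # (or a call into a proper library) replace this.
            self.text.encode(self.repertoire)

    note = TaggedText("Hello, world")   # English, ASCII -- and it says so
    note.validate()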

Fifth piece of advice: Learn a radically different foreign language. And its writing system. For folks coming from a European language background, one of the Asian languages is a good choice, and vice versa. Even if you're never at all fluent, competent, or even haltingly unintelligible, you'll at least get some feel for the variation that's available. Believe me, there's nothing quite like moving from English (with its dead-simple character set and trivial word boundaries) to Japanese (where there are tens of thousands of characters, finding word boundaries requires a dictionary and heuristics, and there are still some valid arguments as to where word boundaries even are). I expect Chinese is similar, and I can't speak for any other language.

Sixth piece of advice: Be very cautious and very diplomatic when handling text. The written word is a fundamental embodiment of a culture and a language. You disparage it, or existing means of encoding it, at your peril. After all, China's had a written language for a good three thousand years, give or take a bit--who the heck are you to tell them that they're doing it wrong? Telling someone their language makes no sense or their writing system is stupid (yes, I've heard both) is not a good way to make friends, influence people, or make sure your program works on the whole range of data you've so foolishly decided to take.

Seventh, and final, piece of advice: There is no 30-second solution to make a program handle multiple languages or the full range of Unicode right. (Okay, that's not exactly true. Step 1, bend over. Step 2, grab ankles. You can bet that step 4 isn't profit...) Honestly, don't try. If you're only going to make a cursory or half-hearted attempt at handling the full range of text you're going to accept, I can pretty much guarantee that you'll screw it up more than if you didn't try at all, and screw it up a lot more than if you made a real try at getting it right.

Aren't strings fun? Yes, I'm well aware that a good chunk of the practical advice is "If you don't have to, don't." The corollary to that is that if you do have to, make sure you know what you're doing and that you know what you don't know you're doing. Understanding your areas of ignorance (areas where you should go find someone who's not ignorant to help you out) will get you much further than you might think.

Update: Yeah, this originally said Latin-1 instead of Latin-9, but it turns out I was wrong there--you can't do French and Finnish in Latin-1. The Eastern European languages don't work with Latin-1 or 9 either, unfortunately, which does sort of argue for Unicode if you need to work across the EU.
Update #2: Latin 9 and Latin 1 are nearly the same--Latin 9 drops some of the less-used characters and replaces them with the characters missing for French and Finnish. Oh, and the Euro sign. The Latin-X sets are all ISO-8859 standard sets. More details here.

Posted by Dan at October 14, 2003 01:29 PM

Comments

Great post. I'm currently working on internationalization where I work (a search-engine company), so I know exactly where you're coming from. The only thing I'd disagree with is the idea that nobody needs this: some applications do have to be able to represent all text, even if they can't do anything with it other than spit it back out.

Chinese is very similar to Japanese in that detecting word breaks typically requires a dictionary. Proper names are even more difficult to parse correctly, since the only real way to distinguish them is that 1) they're usually three syllables (i.e. three characters), and 2) they don't make any sense in context. On the other hand, Chinese is easier than Japanese when it comes to conjugation -- Chinese simply has no conjugation at all. It also doesn't really have distinctions between levels of politeness.

Korean, I've recently learned, uses spaces to separate phrases, instead of words. So you still need sophisticated techniques to detect word boundaries.

I second your suggestion that people study a radically different language. I studied Mandarin Chinese for two years, and it has helped a lot. For example, most words in Chinese are two characters, so spelling correction algorithms that only consider "edit distance" perform very poorly. There are simply too many other words that are within a single "edit" of any given word. Algorithms that consider radicals and phonetic pronunciation are really your only hope.

Another interesting thing that you didn't mention is the existence of "fullwidth" versus "halfwidth" versus "wide" characters. Unicode has corresponding fullwidth and halfwidth characters for just about every ASCII character (the regular versions are called "wide"). The fullwidth and halfwidth versions have exactly the same meaning, but they're displayed in a different size font, so that they'll line up correctly with Asian characters. Moral: if you use Unicode, sometimes the same character can be represented using multiple codepoints.
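
For instance (a quick Python sketch using the standard unicodedata module):

    import unicodedata

    wide      = "A"        # U+0041, the ordinary Latin capital A
    fullwidth = "\uFF21"   # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

    print(wide == fullwidth)                                  # False: different code points
    print(unicodedata.normalize("NFKC", fullwidth) == wide)   # True: compatibility-folded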

Posted by: Kimberley Burchett at October 14, 2003 02:13 PM

Oh, yeah, I do agree that some apps do need to represent all the text--I was a bit too heavy-handed on the cynicism there for a moment. Hopefully if you (the general you) are in that position (and you (the specific you), for one, are (I do love English--why couldn't we have stolen two separate words for the second person?)) you'll go get the expertise you need to handle things properly. I've done the search thing and I have an idea of exactly how non-trivial it all can be. Not that I have a solution, just an idea of the size of the problem space. :)

Posted by: Dan at October 14, 2003 02:23 PM

A stimulating discussion, as always, and one that firmly sticks a pin in the overinflated Unicode balloon. I'm no expert in this area, but I do think there are some things that have gotten easier on the application side with Unicode. I've been involved in a site (http://www.library.northwestern.edu/homer) that presents ancient Greek, transliterated Greek (using a Latin alphabet but with diacritics to show vowel length and so on), modern English, and modern German all on the same page. Once upon a time we did the Greek in something called beta code, which is 7-bit ASCII but with adjoining characters combining to form accented characters. Beta code has been around for decades and certainly passes the simplicity test, but it only gives you the glyphs you want if you use a special font that combines certain sets of characters; otherwise it's not really a character encoding at all. For the transliterated Greek we used HTML entity references to represent the long vowel marks and circumflexes. Ditto for umlauts and such in German. The entity references and the beta code both pushed the encoding problem out to the outermost ring of the application and required a lot of messy home-made transformations. It is in fact Unicode that has allowed us at least some degree of getting back to characters as symbols and not having to worry as much about what particular stream of bits represents an omicron with a rough breathing mark. Maybe it's just that what brings simplicity from the application side of things brings pain and suffering to the infrastructure side of things, but in my (extremely limited) experience Unicode can help as long as it's not made into a be-all and end-all mantra.

BTW, English used to have both singular and plural second person pronouns. I'd be happy to explain this to thee (and to you) at greater length, but the gist of it is that people decided simpler was better ;-).

Posted by: Craig Berry at October 14, 2003 05:41 PM

Nice blog, as usual, but I don't really get all the unicode bashing. I for one think it's great to be able to write the name of my home town (which is in Sweden and contains one of those "whatnot" latin1 characters you mentioned) in a mail to my Japanese pals (in Japanese). Latin 1 can't encode kana/kanji, and JIS can't encode the Swedish characters. You are probably right in that most people don't need this, but... really... what's so wrong with Unicode/utf8 that we should continue to use all those language-specific character sets?

Posted by: 隼へんりく at October 15, 2003 05:49 AM

Well, some places still maintain a distinction between you singular and you plural: y'all.

-ben

Posted by: Ben Bennett at October 15, 2003 09:03 AM

It's not that I dislike Unicode, as such--I really don't. Neither do I see any reason to drop language-specific character sets. (Which I should go on about at some length later, so I will)

Much of the use of Unicode people see, and want, is what you're doing--treating it essentially as a set of short-hand images. You'd be as well served including GIFs instead, and odds are the programs that are dealing with the text can do about as much with it as they could with GIFs. (Odds are they could actually do better with the GIFs--if you were including Hebrew or another right-to-left language, I'd bet the mail programs would get it wrong)

That's the big problem. People see Unicode as a collection of character images. It isn't. Or, rather, it is (and that's pretty much all it claims to be) but there's a lot more to text than just character images!

The point of a character set is to encode text, and text is a damn sight more than just a series of pictograms in order--if that's all you treat it as, then not only are you doomed to fail in the implementation of your text system, but you also do a huge disservice to the people who try to use the programs you write. They can enter the characters, correctly if they happen to go in the same direction as the native writing system, but then they can't do anything with those characters, because nobody's going to provide universal proper text handling. (ICU, IBM's Unicode library, provides what I'd consider the minimum necessary support for basic Unicode handling, and it's 2.5M without any character data. That's with no language support to speak of) If you want a comparison, programs can't even manage to handle the subset of languages that Latin-1 encodes properly, and all those languages share a set of common properties.

What Unicode support gets most programs is the twin abilities to display a larger set of pictures that they have no idea what to do with, and to horribly mangle text in far more languages than they can now.

No offense, but I don't see the point.

Posted by: Dan at October 15, 2003 09:46 AM

My take on Unicode is that it's not better than any single other character encoding system, but that it's better than the group of all the other character encoding systems.

That is, if you can assert "the data in here is in Unicode", then you get out of having to deal with the angry array of the other multibyte character sets (except possibly to the extent that you might need to use a library to convert between each of those and Unicode). So, for example, you don't have to have one RE engine for Big5, another for JIS, and so on.
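
A rough sketch of the payoff (Python; the particular encodings and strings are just examples):

    import re

    # Pretend these byte strings arrived from Big5 and Shift-JIS sources.
    big5_bytes = "台北".encode("big5")
    sjis_bytes = "東京".encode("shift_jis")

    # Decode once, at the boundary...
    texts = [big5_bytes.decode("big5"), sjis_bytes.decode("shift_jis")]

    # ...and from then on a single regex engine covers everything.
    pattern = re.compile("[台東]")
    print([bool(pattern.search(t)) for t in texts])   # [True, True]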

No offense, but I don't see the point of not seeing the point.

Posted by: Sean M. Burke at October 16, 2003 02:11 AM

Interesting ideas, and I can appreciate the point of view, especially the "if you don't understand it, don't use it" stance.

There are applications where Unicode *is* the answer; one I was introduced to just recently was multilingual translation. UTF-16 is just about standard for Bible translators these days, as they need the same set of tools[1] to be able to function on English as on Chinese, Japanese, and just about every other language in the world.

For some, there is a point.

[1] The ones I saw were mostly Python, for format conversion, i.e. Bible chapter/verse markup (a TeXish language) to PageMaker etc.

Posted by: Stephen Thorne at October 17, 2003 02:02 AM

I think the biggest difference between using GIFs and Unicode/utf8 text is that it's possible to copy-paste the text, look it up in a dictionary, and, in the case of Japanese text, convert kanji back to kana (very useful when you're not so sure about the pronunciation). Really, I don't see the point of making a comparison between GIFs and Unicode text. ;)

You are right that Evolution can't handle right-to-left text such as Hebrew (at least I'm not able to type it in). However, it wouldn't be able to handle this regardless of the encoding used. So why not just use unicode/utf8 then? Also, right-to-left works fine for me in almost any other gnome2 app for any kind of characters. Anyway, I kinda like the ability for my apps to mangle all kinds of text. It beats having to remember which files were JIS, EUC, or latin1 encoded. (Or, in the olden days, DOS English codepage, DOS Nordic codepage, or Windows, etc...)

Posted by: Henrik Falck at October 17, 2003 06:32 PM

Ben, my experience is that "y'all" can be used as both singular and plural. I've been addressed as "y'all" when I've been alone, and when I've been with friends.

Posted by: Chris at October 18, 2003 12:58 AM

The big advantage of Unicode is that it unifies the mess of encodings. It really helps programs that don't need to look inside the strings. Many programs just need to deal with strings as blocks of characters. They don't need to do parsing or display. The characters are input from somewhere that knows about all the nasty details, like web browsers or text boxes. The strings get stored in databases, read from files, and sent over the network. They pass display off to a web browser or text canvas that handles the fonts and layout.

Posted by: Ian Burrell at October 20, 2003 03:37 PM

I've worked on banking software that has to handle financial messages in 50 different encodings, and it's a huge pain, particularly when you have to worry about configuring the right Arabic encoding on 5 different pieces of software on 3 platforms, and the client is complaining that their text is coming out garbled. In that situation, life would be *so* much easier if all encodings except Unicode would simply disappear from the face of the earth.

Posted by: Ben Geer at October 27, 2003 11:33 AM