October 11, 2003

What the heck is: A string

And no, it's not nearly as stupid a question as it may seem.

This time out, we're going to talk about string data, what it means, how its interpreted, and odds are how much stuff you've never had to think about with it. Strings are remarkably complex things. Remarkably confusing, too, if you're mono-lingual in a language with a simple base writing system. (Like, say, me)

A "string", for our purposes, is a series of bits that represents some piece of text in a human language. What you're reading now is text, as is the stuff in your phone book, the heiroglyphics on Egyptian tombs, and the chinese characters for "Ignorant American" embroidered on that faux-Tao t-shirt you like to wear. (Important safety tip--never get a tattoo in a language you don't understand. "Baka" is not Japanese for "good fortune") Strings are how this text is represented inside a computer.

What makes strings potentially complex is the wide range of ways that human writing systems have developed over the millennia. Writing has generally been governed by the capabilities of pen or brush, ink, and paper, which are very different than the capabilities of computers. Because of this, there are writing systems where there are optional or semi-optional notational marks on characters (as in most western european languages) thousands of unique characters (such as Chinese), and writing systems where the shape of characters varies depending on preceding and succeeding characters (like Arabic). Even if we skip how the strings are actually rendered on screen or paper (which takes away the issues of having enough dots to legibly render those thousands of characters, or figure out which direction the text should go, or what shapes the letters should be based on their context) there are plenty of issues to deal with.

Computers, of course, don't handle stuff like this at all well. They're best suited for linear sequences of bits and bytes, the simpler the better. Because of this, many different schemes for representing text have been developed over the years. These schemes have been shaped both by the complexity (or simplicity) of the writing systems the represent and by the limitations of the machines when the schemes were invented. All the schemes have common characteristics, though, so we can talk about them all in the abstract, which is nice.

Computers, ultimately, represent text as a series of abstract integers, and those abstract integers are represented on disk or in memory as a series of bits. The integers are generally called characters, with all the characters for a language taken together called a character set. Because people tend to think that individual characters represent single printed things, Unicode prefers to call them code points, and so do I, so we will. Just be aware that in some circumstances a single code point isn't enough to figure out what glyph should be printed.

Glyphs, on the other hand, are the printed representation of the smallest unit of text in a language. You normally need one or more code points to figure out what a glyph is, though it's not always quite that easy. (Arabic seems particularly odd in this respect, where multiple code points translate to multiple glyphs, but it's not an N code point-> 1 glyph transformation, more an N code point -> M glyphs)

Finally, how those bits and bytes get turned into abstract integer code points is called the encoding. You might think that each individual character set has its own encoding, or that each of those abstract integers is represented by the same number of bits on disk, but you'd be wrong there--life's not nearly that simple.

So, we've got an encoding, which tells us how to turn bits to code points, we've code points, which are abstract integers, and we've glyphs, which are the smallest unit of text in a language, more or less. With those things we have enough information to turn bits into text on the screen.1

Before we go further and really confuse things, let's take a look at a simple example you're probably very familiar with--ASCII. ASCII has a very simple encoding. Each code point is exactly 8 bits, of which 7 are meaningful and one is padding. (Yes, technically bit 8 may be used for parity, but it's been a long time since anyone's cared even on serial lines where it was mildly useful) All code points fit nicely into an 8-bit byte, or octet.2 We'd call this an 8-bit, fixed-length encoding. (I like those, personally--they're simple)

Translating the code points to glyphs is straightforward enough, and you're probably familiar with the ASCII character table. You know the one, it's a big grid with the characters and their numeric values on it. A is 65, B is 66, and so on. Now, there's actually a lot of hidden semantics in there, but we'll deal with those later. It's important to note that, when people talk about ASCII, they're talking about both the ASCII character set and its the encoding.

EBCDIC is another encoding/character set combo you're probably familiar with, or at least heard of. Like ASCII, it uses an 8-bit fixed length encoding, but the letters and numbers map to different integers. A is 192, B is 193, and so on. (More or less--there are a number of EBCDIC character sets) Once again, reasonably simple and straightforward.

A third encoding/character set combo is RAD-50. (A blast from the past for you) This is a set that has 50 (octal) characters in it, and it packs three characters into a 16-bit word. It's a restricted character set, having only the 26 english upper-case letters, the digits 0-9, $, _, and two other characters that I forget at the moment. Each character takes 5 1/3 bits, more or less, and there's no way to tease out the characters without doing some math on the word.3 The character set, then, are the integers from 0 to 39 (50 octal is 40 decimal), while the encoding requires doing some multiplication, division, and/or modulus depending on which character in the stream you were dealing with, and decoding the characters is position-dependent. RAD-50, honestly, is a pain.

Variable-width encodings are ones where there isn't a fixed number of bits. (RAD-50, oddly, is a fixed-width encoding. It just isn't encoded with an integer number of bits...) There are a lot of different ways to handle variable-width encodings, but the two common ways are to pack the data into a series of bytes and to use escape-bytes or marker bits to indicate that you should keep on going.

Escape bytes are ones where, if they occur, indicate that the byte that follows is part of the character. That means some code points are one byte, some are two. (in really extreme cases some are three) The escape bytes note that the byte following is part of the code point. There may be one escape byte in the set, in which case you get 511 code points, or N escape bytes in which case you get (256-N) + (256*N) code points. And a fair amount of inconvenience, but that's secondary. Most encoding/charset combos that have escape characters start out with a base character set (usually, though not always, ASCII or an 8-bit extended ASCII) and make all the unused code points escape code points.4 For example, with Shift-JIS (one of the ways to encode Japanese characters) bytes 0x80-0x9F and 0xE0-0xEF are escape bytes, and note that the following byte is part of the code point.

Marker bits are similar, but rather than saying "Codes x-y indicate an escape byte", you say "if bit(s) x (and maybe y) are in some pattern, the next byte is part of the code point", and you build up the final character value from parts of the bytes. Unicode's UTF-8 encoding does this--with it you can encode integers of up to 32 bits in a series of bytes, from 1 to 6 bytes, depending on the integer you're encoding. (The bit encoding's a little odd--if you're interested, this is documented in chapter 3 of the 3.x Unicode standard)

So, anyway, fixed-length encoding and either escape-byte or escape-bit variable length encoding. That's how you turn integers to bytes in a bytestream, or vice versa. Personally I prefer fixed-length encodings, though there are good reasons to use variable-width encodings in data files. (Like, for example, the fact that pure 7-bit ASCII text is also valid Shift-JIS or UTF-8 encoded data. And JIS-X-20x or Unicode characters. But that's in a minute)

Once you run the bytestream through the encoding you get a stream of code points--these abstract integers that more or less represent individual letters/symbols/iconographs/whatever. Is that enough to get a glyph you can display on screen? No! Of course not, it's never that easy. Or, rather, it's not that easy for some sets, and is that easy for others. For many character sets, usually the single-language sets such as ASCII or the JIS sets, there's a one-to-one mapping of code points to glyphs, and each glyph has a unique code point assigned to it. For the multi-language sets, especially Unicode, things are more complex. Since Unicode's the common multi-language set, we'll take that one specifically.

Uncode took on the task of building a single character set that encode all the world's5 written languages. This is a massive task, made more massive by the fact that some written languages don't really have the concept of single, individual glyphs, and others build a single glyph out of multiple parts. (Hangul and some of the Indic scripts apparently do this) Actually having all the legal combinations in the character set is infeasable6 so Unicode introduces the concept of combining characters. These are modifier characters that alter the code point that precedes them in the stream of code points, making that character different.

A nice, simple example is the accent character. This is a mark that indicates where the stress goes in a word, like the one over the i in naíve. Unicode has a combining accent character that puts an accent over the previous non-combining character. So for an accented I, the sequence (in Unicode character names) is LOWERCASE I, COMBINING ACUTE ACCENT. Just for fun, Unicode also has a single character, LOWERCASE I WITH ACUTE ACCENT, that also represents í, which can make for some fun. The two sequences are, according to the Unicode standard, identical, which leads to some interesting code, but Unicode deserves its own WTHI entry so we won't go into any more detail here. Just be aware that some glyphs on screen may be composed of multiple code points, and sometimes there are multiple ways to represent the same glyph that should be treated identically.

Note that you can also mix and match encodings and character sets. As long as the abstract integers that an encoding encodes are large enough to hold the code points for whatever character set you have, you're fine. That means that you could encode JIS-X-20x characters using UTF-8, or ASCII characters in UTF-32. People generally don't, mainly for historical reasons, but there's no reason not to. (This can actually simplify things if you choose to use UTF-32, which is just a 32 bit fixed length integer encoding, for all your data regardless of what character set it comes from, but that's up to you)

So, anyway, we can go from bits on disk to glyphs on screen. Good enough for strings?

Hah! Not by a long shot. Would that it were so easy. And no, Unicode doesn't solve all the problems.

Just being able to display a string, or at least puzzle out the glyphs in the string, isn't enough to properly work with a string, at least not programmatically. For example, how do you uppercase the first letter in élan? Is it Élan, or Elan? and how does it sort? Does é sort before or after a plain e? And if you have a string of chinese characters, where are the word breaks?7 The answer is... it depends. It depends on the language the string comes from, because different languages have different rules. Chinese, Japanese, and Korean all use chinese characters, but how they use them is different and where word breaks are vary. What happens to accented characters depends on which European language you're working with.

Sometimes you can ignore language issues--for example any sort of binary string operation will likely just choose one rule. (Otherwise how would you decide which ordering rule to use if the two strings you're comparing have different and conflicting rules?) Other times you can, but really shouldn't, ignore the rules. When uppercasing a string, for example, in those languages where there's even a concept of case, you should respect the rules of the language the string came from.

So, to truly handle string data, your program needs to attach an encoding, character set, and language to each string, or at least have a set of defaults. (Often in multilingual programs everything is transformed to the Unicode character set with either a UTF-8 or UTF-16 encoding, though transforming to Unicode's not always lossless, depending on the unicode version)

What does this all mean to you, the programmer? Well, that depends. If you're working with text in a single language, not much, as you've got a single set of rules that you likely never even thought about. Even if you're using Unicode (because, say, you're doing XML work) it still doesn't necessarily mean much, because even though you're using Unicode you're probably only encoding a single language, or at worst encoding text from multiple languages but not manipulating it so it doesn't matter. If you're doing real multilingual text manipulation, though, it means there's one more attribute to text than you probably thought and, while you can sometimes ignore it, you can't always ignore it.

After all, this is text in human languages, and the computer ought to do what we need, rather than us doing what the computer needs.

1 Note that this isn't sufficient to manipulate the text. We'll get to that later.
2 Yes, technically a byte might be more or less than 8 bits, but when was the last time you worked on anything else?
3 This, by the way, is why DEC operating systems traditionally did text in multiples of 3 characters with a very restricted character set--everything's RAD-50 encoded for space reasons. Seems silly now, but when your machine has only 128K of memory total for a multiuser system, every bit counts.
4 And if they run out of escape characters and still need more extra characters, they start yanking things out of the base set and throwing them into an extension plane.
5 And some other world's written languages. cf Klingon and Tengwar. Yes, I know Klingon's been rejected and Tengwar probably will be but, like so many other things rejected by the sensible,de facto wins...
6 Combinatorial explosion soon gets you a character set with a few hundred billion characters. And while it's possible to have a 64-bit range for characters, well.. that's just a little nuts. If nothing else, imagine the size of the printed Unicode standard with full character sets!
7 Word breaks are a fuzzy enough thing in languages that use chinese characters anyway--you've got to use a dictionary, some heuristics, a bit of luck, and some random numbers some days.

Posted by Dan at October 11, 2003 02:32 PM | TrackBack (2)

And I thought I had some form of understanding of how Unicode worked before this. Oh well.

As always, Dan, a great WTHI post.

Posted by: Brett at October 11, 2003 09:57 PM

Things aren't quite so bad if you restrict yourself to Unicode, though it's got its own interesting sets of quirks. (The combining characters and canonical composition/decomposition rules come to mind) If you restrict yourself to one single language, or at most two languages, in your data it's also much simpler. Simpler as well if you take the rather common "I shall do it my way and I don't care what your language's rules are" approach.

I also forgot to mention one of the really common complications with spanish text, the double-l. (As in Llama) It's supposed to be treated as a single character and sort as if it were one character rather than two. A better example than the accent. (Or it would be if at least some of the spanish-speaking countries hadn't wussed out and changed the rules of their language to accomodate computers...)

Posted by: Dan at October 11, 2003 10:29 PM

I think that strings are a sufficiently complicated thing (and that was a great summary of why, Dan!) that the runtime really shouldn't try to represent them. In fact, in some cases they are sufficiently complicated that I don't think *one* library is enough to represent them completely. A huge annoyance with Java is its subtly incorrect and pervasive use of Unicode.

For example, displaying encoding-based hints for differences between JIS kanji characters and BIG5 characters is not something I want to have to deal with in the layer of the application that sends text to a GTK text widget. (This is my favorite example of string-related problems ever, because it's such a nice encapsulation of things that unicode got wrong.) I'd like higher level logic in the application which can use colors and sizes and not try to abstract the string away behind an opaque wall.

What I'd want a runtime to do, really, isn't strings at all, but a very fast, mutable and immutable fixed-number-of-bits integer array (8- and 32-bit versions being particularly useful). On top of that there are a lot of different tricks you can pull for encoding/decoding human text in particular contexts.

Posted by: Glyph Lefkowitz at October 12, 2003 06:25 AM

The complexity of strings is the reason why the runtime should represent them. If it doesn't then the task of properly managing that complexity is pushed off to the application programmers, and they'll generally get it somewhat wrong, or push it off to libraries which are awkwardly implemented because they're not truly part of the runtime. (And, odds are, somewhat wrong as well, a common theme with encodings more complex than 7-bit ASCII)

That doesn't mean that the runtime should impose a One True Encoding and One True Character Set, since that seems to go horribly wrong in too many cases, perhaps in part because the people designing and implementing them aren't the people using them. (Or if they are, they're only using them in the "Darn those furriners and their squiggly writing" sense)

Posted by: Dan at October 12, 2003 02:57 PM

Arabic seems particularly odd in this respect, where multiple code points translate to multiple glyphs, but it's not an N code point-> 1 glyph transformation, more an N code point -> M glyphs

I took some Arabic in college and at the time didn't find the script terribly confusing. Having a different form for initial, medial, and final versions of a character seemed natural. After all, many kinds of english script are written with overly ornate initial and final characters in a word.

But this brought to mind a different article I read recently (from Fark? Slashdot? TorgoX's Journal? the URL escapes me) which was promoting the idea that english was largely legible if you kept the initial character, final character, and length of the word intact and used it in an appropriately large context.

Looking back at Arabic then, it's mostly the medial charcters that are stunted versions of their bookend bretheren. The beginnings and endings of words are what really stand out. Maybe arabic script is simply an efficient form of hand-written glyph compression?

Posted by: Clinton at October 14, 2003 12:48 PM

Joel Spolsky has a recent article on this too...


Posted by: Ben Bennett at October 14, 2003 04:18 PM

Where to get the Unicode?
There is software to extract text from any printable documents and save to disk.
It supports many codepages and the Unicode.
It is Miraplacid Text Driver
Besides, you can save a doc as Unicode from Microsoft Word.
Save as - save as type: encoded text, then choose Unicode in the listbox

Posted by: cotty at March 1, 2004 12:36 PM

heck nu met het programa je bent een cresie piep men co

Posted by: bart at March 31, 2004 03:03 AM