October 29, 2003

It's alive!

I decided to do the only sensible thing with Forth strings: dodge the whole damn question. I implemented a p" word instead, which puts a Parrot string on the stack when the word it's compiled into is executed. With that, and a few other bits of jiggery-pokery, I was good to go. And I do mean good--this code:


: loadcurses p" library/ncurses.pasm " loadpasm ;
: initscr p" ncurses::initscr " findglobal 0 preg invoke resultP ;
loadcurses
initscr

will, in Parrot Forth (when executed from the main parrot directory at least), load up the ncurses interface library and call the initscr function in it. (Which will lock up your terminal session, but that's a sign that it works!)

The next step, of course, is a full Forth-based Life program, as that's the inevitable demo. (It's the Programmer Magpie Effect :)

Posted by Dan at 03:49 PM | Comments (1) | TrackBack

October 28, 2003

The first big Panther Gotcha

Well, I broke down and bought Panther. It's pretty darned cool and seems snappier for many things. And, of course, Exposé is damned cool, triggering my magpie tendencies. I do like all the rounded and shadow stuff, and the fact that the crappy metal can be turned off in the Finder windows makes me very happy.

Up until now, I've not had any huge issues with it. I had to relink Emacs, but that's no big surprise--I have to do that after every sub-point release anyway. Got me why, but I just keep the source around. I should probably resync from CVS and see if that's fixed, but this version works so I'm hesitant to break it, or toss out the 500M or so of stuff to make room for another full rebuild. CamelBones broke as well, which puts something of a damper on working on the infamous book, but it's not like I'm moving too fast there anyway. But...

Powerpoint is only semi-functional. "So what?" you cry, "It's a Microsoft product and they're icky! Keynote is better!" While that may be true, I own Powerpoint, which is more than I can say for Keynote. This wouldn't be so much of a problem if I wasn't presenting next Saturday at the little languages workshop. (Or if I actually had the presentation written....)

Ah, well, hopefully next time I'm in network range I'll find there's an update. It's still mostly functional, so there's not that much of an excuse to slack off....

Posted by Dan at 08:27 PM | Comments (4) | TrackBack

Running headlong into an impedance mismatch

So, I've been thumping away at parrot's forth implementation, and it's been going pretty well. A bunch of control structures are in (do/loop, do/+loop, begin/again, begin/until, if/then), some horribly insecure stuff has been added (' and compile,), and since Forth can live in its own little world I added in cell space for "address"-based storage. In this case "address" and "offset into the cell array" are identical, but that's OK -- the data space, at least according to the standard, doesn't have to correspond to anything else, just be accessible. Which it is.

Unfortunately once the cell space comes in, we get into string issues, since the cell area and strings start getting intertwined.

Forth mandates counted strings. And, at least as far as I can figure, it also assumes that each cell in the cell space can hold a single character. Except... for us, it doesn't have to. I can stuff a whole string into a single cell in cell space, and honestly I'd prefer to do that if I can, and take advantage of our built-in string handling code. (We're already counted, and if I don't have to see someone write a UTF-8 decoder in Forth I think I'd count myself lucky) So I've two alternatives. I can leave Forth as it is, with strings being a series of integer characters with a prepended length word (so the string "Cat" would take up 4 cells), or I can make strings atomic things, which'd make integration with Parrot much nicer but break the Forth standard.
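
Just to make the tradeoff concrete, here's a quick C sketch of the two layouts--an int stands in for a Forth cell, and the variable names are made up:

#include <stdio.h>

int main(void)
{
    /* Option 1: the ANS-style counted string--one character per cell plus
       a leading length cell, so "Cat" eats four cells of data space. */
    int counted_cat[] = { 3, 'C', 'a', 't' };

    /* Option 2: the whole string in a single atomic cell. A plain C string
       pointer stands in here for whatever Parrot string the cell would
       really hold. */
    const char *atomic_cat = "Cat";

    printf("counted: length %d, first char %c\n", counted_cat[0], counted_cat[1]);
    printf("atomic: %s\n", atomic_cat);
    return 0;
}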

Bah. I should do it both ways, just because...

Posted by Dan at 04:38 PM | Comments (6) | TrackBack

October 27, 2003

forth love if honk then

While I ought to detail, in detail, why I don't really hate Unicode, just some of the uses people make of it... instead I've been hacking at the forth implementation in Parrot.

Still integer-only, and it has a single mixed stack (so when ints and strings are put in they'll all coexist on the stack), but Parrot Forth now has working if/then, begin/again, and begin/until control structures. Oh, and when you compile words you really compile words--it generates bytecode on the fly for compiled words.

Woo! (and, I might add, Hoo!)

Now to go wedge in an interface to Parrot's calling conventions so we can write ncurses_life.forth to go with ncurses_life.imc. That's probably the sign of some Apocalypse or other though not, alas, a Perl 6 apocalypse. (When I add in the object extensions to Forth, then expect something big)

Posted by Dan at 01:58 PM | Comments (3) | TrackBack

October 24, 2003

LL3 is on the way

Mark your calendars, it's Saturday November 8, 2003 in Cambridge (our fair city), MA. We've set the schedule, which should hit the web soon.

It looks like it should be a darned good time, with a batch of interesting talks. Oh, and robots--this year we've got a language/robot talk, which should be really cool. (So if you hear reports of MechaGodzilla climbing the Pru you know I'm partially responsible :)

And yeah, this year I got my act together in time and I'm giving the last talk of the day. Bring the pie and brick-bats, whatever the heck they are.

Posted by Dan at 10:14 AM | Comments (1) | TrackBack

October 23, 2003

Parrot Forth

Or, "Fifth", I suppose, since it's not quite Forth. As I said earlier, I'm working on Parrot's Forth implementation. It was originally written by Jeff Goff, and the core (what I consider most of the tough bits) is done and has been working for ages, just nobody noticed. The plan (subject to change if it doesn't pan out) is to use Parrot as the core engine for the language the big Work App uses. (The current engine is old and has a number of issues, not the least of which is a really primitive syntax and a simple integrated ISAM database with limits that we're hitting every week or three)

This project's actually coming along nicely--I've a compiler of sorts for the language that'll translate it into perl code, and we'll use that as a fallback plan if need be--and I should be able to start emitting PIR code (Parrot's intermediate representation, the stuff we feed into IMCC) with about a week's more work. Unfortunately that's not nearly good enough to actually do anything, since most of the interesting functions of this language live in its runtime--specifically its screen and database handling.

I've got code that converts current-format databases over to Postgres databases, complete with triggers and views to preserve the ISAM flavor, and Parrot has interface libraries for both ncurses and Postgres. What I don't have is the library code that'll get between the compiled code and raw ncurses and Postgres, to make sure the semantics of the current language are preserved without having to emit great gobs of assembly for each statement compiled. I could write that library code in assembly, and Parrot assembly is awfully nice as assemblies go, but still... Don't think so.

The sensible thing, then, is to grab a language that compiles to parrot and use that. I could use the language I'm writing the compiler for but, let's be honest, if it was good enough to write that sort of library code I wouldn't have to be writing a compiler to retarget the damn thing. (Well, OK, I would, as the database part of the code is still running us into walls, but the language makes COBOL look sophisticated)

Parrot's got a number of partial and full languages that compile to it, but throwing away the gag languages (besides, Befunge doesn't support Parrot's calling conventions) it's down to either a nice compiled Basic or Forth and, for a number of reasons, I chose Forth. It's simple, I like it (I like Basic too, FWIW), and expanding it isn't a big deal for me, unlike with Basic, at least our current Basic implementation. (Which is nicely done, thanks to Clint Pierce, but the code requires more thought than my gnat-like attention span can muster at the moment)

Now, the current Forth, as it stands, is only a partial implementation, with the lack of control flow its biggest issue. I've been throwing new core words into it all day, in between handling the fallout from Parrot's directory restructuring today. It's dead-easy, and with the cool assemble-and-go bits of parrot (no need to even assemble the .pasm file, just feed it straight into parrot and you're good) there aren't even separate compile and run phases. Can't get much easier than that. I snagged a draft copy of the ANS Forth standard (Draft 6, from 1993, so it's not exactly up to date, but I don't have Starting Forth handy) and have been going for it with some glee.

With it working, at least partially, there comes the urge to tinker. Meddle if you will, and alter the essential Forth-ness of the implementation. Having a combined int/float/string/PMC stack, rather than separate stacks, is the first big urge. Having strings as atomic data (rather than addresses and counts on the stack) is a second urge. Adding in OO features is the third. (OO Forth, after all, is less bizarre than an OO assembly) And integrating into Parrot's calling conventions is a fourth. I think... I think I may well do them all.

While I'm at it, I think I may well redo its compilation stage as well. Right now all the words with backing assembly just dispatch right to them, while the user-defined words are kept as a string, as if the user typed them in, and re-parsed and interpreted each time. Which is a clever way to do things that gets you up and running quickly, but as Parrot has the capability to generate bytecode on the fly, well... I think I might build a full-fledged compiler for this stuff. Which should also take very little time, be very compact, and be awfully forth-y. We'll see what tomorrow brings.

Posted by Dan at 08:42 PM | Comments (4) | TrackBack

Language hacking for fun 'n profit

Well, fun at least.

You may or may not know, but Parrot's got a simple but functional Forth implementation as part of it. Nothing fancy, just the base line scanner and math functions, but the compiler works so you can do things like 10 20 + . and get 30 as you'd expect, or : GIMME_30 10 20 + . ; if you wanted to package it up as a new word.

Anyway, I need a Parrot language a bit higher-level than plain assembly for work, and if anything counts as "a bit higher level than assembly" it's Forth. Heck, the standard doesn't even require floating point numbers. Or integers larger than 16 bits as the base cell size, for that matter. (Though 32 bit integers are required to work, so you have to fake 'em if they aren't there) So, since I've been fond of Forth forever I figured it's time to go extend the thing and add in the missing bits.

Which, it turns out, is (at least for the non-control-flow words) darned easy and really compact. End result is maybe a dozen or so opcode_t cells for most of these things, which, honestly, is just damned cool. Currently the forth implementation unrolls user-defined words to a sequence of primitives and then executes the primitives, but I think I may see about generating bytecode on the fly with the built-in assembly functionality. That'd be damned cool too. :)

Posted by Dan at 10:37 AM | Comments (10) | TrackBack

October 17, 2003

Words to live by

A man said to the universe: "Sir, I exist!"
"However," replied the universe,
"The fact has not created in me
A sense of obligation."

-- Stephen Crane (1899)
Posted by Dan at 03:59 PM | Comments (0) | TrackBack

October 14, 2003

Strings, some practical advice

Now that I've gone off a bit about the structure of strings, you're probably wondering what exactly it's all good for. Knowing how things work doesn't necessarily translate to any sort of utility, at least not immediately. (I think it's essential for real understanding, but real understanding is in short supply on so many things...) But some good advice is in order.

Like so many things I go on about, keep in mind that this is from the standpoint of a VM designer/implementor--I understand and do infrastructure. I have precious little experience implementing applications, unless you consider something like Parrot an application. (Though in this realm I suppose it is, of sorts) I don't know how much of this will be useful, but even if all you come away with is a sense of bitter cynicism about text handling then I'll consider this worthwhile.

First piece of advice: Lose the conceit that Unicode Is The One True Answer. It isn't. It isn't even One Of Many True Answers. It is, at best, a partial solution to a problem that you probably don't even have. A useful partial solution (unlike others, not that I'd mention XML here... :) to be sure, but that's all it is. That's all it attempts to be, and it generally succeeds, which is fine.

The problem Unicode shoots for is to build a single character set that can encode text in all the world's languages simultaneously. Not make it easy to manipulate, just to represent. That's it. Do you need to do that? (Really? Really?) Odds are the number of people who do is vanishingly small, and almost all of us are living on the other side of the application divide. At best most people and applications need to handle stuff in the 7-bit ASCII range, possibly the full Latin-1 or Latin-9 set (with its accented characters and whatnot), and maybe a single language-specific set of characters. Unicode is, as a practical matter, unnecessary for most folks. Latin 1 or Latin 9 works for Western Europe, and North and South America, China's served by GB Simplified, Taiwan by Big-5 Traditional (or vice versa, I can never remember), Japan by one of the JIS standards, and Korea by a KOR encoding. A good chunk of India may well be served by Latin-9 as well. That's about 4 billion people, give or take a few, who don't need Unicode in day-to-day computing.

Yeah, I know, "But what about data exchange? What about cross-language data?" What about it? Odds are you read English (as you're reading this) and at best one other language. You wouldn't know what the hell to do with Japanese/Chinese/Korean/Hebrew/Arabic/Cyrillic text if it was handed to you, so what good is being able to represent it? I mean, great, you use Unicode and you're in a position to take a vast number of characters that you have no clue what to do with. Swell. You now have a huge range of text to badly mishandle.

Second piece of advice: Try not to care about how your text is represented. It's likely your program really doesn't have to care.

Really. Think about it. When was the last time you had to actually care that a capital A has an ASCII value of 65? The overwhelming majority of the time you're dealing with character data symbolically or abstractly. You want to know if the string "ABC" is in your string data, or you want to make it uppercase, or you want to sort it. None of that needs to know, or care, that A is 65, and a good chunk of the code that does know is either busted or is (or ought to be) library code rather than application code. The string library code does need to know, but you don't. Well, unless you're writing low-level library code but I'd bet you aren't.

Yes, I know, there are languages that sort of force this on you if you use the native string (or, in the case of C, "string") types, but even then you usually don't have to know, and often you're better off paying as little attention as possible to the way the base string type is represented, as it likely sucks and is often depressingly broken.

I say this as someone who does have to make strings work right. It's a lot of work, sometimes bizarre and unusual work, filled with more edge cases than a Seurat that someone's played connect-the-dots on. Leave as much of the work to the libraries as you can. It won't be enough--it never is--but it'll make things less bad.

Third piece of advice: Don't deal with string data you don't understand. That means that if you don't understand Japanese, don't let Japanese text get in. No, I'm not saying this because I'm elitist or I think you should live in linguistic isolation, but be realistic. If your application is designed to handle, say, personnel data for Western European companies, then you're probably fine with Latin 9, and using it restricts your input such that there's far less invalid data that can get in. If you're writing a Japanese text processing system, then one of the JIS sets is the right one for you.

In those (let's be honest, really rare) cases where you think taking this piece of advice will turn out to be restrictive, well, then take piece #2 and use an abstract character type and restrict it to the set you really need to begin with. You can open it up later if you need to.

Fourth piece of advice: Make your assumptions explicit! That means if you're assuming you're getting english (or spanish, or french, or vietnamese) text, then don't just assume it--specify it, attach it to your string data, and have checks for it, even if they're just stubs. That way, when you do open things up, you've a reasonable chance of handling things properly. Or at least less wrong.
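
If you want something concrete to hang that on, here's a rough C sketch of what attaching the assumptions to your string data might look like. The field names and the check are made up--the point is just that the charset and language become explicit, checkable data instead of folklore:

#include <stdio.h>
#include <string.h>

/* A string that carries its assumptions around with it. All the names
   here are hypothetical--what matters is that charset and language are
   explicit fields, not implicit assumptions. */
struct tagged_string {
    const char *charset;   /* e.g. "Latin-9" */
    const char *language;  /* e.g. "en" */
    const char *bytes;
    size_t      len;
};

/* Stub check: refuse anything we didn't plan for. When you open things
   up later, this is the one place that has to change. */
static int check_assumptions(const struct tagged_string *s)
{
    return strcmp(s->charset, "Latin-9") == 0
        && strcmp(s->language, "en") == 0;
}

int main(void)
{
    struct tagged_string s = { "Latin-9", "en", "hello", 5 };
    printf("assumptions hold? %d\n", check_assumptions(&s));
    return 0;
}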

Fifth piece of advice: Learn a radically different foreign language. And its writing system. For folks coming from a European language background, one of the Asian languages is good, and vice versa. Even if you're never at all fluent, competent, or even haltingly unintelligible, you'll at least get some feel for the variation that's available. Believe me, there's nothing quite like moving from English (with its dead-simple character set and trivial word boundaries) to Japanese (where there are tens of thousands of characters, finding word boundaries requires a dictionary and heuristics, and there are still some valid arguments as to where word boundaries even are). I expect Chinese is similar, and I can't speak for any other language.

Sixth piece of advice: Be very cautious and very diplomatic when handling text. The written word is a fundamental embodiment of a culture and a language. You disparage it, or existing means of encoding it, at your peril. After all, China's had written language for well on six thousand years, give or take a bit--who the heck are you to tell them that they're doing it wrong? Telling someone their language makes no sense or their writing system is stupid (yes, I've heard both) is not a good way to make friends, influence people, or make sure your program works on the whole range of data you've so foolishly decided to take.

Seventh, and final, piece of advice: There is no 30 second solution to make a program handle multiple languages or the full range of Unicode right. (Okay, that's not exactly true. Step 1, bend over. Step 2, grab ankles. You can bet that step 4 isn't profit...) Honestly, don't try. If you're only going to make a cursory or half-hearted attempt at handling the full range of text you're going to accept, I can pretty much guarantee that you'll screw it up more than if you didn't try at all, and screw it up a lot more than if you made a real try at getting it right.

Aren't strings fun? Yes, I'm well aware that a good chunk of the practical advice is "If you don't have to, don't." The corollary to that is if you do have to, make sure you know what you're doing and that you know what you don't know you're doing. Understanding your areas of ignorance (areas you should go find someone who's not ignorant to help you out) will get you much further than you might think.

Update: Yeah, this originally said Latin-1 instead of Latin-9, but it turns out I was wrong there--you can't do French and Finnish in Latin-1. The Eastern European languages don't work with Latin-1 or 9 either, unfortunately, which does sort of argue for Unicode if you need to work across the EU.
Update #2: Latin 9 and Latin 1 are nearly the same--Latin 9 drops some of the less-used characters and replaces them with the missing characters needed for French and Finnish. Oh, and the Euro sign. The Latin-X sets are all ISO-8859 standard sets. More details here

Posted by Dan at 01:29 PM | Comments (12) | TrackBack

October 11, 2003

What the heck is: A string

And no, it's not nearly as stupid a question as it may seem.

This time out, we're going to talk about string data, what it means, how it's interpreted, and odds are how much stuff you've never had to think about with it. Strings are remarkably complex things. Remarkably confusing, too, if you're mono-lingual in a language with a simple base writing system. (Like, say, me)

A "string", for our purposes, is a series of bits that represents some piece of text in a human language. What you're reading now is text, as is the stuff in your phone book, the heiroglyphics on Egyptian tombs, and the chinese characters for "Ignorant American" embroidered on that faux-Tao t-shirt you like to wear. (Important safety tip--never get a tattoo in a language you don't understand. "Baka" is not Japanese for "good fortune") Strings are how this text is represented inside a computer.

What makes strings potentially complex is the wide range of ways that human writing systems have developed over the millennia. Writing has generally been governed by the capabilities of pen or brush, ink, and paper, which are very different from the capabilities of computers. Because of this, there are writing systems with optional or semi-optional notational marks on characters (as in most Western European languages), writing systems with thousands of unique characters (such as Chinese), and writing systems where the shape of characters varies depending on preceding and succeeding characters (like Arabic). Even if we skip how the strings are actually rendered on screen or paper (which takes away the issues of having enough dots to legibly render those thousands of characters, or figure out which direction the text should go, or what shapes the letters should be based on their context) there are plenty of issues to deal with.

Computers, of course, don't handle stuff like this at all well. They're best suited for linear sequences of bits and bytes, the simpler the better. Because of this, many different schemes for representing text have been developed over the years. These schemes have been shaped both by the complexity (or simplicity) of the writing systems they represent and by the limitations of the machines when the schemes were invented. All the schemes have common characteristics, though, so we can talk about them all in the abstract, which is nice.

Computers, ultimately, represent text as a series of abstract integers, and those abstract integers are represented on disk or in memory as a series of bits. The integers are generally called characters, with all the characters for a language taken together called a character set. Because people tend to think that individual characters represent single printed things, Unicode prefers to call them code points, and so do I, so we will. Just be aware that in some circumstances a single code point isn't enough to figure out what glyph should be printed.

Glyphs, on the other hand, are the printed representation of the smallest unit of text in a language. You normally need one or more code points to figure out what a glyph is, though it's not always quite that easy. (Arabic seems particularly odd in this respect, where multiple code points translate to multiple glyphs, but it's not an N code point-> 1 glyph transformation, more an N code point -> M glyphs)

Finally, how those bits and bytes get turned into abstract integer code points is called the encoding. You might think that each individual character set has its own encoding, or that each of those abstract integers is represented by the same number of bits on disk, but you'd be wrong there--life's not nearly that simple.

So, we've got an encoding, which tells us how to turn bits to code points, we've code points, which are abstract integers, and we've glyphs, which are the smallest unit of text in a language, more or less. With those things we have enough information to turn bits into text on the screen.1

Before we go further and really confuse things, let's take a look at a simple example you're probably very familiar with--ASCII. ASCII has a very simple encoding. Each code point is exactly 8 bits, of which 7 are meaningful and one is padding. (Yes, technically bit 8 may be used for parity, but it's been a long time since anyone's cared even on serial lines where it was mildly useful) All code points fit nicely into an 8-bit byte, or octet.2 We'd call this an 8-bit, fixed-length encoding. (I like those, personally--they're simple)

Translating the code points to glyphs is straightforward enough, and you're probably familiar with the ASCII character table. You know the one, it's a big grid with the characters and their numeric values on it. A is 65, B is 66, and so on. Now, there's actually a lot of hidden semantics in there, but we'll deal with those later. It's important to note that, when people talk about ASCII, they're talking about both the ASCII character set and its encoding.

EBCDIC is another encoding/character set combo you're probably familiar with, or at least heard of. Like ASCII, it uses an 8-bit fixed-length encoding, but the letters and numbers map to different integers. A is 193, B is 194, and so on. (More or less--there are a number of EBCDIC character sets) Once again, reasonably simple and straightforward.

A third encoding/character set combo is RAD-50. (A blast from the past for you) This is a set that has 50 (octal) characters in it, and it packs three characters into a 16-bit word. It's a restricted character set, having only the 26 English upper-case letters, the digits 0-9, $, _, and two other characters that I forget at the moment. Each character takes 5 1/3 bits, more or less, and there's no way to tease out the characters without doing some math on the word.3 The character set, then, is the integers from 0 to 39 (50 octal is 40 decimal), while the encoding requires doing some multiplication, division, and/or modulus depending on which character in the stream you're dealing with, and decoding the characters is position-dependent. RAD-50, honestly, is a pain.
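
Just to show how position-dependent it is, here's a rough C sketch of unpacking a 16-bit RAD-50 word--the character table is from memory, so treat it as approximate:

#include <stdio.h>

/* The RAD-50 character table, more or less: space, A-Z, $, ., an unused
   slot, then 0-9. Forty entries, hence the name (40 decimal is 50 octal). */
static const char rad50_chars[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$. 0123456789";

/* Unpack one 16-bit word into its three characters. Which character you
   get depends entirely on where it sits in the word--there are no byte
   boundaries to help you out. */
static void rad50_decode(unsigned int word, char out[4])
{
    out[0] = rad50_chars[(word / 1600) % 40];
    out[1] = rad50_chars[(word / 40) % 40];
    out[2] = rad50_chars[word % 40];
    out[3] = '\0';
}

int main(void)
{
    char buf[4];
    rad50_decode(1683, buf);   /* "ABC" packs as 1*1600 + 2*40 + 3 = 1683 */
    printf("%s\n", buf);
    return 0;
}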

Variable-width encodings are ones where there isn't a fixed number of bits. (RAD-50, oddly, is a fixed-width encoding. It just isn't encoded with an integer number of bits...) There are a lot of different ways to handle variable-width encodings, but the two common ways are to pack the data into a series of bytes and to use escape-bytes or marker bits to indicate that you should keep on going.

Escape bytes are bytes that, when they occur, indicate that the byte that follows is part of the same character. That means some code points are one byte, some are two. (in really extreme cases some are three) There may be one escape byte in the set, in which case you get 511 code points, or N escape bytes, in which case you get (256-N) + (256*N) code points. And a fair amount of inconvenience, but that's secondary. Most encoding/charset combos that have escape characters start out with a base character set (usually, though not always, ASCII or an 8-bit extended ASCII) and make all the unused code points escape code points.4 For example, with Shift-JIS (one of the ways to encode Japanese characters) bytes 0x80-0x9F and 0xE0-0xEF are escape bytes, and note that the following byte is part of the code point.
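
The general shape of a decoder for this sort of thing looks something like the following C sketch, using the byte ranges mentioned above and waving away error handling and the actual character table lookups:

#include <stdio.h>

/* Is this byte an escape (lead) byte, per the ranges mentioned above? */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x80 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

/* Walk a byte stream and print the code points. One byte per code point,
   unless the byte is an escape byte, in which case the next byte is part
   of the same code point. */
static void dump_code_points(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned int cp;
        if (is_lead_byte(buf[i]) && i + 1 < len) {
            cp = (buf[i] << 8) | buf[i + 1];   /* two-byte code point */
            i += 2;
        } else {
            cp = buf[i];                       /* one-byte code point */
            i += 1;
        }
        printf("code point: %u\n", cp);
    }
}

int main(void)
{
    /* "A", then a two-byte sequence starting with the lead byte 0x82 */
    const unsigned char sample[] = { 0x41, 0x82, 0xA0 };
    dump_code_points(sample, sizeof sample);
    return 0;
}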

Marker bits are similar, but rather than saying "Codes x-y indicate an escape byte", you say "if bit(s) x (and maybe y) are in some pattern, the next byte is part of the code point", and you build up the final character value from parts of the bytes. Unicode's UTF-8 encoding does this--with it you can encode integers of up to 31 bits in a series of bytes, from 1 to 6 bytes, depending on the integer you're encoding. (The bit encoding's a little odd--if you're interested, this is documented in chapter 3 of the 3.x Unicode standard)
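
For the curious, here's a small C sketch of the first step of that--reading the lead byte to see how long the sequence is. Assembling the actual code point means masking and shifting the payload bits together, which I'll skip here:

#include <stdio.h>

/* How many bytes long is the UTF-8 sequence starting with this byte? The
   number of leading 1 bits in the lead byte tells you: 0xxxxxxx is a lone
   ASCII byte, 110xxxxx starts a two-byte sequence, 1110xxxx a three-byte
   one, and so on up to the (rarely seen) six-byte form. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    if ((lead & 0xFC) == 0xF8) return 5;
    if ((lead & 0xFE) == 0xFC) return 6;
    return -1;   /* 10xxxxxx is a continuation byte, not a lead byte */
}

int main(void)
{
    printf("%d %d %d\n",
           utf8_sequence_length(0x41),    /* 'A' -> 1 */
           utf8_sequence_length(0xC3),    /* lead of a 2-byte sequence */
           utf8_sequence_length(0xE3));   /* lead of a 3-byte sequence */
    return 0;
}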

So, anyway, fixed-length encoding and either escape-byte or escape-bit variable length encoding. That's how you turn integers to bytes in a bytestream, or vice versa. Personally I prefer fixed-length encodings, though there are good reasons to use variable-width encodings in data files. (Like, for example, the fact that pure 7-bit ASCII text is also valid Shift-JIS or UTF-8 encoded data. And JIS-X-20x or Unicode characters. But that's in a minute)

Once you run the bytestream through the encoding you get a stream of code points--these abstract integers that more or less represent individual letters/symbols/iconographs/whatever. Is that enough to get a glyph you can display on screen? No! Of course not, it's never that easy. Or, rather, it's not that easy for some sets, and is that easy for others. For many character sets, usually the single-language sets such as ASCII or the JIS sets, there's a one-to-one mapping of code points to glyphs, and each glyph has a unique code point assigned to it. For the multi-language sets, especially Unicode, things are more complex. Since Unicode's the common multi-language set, we'll take that one specifically.

Unicode took on the task of building a single character set that can encode all the world's5 written languages. This is a massive task, made more massive by the fact that some written languages don't really have the concept of single, individual glyphs, and others build a single glyph out of multiple parts. (Hangul and some of the Indic scripts apparently do this) Actually having all the legal combinations in the character set is infeasible6 so Unicode introduces the concept of combining characters. These are modifier characters that alter the code point that precedes them in the stream of code points, making that character different.

A nice, simple example is the acute accent character. This is a mark that indicates things like where the stress goes in a word--the one over the e in café, for example. Unicode has a combining accent character that puts an accent over the previous non-combining character. So for an accented i, the sequence (in Unicode character names) is LOWERCASE I, COMBINING ACUTE ACCENT. Just for fun, Unicode also has a single character, LOWERCASE I WITH ACUTE ACCENT, that also represents í, which can make for some fun. The two sequences are, according to the Unicode standard, identical, which leads to some interesting code, but Unicode deserves its own WTHI entry so we won't go into any more detail here. Just be aware that some glyphs on screen may be composed of multiple code points, and sometimes there are multiple ways to represent the same glyph that should be treated identically.
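
In terms of raw code points, the two spellings look like this (a little C sketch, with the values straight from the Unicode charts):

#include <stdio.h>

int main(void)
{
    /* í as a base letter plus a combining mark:
       U+0069 LATIN SMALL LETTER I, U+0301 COMBINING ACUTE ACCENT */
    unsigned int decomposed[] = { 0x0069, 0x0301 };

    /* í as a single precomposed code point:
       U+00ED LATIN SMALL LETTER I WITH ACUTE */
    unsigned int precomposed[] = { 0x00ED };

    /* The two are canonically equivalent per the standard, so string code
       that compares them has to normalize first, or it'll call two
       "identical" strings different. */
    printf("%d code points vs %d code point, same glyph\n",
           (int)(sizeof decomposed / sizeof decomposed[0]),
           (int)(sizeof precomposed / sizeof precomposed[0]));
    return 0;
}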

Note that you can also mix and match encodings and character sets. As long as the abstract integers that an encoding encodes are large enough to hold the code points for whatever character set you have, you're fine. That means that you could encode JIS-X-20x characters using UTF-8, or ASCII characters in UTF-32. People generally don't, mainly for historical reasons, but there's no reason not to. (This can actually simplify things if you choose to use UTF-32, which is just a 32 bit fixed length integer encoding, for all your data regardless of what character set it comes from, but that's up to you)

So, anyway, we can go from bits on disk to glyphs on screen. Good enough for strings?

Hah! Not by a long shot. Would that it were so easy. And no, Unicode doesn't solve all the problems.

Just being able to display a string, or at least puzzle out the glyphs in the string, isn't enough to properly work with a string, at least not programmatically. For example, how do you uppercase the first letter in élan? Is it Élan, or Elan? And how does it sort? Does é sort before or after a plain e? And if you have a string of Chinese characters, where are the word breaks?7 The answer is... it depends. It depends on the language the string comes from, because different languages have different rules. Chinese, Japanese, and Korean all use Chinese characters, but how they use them is different, and where the word breaks fall varies. What happens to accented characters depends on which European language you're working with.

Sometimes you can ignore language issues--for example any sort of binary string operation will likely just choose one rule. (Otherwise how would you decide which ordering rule to use if the two strings you're comparing have different and conflicting rules?) Other times you can, but really shouldn't, ignore the rules. When uppercasing a string, for example, in those languages where there's even a concept of case, you should respect the rules of the language the string came from.

So, to truly handle string data, your program needs to attach an encoding, character set, and language to each string, or at least have a set of defaults. (Often in multilingual programs everything is transformed to the Unicode character set with either a UTF-8 or UTF-16 encoding, though transforming to Unicode's not always lossless, depending on the Unicode version)

What does this all mean to you, the programmer? Well, that depends. If you're working with text in a single language, not much, as you've got a single set of rules that you likely never even thought about. Even if you're using Unicode (because, say, you're doing XML work) it still doesn't necessarily mean much, because even though you're using Unicode you're probably only encoding a single language, or at worst encoding text from multiple languages but not manipulating it so it doesn't matter. If you're doing real multilingual text manipulation, though, it means there's one more attribute to text than you probably thought and, while you can sometimes ignore it, you can't always ignore it.

After all, this is text in human languages, and the computer ought to do what we need, rather than us doing what the computer needs.

1 Note that this isn't sufficient to manipulate the text. We'll get to that later.
2 Yes, technically a byte might be more or less than 8 bits, but when was the last time you worked on anything else?
3 This, by the way, is why DEC operating systems traditionally did text in multiples of 3 characters with a very restricted character set--everything's RAD-50 encoded for space reasons. Seems silly now, but when your machine has only 128K of memory total for a multiuser system, every bit counts.
4 And if they run out of escape characters and still need more extra characters, they start yanking things out of the base set and throwing them into an extension plane.
5 And some other worlds' written languages. cf Klingon and Tengwar. Yes, I know Klingon's been rejected and Tengwar probably will be but, like so many other things rejected by the sensible, de facto wins...
6 Combinatorial explosion soon gets you a character set with a few hundred billion characters. And while it's possible to have a 64-bit range for characters, well... that's just a little nuts. If nothing else, imagine the size of the printed Unicode standard with full character sets!
7 Word breaks are a fuzzy enough thing in languages that use chinese characters anyway--you've got to use a dictionary, some heuristics, a bit of luck, and some random numbers some days.

Posted by Dan at 02:32 PM | Comments (8) | TrackBack

October 09, 2003

Why I hate C, reason #4,348

Varargs. Or, these days, stdargs. It's how C lets you define functions that take a variable number of arguments.

Why do I hate it? Because, even though you're forced to view the argument list opaquely, there's no argument count! That is, the standard says the variable length argument list lives in a magic va_list thingie, you must use the Magic Functions to get the individual arguments out, and you must have the magic ... signature to note varargs in the function declaration, but... the standard can't be bothered to mandate that an argument count is passed. This seems utterly mad.

Yes, I know, it's an extra parameter to be constructed and passed, but the caller knows at compile time how many parameters it's passing in, and that it's passing them in as part of a variable length parameter list, so calculating it is a compile-time cost. The callee has to figure out how many parameters are being passed anyway, either with an explicit count or something like a format string, so it's not like this information isn't available or passed. But rather than just throwing in a count as something that must be there, the standard forces people building or using vararg functions to do it anyway--by hand, which is far more error-prone.
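
For reference, here's the sort of hand-rolled counting the standard leaves you doing--a minimal C sketch, nothing more:

#include <stdarg.h>
#include <stdio.h>

/* Sum a variable number of ints. Nothing in the language passes a count
   for us, so the caller supplies it by hand as the first argument--exactly
   the busywork being griped about above. */
static int sum_ints(int count, ...)
{
    va_list args;
    int i, total = 0;

    va_start(args, count);
    for (i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);

    return total;
}

int main(void)
{
    printf("%d\n", sum_ints(3, 10, 20, 30));   /* prints 60 */
    return 0;
}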

Yeah, I'm sure there were Very Good Reasons to not mandate this in the standard. I have no idea what they are, but I'm sure they all suck.

C. Bleah.

Posted by Dan at 10:41 AM | Comments (2) | TrackBack

October 08, 2003

Parrot advances

Or at least exploitation of advances.

Parrot's had the facilities in it to call native functions for quite some time (months, possibly upwards of a year) but we've really not used it any--it's just not solved any real problems for folks doing parrot development. Well, since I'm looking at parrot as a target for production work, I've started using it. At the moment, as part of the parrot repository, there are interface files for ncurses (base and form lib) and PostgreSQL.

There's an ncurses version of life in the examples/assembly directory as well, if you want to play around with it. (It's in PIR format, so it's a touch tough to decipher by hand, though if you go back some versions in CVS you'll find a more readable version) Useful? Well... no, not really, at least not at the moment. (Though I need the ncurses and forms stuff for work) Cool, though.

The PostgreSQL interface is also really keen, in its own way. (Though I'm already a touch annoyed with the connection scheme for Postgres. Polling. Bleah) It means, with a bit of code--like, say:

.pcc_sub _MAIN prototyped
	.param pmc argv
.include "postgres.pasm"
  P17 = global "PostgreSQL::PQconnectStart"
  P0 = P17
  S5 = "host=dbhost dbname=sbinstance user=username password=somepassword"
  invoke

retry:
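  # PQconnectPoll returns libpq's PostgresPollingStatusType: 3 is
  # PGRES_POLLING_OK, 0 is PGRES_POLLING_FAILED, and anything else means
  # keep polling.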
  P18 = P5
  P0 = global "PostgreSQL::PQconnectPoll"
  invoke

  print "status: "
  print I5
  print "\n"

  eq 3, I5, continue
  eq 0, I5, panic
  sleep 1
  branch retry

panic:
  print "Argh! Failed\n"
  end

continue:
  P0 = global "PostgreSQL::PQexec"
  P5 = P18
  S5 = "create table foo (bar int)"
  invoke

  P0 = global "PostgreSQL::PQresultErrorMessage"
  invoke

  print S5
  print "\n"

  end
.end
You can add a new table, foo, to your postgres database. Presumably other things too, I've just not written the PIR to do it. (Though the full PostgreSQL 7.3 C interface is wrapped)

Dunno whether this counts as scary, or really cool. Or, I suppose, both. :)

Posted by Dan at 11:09 AM | Comments (2) | TrackBack

Guess there are New Toys coming

So, I popped over to the online Apple Store this morning to price out airport base stations, and what do I see? A "We'll be back soon, busy updating the store!" page instead.

On the one hand, bummer... I wanted to get the prices. On the other, as it's far from urgent, I wonder what New Toys they're going to be rolling out later today.

Posted by Dan at 08:49 AM | Comments (1) | TrackBack

October 03, 2003

Win fun

And prizes! Or not.

I do hate breaking in a new machine, but sometimes it can be fun, or at least bemusing. Right now I'm building a clean perl, on the Win32 box, on my OS X box. (Yay, remote desktop, though it's definitely not an optimized screen update...)

Looks like perl 5.8.1 builds with Visual Studio/.NET, at least as long as there are no spaces in the directory path. When this is done, I'm going to see if I can throw the /CLR switch on the compile and generate .NET code rather than native machine code. (Though not tonight, it's relatively late) I'm curious to see what the speed difference between a native and .NET build is. Miniperl may not be good for a whole lot, but it ought to be good for a nice set of benchmark tests. (Though I expect it won't be quite so straightforward as throwing a single switch on the build)

Who knows, maybe this'll change my mind about whether .NET would make an adequate target platform. This ought to be a reasonable benchmark--after all the speed differential between a native and .NET build of perl should be about the difference in speed between Perl 6 on Parrot and perl 6 on .NET.

Posted by Dan at 10:43 PM | Comments (0) | TrackBack

SPF -- one more little piece to help block spam

Well, I went ahead and installed the DNS records to support SPF. SPF is a DNS-based sender validation thing that allows you to designate what IP addresses may legitimately be sending mail for a particular domain. Apparently SpamAssassin 2.70 will support SPF as well, marking mail that fails the test (IP addresses are designated as OK, bad, or no data) with a higher spam score. Since, in my case, mail from sidhe.org only comes from my server here, I can safely mark the rest of the world as spoofing mail from me.

Will this stop spam? No, of course it won't. Will it slow spam down? Well... I dunno. If some of the domains that are commonly forged (microsoft.com, aol.com, yahoo.com, and hotmail.com) put in SPF records, it'll mean it's easier to throw out mail with forged from addresses, at least from there. Since it's easy (a few automatically generated entries in my BIND config files) and harmless (it doesn't hurt to have them in), I figure I might as well go and do it. The more people that do, the more chance this thing has to actually be useful.
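
For the curious, the records themselves are just TXT records in the zone file, something shaped like this (this uses the SPF syntax as it later settled down, which may not exactly match the draft I put in, and the mechanisms here are only an illustration):

; Mail for sidhe.org should only come from the domain's own A and MX
; hosts; everything else gets the "-all" (treat it as a spoof) result.
sidhe.org.    IN    TXT    "v=spf1 a mx -all"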

Posted by Dan at 05:29 PM | Comments (2) | TrackBack