Squawks of the Parrot: January 2004 Archives

January 31, 2004

Orkut

Well, this thing seems to be taking off. Nat threw in the first invite, so I signed up, and another half-dozen or so folks have fired off invites. This is another one of those networking site things, and I've decided that while I'll go accept invites from anyone I know (I really need to get a picture up--I'd throw the rainbow parrot pic I use for iChat, but the TOS seem to indicate that'd be ill-advised) I don't think I'm going to go to much trouble to actively go search people out.

These things are always interesting to look at, as are the networks they build, but I do wonder how much personal maintenance it'll all take. I'd think it wouldn't take too long before a lot of these links become stale. It just feels like a point-in-time thing, where the links are all OK as of some point, something that gets built and maintained for a short while before being abandoned as just too much hassle to bother with. So we'll be able to look at you and get an idea of your immediate links for, say, a week or so, but any fallings-out or new contacts won't be in there.

Asks a lot of questions too, though that's not too surprising as it's a combination professional/friend/dating site. I figured I'd answer all the questions right, but I managed to make it only about halfway through before I gave up. Sheesh, after the fiftieth page (or so it seemed) it was getting awfully tedious, and I'm not 100% sure that anyone's going to want to know all that info. (I usually put on the professional face in public and try to behave, and I wonder sometimes if folks'd be surprised if I didn't bother)

I'd bet this thing either takes off insanely (which'd be helped, I think, if a reasonable public API was exposed so you could update and track the info without bothering with the execrable web interface) or dies in a month when people tire of it. I can see either one, but I don't think there'll be any real middle ground.

Bet it lasts longer than LinkedIn, at least.

Posted by Dan at 02:16 PM | Comments (2) | TrackBack

Another conference down

Well, it was a shame that NordU didn't go off entirely, but at least part of it made it through to the end. Allison and I did our tutorials, to a reasonably full room for these things (19 people, IIRC) which was nice. I'm pretty sure that the talk was a success--at least nobody left, fell asleep, or threw things, three indicators of a successful talk. :) The after-talk get-together with cph.pm was really nice. The Copenhagen perlmongers are good folks, and it's always nice to spend time with 'em.

Copenhagen was really nice, despite the cold, damp wind, and the fact that I was mispronouncing it the whole time there. (D'oh!) I'd love to make it back there to see more things when the weather's warmer, so I'm definitely hoping that YAPC::EU 2005 goes to Copenhagen...

Now it's time to get presentation proposals together for OSCON 2004.

Posted by Dan at 01:49 PM | Comments (0) | TrackBack

January 23, 2004

24 hours and counting

More or less, at least. By this time tomorrow I'll be well on my way to being well on my way to Copenhagen for the NordU parrot tutorial session. (While the conference didn't meet minimum attendance requirements, some of the tutorial sessions, including mine and Allison's Perl 6 session, have, so they're on) Should be fun--it's always nice to be able to wander around a place with history.

Allison and I'll be doing a get-together with the Copenhagen.pm folks Wednesday evening. That'll prove to be a good time, I expect. Dunno if I'll spend the rest of the time (Sunday afternoon through Tuesday evening) sightseeing, recovering from jetlag, or working on the second edition revisions to Perl 6 Essentials.

And yes, the inadvertent irony of the weather in a Nordic city being significantly more pleasant than the weather I'm leaving isn't lost on me. :)

Posted by Dan at 03:59 PM | Comments (1) | TrackBack

January 20, 2004

The Pain of Text

Yeah, this stuff's all getting cranky. Deal. :)

At the moment, I'm trying to work on specifying text stuff for Parrot. Not simple, of course, because text is such a massive pain. Right now I'm just trying to sort the various functions on characters and strings into the right spot so they can be properly overridden, thumped, assaulted, and generally beaten about.

If you've been following along, you've no doubt seen the rants about text, so I won't reprise them (much) and instead go for the actual useful bits. As far as I can tell (and this is all welded deep into parrot's string handling), there are three basic parts to this:

Encoding: to turn a stream of octets into integer code points
Character set: to give meaning to the code points
Language: to determine the behaviour of the code points

And yeah, there is some overlap between what the character set and the language does. That's part of the problem I'm facing--this stuff's all been invented a dozen or more times and the decisions that were made were all very reasonable but not entirely compatible.

The encoding bit's the least controversial of the layers. Even with the multibyte non-Unicode encodings that do escape-byte stuff there's a pretty straightforward way to map to a 32 bit integer, so that part's easy. (Note that easy and boring have no relation -- putting together the different byte-to-codepoint mapping tables and corresponding code is going to be terribly dull)

The character set's generally non-controversial as long as you don't pay attention to how the individual characters actually look. (If you do, then fights break out and it's not pretty) Unfortunately Unicode adds in the twist of combining characters ¹ so it's not quite enough to look at a single code point to get a single character -- in many common cases for me (most of the languages of western Europe) you've the potential of needing two or more code points to represent a single character. I'm not sure if there are cases where it's reasonable to deal with the parts of a character², but people seem to insist. I dunno, I can't see any circumstances where n and ñ (that second character's an n with a tilde over it, which can be represented as two code points in Unicode) could be in any way equivalent, but what the heck do I know?
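To make the code point versus character distinction concrete, here's a rough sketch in Python. It isn't Parrot code, just an illustration, and the helper name code_points is made up for the example. The same visible character, ñ, is written once as a single precomposed code point and once as a plain n followed by a combining tilde.

```python
def code_points(text):
    """Return the integer code points that make up a string."""
    return [ord(ch) for ch in text]

composed = "\u00f1"      # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"   # plain "n" plus COMBINING TILDE, two code points

# Both should render as the same character, but the counts differ.
print([hex(cp) for cp in code_points(composed)])
print([hex(cp) for cp in code_points(decomposed)])
print(len(composed), len(decomposed))
```

Anything that counts, indexes, or slices by code point will see these two strings differently, which is the whole headache.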

The language bit... that's where things get interesting. Not fun, mind, but interesting. Language is where I put the transforms and meaning-association stuff. Case-folding is a matter for language, as are character classification, sorting, and more complex things like word-break determination.

Part of the problem with the language bit is in defaulting -- it arguably ought to be pulled in from the character set, since the language code ought to be independent of the character set but some of the sets are huge (Unicode!) and often you just don't care for characters out of your language. If you've got a string of Chinese text with "llama" thrown in there you're probably going to treat that as a five character word rather than a four character one. (Yes, I know, these days in most places even text that's really marked as Spanish treats ll as a two-character sequence rather than one--humor me, I couldn't find an accented character in a non-roman-based character set in the 20 seconds I took to look) Then there's the issue of whether some of these characters ought to be classified one way or another. Even if you have Unicode text, should 一 (ichi, one in Japanese) (which should look like a horizontal bar, assuming it pasted in right, and I got the right character, and you can view it... isn't text fun?) be considered a digit if it's in a string tagged as French? (Heck, should it be in there in a string tagged as Japanese? The number/non-number-word distinction's a bit fuzzy there. Or at least I'm fuzzy on it) Should it even be considered a word character? And yes, you can argue that this is a good reason to put in restrictions on allowable data, but that's not something Parrot can really do, so it needs dealing with.
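For a feel of how slippery the "is it a digit?" question is, here's what one language-blind set of Unicode tables says about 一. This is a sketch using Python's built-in string methods and its unicodedata module. It says nothing about what Parrot will do, and the classify helper is invented for the example.

```python
import unicodedata

ichi = "\u4e00"  # 一, the CJK ideograph for "one"

def classify(ch):
    """Collect the answers Python's Unicode tables give for one character."""
    return {
        "category": unicodedata.category(ch),        # general category code
        "is_decimal": ch.isdecimal(),                 # usable in base-10 numbers?
        "is_digit": ch.isdigit(),                     # a digit of some kind?
        "is_numeric": ch.isnumeric(),                 # has any numeric value?
        "is_alpha": ch.isalpha(),                     # counts as a letter?
        "numeric_value": unicodedata.numeric(ch, None),
    }

for name, answer in classify(ichi).items():
    print(name, answer)
```

The tables treat it as a letter that happens to carry a numeric value, rather than as a digit. None of these answers know or care what language the string is tagged with, which is exactly the gap the language layer has to fill.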

Parts of the language handling code are also intimately tied to the character set (you can't upcase "a" to "A" if you don't know that the code point you were handed was an "a") so you almost need to have a multiple-dispatch system set up with per-charset language tables and/or code. Fun enough with roman-based alphabets but it gets potentially really fun when you start throwing in all the asian text variants. (I think, ultimately, it'll be relatively simple. Despite the fact that it looks like you could use Shift-JIS (a Japanese character set) to write out a chunk of Chinese text, I'm not sure it'd be considered Chinese, in which case we have a much more restricted set of charset/language pairs. Except for Unicode, which'll sleep with anyone)
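A minimal sketch of what per-charset language dispatch could look like, in Python rather than anything Parrot-specific. The table layout, the charset and language tags, and the function names are all assumptions made for illustration. Turkish is used as the example language (it isn't one discussed above) because its dotted and dotless i make the default upcasing wrong.

```python
def default_upcase(text):
    # Fall back to Python's language-neutral Unicode case mapping.
    return text.upper()

def turkish_upcase(text):
    # Turkish pairs dotted i with dotted capital I (U+0130),
    # and dotless i (U+0131) with plain capital I.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

# Hypothetical dispatch table keyed on (charset, language).
UPCASE = {
    ("unicode", "tr"): turkish_upcase,
}

def upcase(text, charset, language):
    """Pick the most specific handler, else the default mapping."""
    handler = UPCASE.get((charset, language), default_upcase)
    return handler(text)

print(upcase("iyi", "unicode", "en"))
print(upcase("iyi", "unicode", "tr"))
```

The nice property of a table like this is that a charset/language pair nobody has written code for still gets a sensible default instead of an error.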

Anyway, I think we can do layering stuff enough to hide this. The encoding layer can hand codepoints back and forth, some transcoding, and that's about it. (well, that and some metadata -- lengths and such) Easy enough.

The character set layer can hand you characters, and provide some defaults for the language code to work with. Most of the per-character informational code can live here (is it a letter, is it a digit, is it upper-case, and so on) though the language code potentially ought to get in the way. That'll be simple delegation for the most part.

The language layer is where the transformational and multicharacter fun lives. Case-mangling and word break detection live here (and yes, I know, for some languages word break detection requires an extensive dictionary, complex heuristics, a lunar calendar, and a good random number generator, but...) as do a few other things.

So, for the moment, the list is (and yeah, I'll post it to the internals list):

Encoding
read_codepoint
write_codepoint
to_encoding
from_encoding
knows_encoding

Character Set
read_character
write_character
substring
defaults for language

Language
is_upper
is_lower
upcase
downcase
titlecase
is_alpha
is_number
is_space
charname (maybe)
is_wordbreak (this'll have to take a position)
compare (takes two strings for sorting)
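Here's one way the three layers might stack, sketched in Python with only a handful of the functions from the list. It's a toy under some loud assumptions: the class shapes are invented, the method names are plural variations on the ones above, and Python's codecs turn everything into Unicode code points, which blurs the encoding/character set split that Parrot would keep separate.

```python
import unicodedata

class Encoding:
    """Octets <-> integer code points. Wraps one of Python's codecs."""
    def __init__(self, codec):
        self.codec = codec

    def read_codepoints(self, octets):
        return [ord(ch) for ch in octets.decode(self.codec)]

    def write_codepoints(self, codepoints):
        return "".join(chr(cp) for cp in codepoints).encode(self.codec)

class CharacterSet:
    """Code points -> characters, plus defaults the language layer can borrow."""
    def read_characters(self, codepoints):
        # Glue each combining mark onto the base code point before it.
        chars = []
        for cp in codepoints:
            ch = chr(cp)
            if chars and unicodedata.combining(ch):
                chars[-1] += ch
            else:
                chars.append(ch)
        return chars

    def default_is_alpha(self, char):
        return char[0].isalpha()

class Language:
    """Transforms and classification; delegates to the charset by default."""
    def __init__(self, charset):
        self.charset = charset

    def is_alpha(self, char):
        return self.charset.default_is_alpha(char)

    def upcase(self, chars):
        return [ch.upper() for ch in chars]

# "manana" with a decomposed n-tilde: 7 code points, 6 characters.
octets = "man\u0303ana".encode("utf-8")
encoding = Encoding("utf-8")
charset = CharacterSet()
codepoints = encoding.read_codepoints(octets)
chars = charset.read_characters(codepoints)
print(len(codepoints), len(chars))
print("".join(Language(charset).upcase(chars)))
```

Each layer only talks to the one below it, so swapping the encoding doesn't touch the language code at all.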

Ah, damn, then there's the question of whether sorting and equivalence testing should take encoding into account or not. (This is for Unicode mainly where you can have composed and decomposed versions of the same character) Though the answer there is "it depends", I'm sure. Bugger.
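The "it depends" can at least be made explicit. A small sketch, assuming Python's unicodedata module as the stand-in for whatever does normalization, with the function name and the flag invented for the example:

```python
import unicodedata

def equivalent(a, b, normalize=True):
    """Equality test that either ignores or respects composition form."""
    if normalize:
        # Bring both strings to the same (composed) form before comparing.
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b

composed = "ma\u00f1ana"      # precomposed n-with-tilde
decomposed = "man\u0303ana"   # n followed by a combining tilde

print(equivalent(composed, decomposed))
print(equivalent(composed, decomposed, normalize=False))
```

Whether the default ought to be normalize=True or False is the actual open question, and this doesn't answer it.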

¹ Something I think may be unique to Unicode. Might be wrong about that, though.
² As opposed to the individual bytes of the encoded character, which are useful to deal with

Posted by Dan at 11:47 AM | Comments (5) | TrackBack

January 19, 2004

It's never as easy as it seems

It's time to add in case mangling to parrot. In part because I need it, and in part because, well, it's well past time to be able to reasonably say "Gimme a (lower|upper|title)-case version of this string".

If you're thinking "Why is this not there yet?", well, I'm tempted to quote Written Language Barbie -- case-mangling is hard. I won't, though, because it's not hard, it's just tedious. And it requires a fair amount of thought to set up the frameworks so you can actually do it properly. (Case identification belongs in the character set, while case transformation is a language-specific operation--if you split language and character set operations out, the functions then need to be in separate spots)

Luckily for me, one of the times I was up at O'Reilly's Cambridge office with my car I thought to grab a copy of Ken Lunde's CJKV Information Processing book. (And a car was almost a requirement--at 1100+ pages you could hurt yourself hauling it around. Besides, it may well count as a deadly weapon so you couldn't take it on the T anyway) Not, mind, because it covers everything that we'd ever possibly want to do (it's only Chinese, Japanese, Korean, and Vietnamese) but because it's a good reference for a number of encodings and character sets that aren't Unicode, which is nice. (And that I have a personal interest in, which is also nice -- while I have no doubt that Arabic, Hebrew, and Cyrillic are fascinating, they're not for me) They've also the advantage of being relatively simple, something that Unicode definitely is not. Plus there's the added advantage of having several semi-related character sets handy. (Well, OK, not exactly related as such, but there are well-known transforms amongst at least some of them, and if we can't get the Big5 transforms right it's time to pack it in)

Yes, this means that Parrot will probably get loadable encoding, character set, and language library code in Real Soon Now. With Unicode too, just to be complete, at least if I can get ICU building properly on Debian Linux. (There's something weird going on that makes it fail on a latest-rev Sarge system, though odds are I'll just punt and teach Configure to link against the system ICU if it's installed)

Soon, I'll be able to make a fool of myself in several languages, not just one! Woohoo! :)

Posted by Dan at 05:17 PM | Comments (2) | TrackBack

Important safety tip for mail admins

If you're filtering outgoing mail for viruses, don't just strip off the virus payload and let the message go on its merry way. Either let it through untouched so my virus filters can catch it and kill it, or kill the damn thing dead. (You killing it is my preference, honestly) Stripping the payload just means I get hammered with bogus messages that take more effort to deal with.

This seems to be a bigger problem with the latest round of viruses (BAGLE this time) than in the past. At least the annoying flood of "The mail you didn't send because someone else's infected machine spoofed the from was infected!" bounces and warnings has slowed down. I suppose this is an indication that the 'net, collectively, does learn.

Heck, at this rate we may discover fire in a few hundred years...

Posted by Dan at 10:24 AM | Comments (3) | TrackBack

January 16, 2004

mod_rewrite can be so much fun....

Apparently the disclaimer on the backgrounds page was insufficiently clear (I shall have to change it to "Make a copy. If you link to these directly I will screw with you") so I get to play fun wildcard games with mod_rewrite. This, for example:

RewriteEngine On

RewriteCond %{HTTP_REFERER} ^http://www.xanga.com/.*$
RewriteRule ^/backgrounds/.*\.jpg$ /backgrounds/jpeg_for_folks_who_link_unasked.jpg

RewriteCond %{HTTP_REFERER} ^http://www.xanga.com/.*$
RewriteRule ^/backgrounds/.*\.gif$ /backgrounds/gif_for_folks_who_link_unasked.gif

was a good one. I shall have to gather up a few more extra images (tasteful, if somewhat subversive) and whip up a randomizing CGI program just for a bit of extra fun.
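For anyone who doesn't read mod_rewrite, here's roughly what those rules do, modeled as a plain Python function. This only illustrates the matching logic. Apache does the real work, and the function name and table structure are made up for the example.

```python
import re

# Same patterns as the RewriteCond / RewriteRule lines.
REFERER = re.compile(r"^http://www.xanga.com/.*$")
RULES = [
    (re.compile(r"^/backgrounds/.*\.jpg$"),
     "/backgrounds/jpeg_for_folks_who_link_unasked.jpg"),
    (re.compile(r"^/backgrounds/.*\.gif$"),
     "/backgrounds/gif_for_folks_who_link_unasked.gif"),
]

def rewrite(path, referer):
    """Return the path to serve, swapping in the decoy for hotlinked images."""
    if referer is None or not REFERER.match(referer):
        return path                      # condition failed, leave the URL alone
    for pattern, target in RULES:
        if pattern.match(path):
            return target                # hotlinked image, serve the decoy
    return path

print(rewrite("/backgrounds/stars.jpg", "http://www.xanga.com/someone"))
print(rewrite("/backgrounds/stars.jpg", None))
```

Requests with no referer, or one from anywhere else, get the image they asked for.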

Posted by Dan at 03:35 PM | Comments (8) | TrackBack

January 11, 2004

Progress goes "Whoops, crash, tinkle, tinkle!"

With the last wave of crap coming in, I finally decided it was time to close off the comments for older posts to at least minimize the targets. That meant switching to an SQL database from BerkeleyDB, which is OK since I got burned by Debian with BerkeleyDB recently. (An 'upgrade' in Sarge actually backrev'd things leaving me with some version 9 DBs and a library that only understood up to version 8, but that's a separate problem)

Since I've been using Postgres at work, I figured I'd switch MT over to postgres. It supports that, right? Well...

Ultimately, yeah, it worked. Postgres was locked down in a paranoid way, so badly that you couldn't connect as a different user with a password (fixed in the config, eventually), then a typo in the config file took another 45 minutes or so tracking down wild geese, then it turns out that MT-Blacklist and Postgres don't play nice together (or, rather, Storable doesn't, or something) so that meant more config file editing after an upgrade of MT-Blacklist, and that meant editing one of the files since the recommended workaround for Storable issues has a bug, and...

Gah. After an hour or two of annoyances I'm right back where I started, since I'm running out of battery on the laptop and don't feel like plugging in to dig out Jeremy's comment closing script.

Man, I hate computers.

Posted by Dan at 11:26 PM | Comments (2) | TrackBack

Like crap, falling from the sky

Well, it's officially happened. Someone wrote an automated comment spammer for Movable Type. As I type this, I'm watching repeated netstat show port 80 connections, and seeing the "MT-Blacklist comment denial" messages show up in the log. (36 so far) That's after 37 of the damn things actually made it through (some while I was in the midst of de-spamming things).

Whups, make that 40 of the damn things. A few more just slipped through. And even more on the logs, and I'm not going to bother counting. (Okay, I did. 70 were blocked, 38 made it through (I mis-counted) in a 40 minute period)

This officially counts as insane. I think I'll go analyze the source IP addresses to see if there's any rhyme or reason, but unless they've managed to figure out how to spoof that it looks like these are coming from all over. Makes me wonder if someone's set off a zombie blog spam network, along the lines of those zombie mail spam nets.

I've added some pretty aggressive entries to the anti-spam list now (one of the joys of working for Northern Light -- I remember some of the common spam domain rules. Things like "two or more dashes and it's probably spam" and "The numeral 4? Probably spam" Probably here meaning 90% or better chance, so if you've got a domain with one of those, well... sorry) so hopefully this'll slow things down. http://www.sidhe.org/oldblog/blacklist.txt if you want it, 'specially as the master blacklist isn't being maintained at the moment.
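The two rules of thumb quoted above are simple enough to sketch as a Python function. The function name is invented for the example, and the real blacklist entries are regex-style rules in a file whose format isn't shown here. These are heuristics, so legitimate domains containing a 4 will get caught too.

```python
def looks_like_spam_domain(domain):
    """Apply the two rules of thumb: two or more dashes, or the numeral 4."""
    return domain.count("-") >= 2 or "4" in domain

for candidate in ["cheap-pills-online.com", "loans4u.com", "sidhe.org"]:
    print(candidate, looks_like_spam_domain(candidate))
```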

People. Bah. Probably time to disable comments entirely, which is a shame.

Posted by Dan at 06:13 PM | Comments (33) | TrackBack

January 07, 2004

And today's an even better one

After burning the midnight oil (and the Thanksgiving, Christmas, and New Year's oil, and more weekend oil than I'd like to think about) and batting off the flu (twice, dammit! It's not better the second time around) I'm happy to say... it's done!

It, in this case, is the proof-of-concept, first cut porting of the current work-language-from-hell over to something that sucks much less, and has far fewer database limits.

Or, more succinctly, it means I have a working 4GL language compiler that targets Parrot, with runtime libraries that use NCurses for screen control and PostgreSQL as a back-end database. The compiler's written in Perl 5. The runtime libraries are written in PIR (or IMCC -- basically a higher level of assembly, one that, ironically, is actually a higher-level language in spots than the 4GL) and the only C involved is Parrot itself.

That last part's actually the most cool bit. Parrot's built-in facilities are sufficient to grant full access to both curses and postgres without needing any extension code in C at all. None. It's sweet. It also means that, when Perl 6 rolls out, a good chunk of the modules on CPAN that have XS components won't need to have them, and you won't have to have a C compiler around to build and install them.
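If you haven't seen this style of native call interface before, Python's ctypes module is a loose analogy: you describe a C function's signature at runtime and call it, with no compiled glue code. To be clear, this is not Parrot's NCI and says nothing about its syntax. It's a sketch of the general idea, it assumes a POSIX system, and strlen is just a convenient libc function to call.

```python
import ctypes

# POSIX only: CDLL(None) opens the running process, which already has libc loaded.
libc = ctypes.CDLL(None)

# Describe the C signature at runtime: size_t strlen(const char *).
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call straight into C with no extension module in between.
print(libc.strlen(b"parrot"))
```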

Now I think I'll take a nap for a while...

Posted by Dan at 05:06 PM | Comments (4) | TrackBack

January 03, 2004

Today is a very good day

Because today I finished up the first draft version of the database access RTL for the big work project. Yeah, it's really fragile, and likely to fall over in odd ways when stressed, and the DB schema itself still needs a touch of work (missing UNIQUEs on the primary index) but... it works. I can insert, delete, and read records, as well as going forward and backwards through the DB table in primary key order. I still need to work on partial key matching, and walking forward through a partial key match set, but that's next, and I can probably put it off for a little bit. But hey, it works, and it's kinda cool to watch what's going on in a separate terminal window from the Postgres psql client.
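For the curious, the operations described here (keyed reads, walking forward and backward in primary key order, partial key matching) can be sketched in a few lines of SQL. This is not the real RTL, which is written in PIR against Postgres. It's a stand-in using Python's bundled sqlite3 module so it runs anywhere, and the table, column, and function names are all invented for the example. Placeholder syntax also differs between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_key TEXT PRIMARY KEY, descr TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("A100", "bolt"), ("A200", "nut"), ("B100", "washer")],
)

def read_next(key):
    """Row with the smallest key greater than `key`, or None at the end."""
    return conn.execute(
        "SELECT part_key, descr FROM parts WHERE part_key > ? "
        "ORDER BY part_key LIMIT 1", (key,)).fetchone()

def read_prev(key):
    """Row with the largest key less than `key`, or None at the start."""
    return conn.execute(
        "SELECT part_key, descr FROM parts WHERE part_key < ? "
        "ORDER BY part_key DESC LIMIT 1", (key,)).fetchone()

def partial_match(prefix):
    """All rows whose key starts with `prefix`, in key order."""
    return conn.execute(
        "SELECT part_key, descr FROM parts "
        "WHERE substr(part_key, 1, length(?)) = ? ORDER BY part_key",
        (prefix, prefix)).fetchall()

print(read_next("A100"))
print(read_prev("B100"))
print(partial_match("A"))
```

Walking forward through a partial match set is then just read_next with the extra prefix condition added to the WHERE clause.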

Between this and the curses-based screen RTL code, the runtime for the DecisionPlus translation project is done enough to use, which means now all I have to do is whack the compiler some to use the RTL as it really looks, rather than how I thought it would work when I was doing the last pass. I fully expect quite a few "What the hell was I thinking?" moments. Who knows, if this goes well, maybe I'll nudge the good folks around the office to let me release the required SQL and bytecode for a working Parrot demo, though I'm not sure any demo that starts with "Use this SQL to create a new table in your Postgres 7.4 or higher Postgres database" will be all that popular.

I should note, for those following along at home, that this is all in Parrot. With a stock (well, latest CVS checkout if you're on a non-x86 platform, or on an x86 platform without the JIT¹) interpreter no less. While the resulting bytecode file's a bit big (with the ncurses and postgres wrapper libraries, and the language-specific screen and db RTL a test program's 84K of bytecode, but that's off of 91K of source, so I suppose it's not too bad) it works, and the time to fire up parrot, load in the bytecode, connect to the database (running locally), insert 32 records, delete 1, and fetch 4 is 1.9 seconds. And that includes a 1 second sleep as part of the DB connection code. (Postgres has this weird asynchronous connection scheme where you have to make the connect request then poll to see if it's done. Don't ask, it's just one of those things apparently) 0.14 seconds of combined system/user time total, which is definitely Not Bad.

It's all coming along surprisingly well. It's not even impossible that I might make the Jan 9th demo deadline for this. (I'll definitely have the demo version ready to run for NordU, and if I miss the 9th I still ought to have it for the Boston.pm meeting on the 13th) That'd be nice, as there's lunch for the department on the line, and I'd not like to get between everyone and that... :)

¹ We had to add in a number of NCI function signatures for Postgres 7.4, which I'm using because it has placeholders, and placeholders make life very much easier. Parrot's JIT can build the function calls dynamically on x86 platforms, but at the moment we can't do that elsewhere, so we have to have a wad of precompiled function signatures built into Parrot as a stopgap, so you need the latest version to make it all work. Yeah, I know, we should use libffi, and we may as a fallback, if we don't just give in and build up the function headers everywhere.

Posted by Dan at 05:11 PM | Comments (6) | TrackBack

January 01, 2004

Ah, Copenhagen in the depths of winter

Well, class materials are in, tickets are bought, and I'm off to Copenhagen for NordU in a few weeks. (I get in on the 25th, and leave on the 29th, with the class the afternoon of the 28th) Should be interesting to wander around Copenhagen for a few days, too--I've never been there. (Anyone with recommendations as to what to do, feel free to let fly...)

Posted by Dan at 12:43 PM | Comments (2) | TrackBack