Or: Where things went wrong, and where things went right.
Parrot's not dead or anything, so far as I know. There's too much riding on it for too many people, and last I checked (which, granted, was months ago) it was still going, so this isn't really a post-mortem on the project, but rather on my tenure with it. (Which is dead) This, I'm sure, isn't all of it, but I've been jotting things down as time's passed and filling in the explanations for the bits and pieces. Hopefully it'll help other suckers, er, volunteers who might run projects in the future.
Trusting the code gnomes: Good
This is something that worked out pretty well -- that is, checking in code that was barely OK, or minimally functional, and trusting that people would take the skeleton and flesh it out. This happened a lot, and it was quite nice to see.
It's not at all surprising. A lot of what we were doing in parrot was scary for a lot of people. Unnecessarily scary, but that's irrelevant -- we were doing Bizarre Dark Magic, and it didn't matter what we said, actually creating some of that from nothing was more than almost anyone thought they could do. On the other hand, it's far less scary to take a piece of working Dark Magic and change it around some, patch it up, and make it do more. That was relatively easy for a lot of folks.
It took me a while to recognize this, but once I did it was helpful, both in getting stuff done and in getting new people involved. There was a pool of "well, that stuff kinda works, but it's nasty and if it did X too it'd be great" code folks could poke at, as well as a good chunk of code that got checked in with "barely functional but works" checkin messages. Those tended to get quickly rewritten by folks watching the list that wanted to help and saw something they could comfortably work on.
Keeping an eye on the prize: Good
This was a constant fight, made worse since a lot of people didn't do this. (And in part made worse because the long-term plans were detailed in a dozen or more different spots, making it difficult for anyone to keep track) Damn near everything I designed in parrot was done with an eye towards what I wanted the total package to look like. That meant juggling threads, Unicode, alternate character encodings, asynchronous IO, potentially asynchronous garbage collection, continuations, events, notifications, perl's dynamic data, all the new stuff Larry was pondering for Perl 6, along with all the stuff perl 5, python, and ruby do now.
This is one of those things that's tough for most people to manage, not because they're not capable (dunno if they are or not. I assume people are) but because to do it means having the whole design in your head, and that's tough if you're not the person doing the design. It's doubly tough if some of the design goals are things you either just don't care about, or actively dislike, and we had a lot of both going on with various people.
Skimping on the docs: Bad
This was a pretty constant refrain -- where are the docs? Well, I didn't write enough of them, and the ones I did write tended to spark fights, either because people strongly disagreed or they weren't clear enough. Unfortunately I found it tough to grab the time I really needed to get the documents written, which was a constant problem.
It didn't help that people tended to ignore the documents once they were written, which pissed me off more than once. Well over half the complaints about docs being out of date were because the code as implemented didn't match things as specified in the docs and the docs were what was supposed to be happening. (This is important -- don't whine about out of date docs or missing docs if you've gone out of your way to ignore the specifications as written)
Not getting an HLL up and running and maintaining it: Bad
Assembly language is fine as far as it goes, and having an assembler done Real Quick(tm) was damned important. What I should have done shortly after that was to get an officially blessed and maintained Parrot Twiddling Language that would be usable by anyone who isn't keen on really low-level stuff. Parrot got a lot of Perl people (and later some Ruby and Python people) but Parrot itself was written in C, and a lot of the low-level hackery needed to be done in C.
There was, though, a massive amount of stuff that didn't need to be done in C. If we'd had a high-level language ready to go nearer to the beginning then we'd have had a much better handle on the compiler interface, a place for people to play much sooner, an easier way to write tests, and a good leg up on the standard library.
The gee-shucks architect: Good and bad
One of the things I decided early on was to downplay my own skills.
Let's be blunt and up front here. I'm good at what I do, and I do a lot of things that a lot of people can't manage. Parrot's first mark and sweep garbage collector was put together in about four hours while I was sitting in my local public library (no wifi but comfy chairs) and it worked pretty well. The hardest part about it was doing the dull stuff involved with it. I'm reasonably certain that's not normal. And no, I don't know how to say "Yeah, I'm good at doing that stuff that makes your head explode -- it's easy" without coming across as an arrogant prat, so I don't. (Say it. I've no idea about the arrogant prat bit)
On the other hand, I'm usually really uncomfortable being up front about what I'm good at (much to a number of people's deep annoyance), and I really, really didn't want people afraid to touch code I'd written because they thought it required some sort of skill at brain surgery to alter.
One of the running jokes was that we'd know Parrot was ready for release 1.0 because all the code I'd written and checked in had been rewritten. I cultivated this, to the point of occasionally checking in bad (working, but crappily written) code on purpose, partly to give the code gnomes something to gnaw on and partly to enhance my rep as an adequate but not great coder. If people thought the code was deeply magic they wouldn't want to touch it, and I knew we needed people who weren't afraid to touch.
This was partly a good thing. We got a lot of people who hacked on parts of an interpreter system that they might not have otherwise dared touch -- if I could deal with it, and I was just OK as a coder, then they could dig in. A number of people who might not otherwise have gotten involved actually did get involved because of this, and that was a good thing.
Unfortunately the downside there is that I lost a lot of respect from some people, because I was viewed as, at best, an adequate coder. When you've got people whose sole measure of personal worth is the code they produce, well... you can see the problem.
Overcommitting the architect: Bad
I had a full-time job (with more than full-time time commitments), a family, a marriage that's been rocky for ages, and Parrot on my plate of things to do. Any one of those things was enough by itself, and two would keep anyone busy. I had all four, and they all suffered in one form or another. (Parrot, unfortunately, didn't always get the short end of the stick either) Bluntly I had more things to do than any one person could reasonably do, and I didn't have the sense to back out of some of my commitments until things got past bad for me.
Running a project like Parrot, where the scale's damn big, requires a minimum amount of attention, and for a while I didn't have that attention to give it, and what attention I did have was mostly wasted fighting with Leo.
Professional courtesy: Good
One of the things I insisted on was that people behave professionally. (Not, mind, as adults -- I know plenty of adults who behave badly, and some of the kids (and yes, Brent, you and Zach counted at the time :P) involved with parrot behaved better than some of the older adults) I wasn't looking for the sort of cult of personality that pervades the perl, python, and ruby development camps -- I find that rather distasteful, both as the potential target and as a general participant. I also remembered the perl5-porters mailing list when it was more a shark tank than a development mailing list, and I did not want that.
All-volunteer or not, I viewed parrot as a professional engineering project, and I wanted people to behave accordingly. For the most part everyone did, and I was happy about that.
Not telling people to shut the fsck up and cope: Bad
Professional courtesy is fine and all, but if you're running a project, with volunteer labor or not, sometimes you need to tell people to suck it up and deal. More importantly, there's nothing wrong with a judicious exercise of authority, especially if the exercise of authority is part of the job you hold. I was the architect for the project and it was my responsibility to design the system and get that design implemented. Part of that meant that yes, I should have told people to cope. Shying away from confrontation doesn't do anyone any good -- I did, and the project suffered for it.
People not shutting up and dealing: Bad
This one was, to some extent, out of my control, but it was still a problem, especially with Leo. If you're going to take part in a project, that's fine. If a feature or something is under development and you want to chip in, that's fine too. When the decision is made, though, shut up and cope if you disagree. If it's not your call, then once the call's made you deal with it. Odds are the person who made it also has some issues with it, but there are reasons for it to have been made, reasons you may be unaware of, or not understand. Regardless, projects don't go anywhere if decisions keep getting rehashed over and over again.
This was in part because of me not putting my foot down enough, but even in the cases where I did it was often ignored, and that was a problem. If you don't like the decisions being made in a project you register your objections when appropriate, and if things don't work the way you think they should you go away and find something else to do. The world's a big place with a lot of things going on, and nobody in a volunteer project needs to deal with you bitching about or subverting decisions you don't like. Register your complaint and either cope or go away.
Deferring to contributors: Good and bad
Parrot was all-volunteer, and because of that I didn't push people much, and took what I could get. It made sense, since how could I reasonably put any pressure on people who were donating their time and efforts? Well, I should have, at least more than I did.
When you're running a volunteer project, there's nothing wrong with asking people to do things, and expecting that they'll do them if they say they will. People are sensible and know what they can do, and will either step up or not. You can't demand people do something they've not volunteered for, but if they say they'll do it, you have every right to expect that they will, and it's fine to ask as long as you can take no for an answer. Being volunteer does mean people may bail on you with little or no notice, and it means you take second (or third) place behind other activities, so you need to keep on top of who's doing what, but there's nothing wrong with asking.
I did my best to respect people's other commitments -- everyone's got a life outside Parrot and honestly if the choice was someone doing parrot and making a mess of their home life or bailing on us, I'd be happy to forcibly kick people out to go deal with the important stuff. Not that I had to, but Parrot was just software and on the whole software's just not that important, not in the grand scheme of things.
Keeping mum about Perl 6: Bad
Mmmm, perl 6, the original reason Parrot started, and after a couple of years a nearly irrelevant thing in parrot development. I hear it's more important now, which is fine.
I'll be blunt. I don't give a damn about perl 6 at this point. Haven't for years. I'm not a big OO guy (when we started I had a knee-jerk dislike of objects, a problem I've since shed, though as a performance guy I think they're overused) and perl 6 was getting deeply OO during the design. Plus dealing with Larry as a designer was... well, it was a pain in the ass. I finally gave up in disgust on perl 6 when we lost over a week of Larry's thinking about perl time to the 5 disc DVD releases of one of the Lord of the Rings movies. (The first one, I think, though it's been years)
There was also a lot of creative tension at times in the whole design process, and I hold the dubious honor of being the only person I know of, outside his kids, to get Larry mad. (And yes, he does get mad, though in a nice way. Go figure) This is one of those "laws and sausages" things for most people. Does it matter to you how much waffling Larry's done, or how many times (and over what) he and I just-barely-didn't-shout at each other, or how many things I flat-out told him I wasn't going to implement if he designed them in, or implement even if he forbade it, or how bloody long some things took? No, it really doesn't. What ultimately matters is the final result, the rest is just development crap, and no different than you see in a lot of other projects.
Not, mind, that I think Perl 6 is going to be bad. I don't. I respect Larry immensely as a designer -- he's good, and on some days he hits great, and I say that even disagreeing with some of his decisions. There are a number of good people working to make the design happen, too.
I just don't give a damn, but I kept mostly silent about it, and I think that, for me, that was a mistake. Others may disagree, which is fine, but the perl 6 development process has not been trouble-free, and I think it could stand to have a lot more light shined on it. Looks like that's happening more now, which is a good thing. It just should've happened earlier.
Normally I wouldn't name names -- it's unprofessional and a bit unbecoming. Unfortunately you'd have to be completely uninvolved with parrot or desperately clueless to not know how well we did, or rather didn't, get along. Leo was the single biggest mistake I made with the project.
Leo, bluntly, is a massive pain in the ass, and because of him Parrot was about a year behind where it could've been when I left. I spent far more of what energy I had dealing with him than anything else: rehashing old, settled design decisions over and over again, putting up with snide comments, shots at the design (often complaining about missing design documents that actually existed), and whines about how things were designed and how bad those designs supposedly were. Something, bluntly, I found infuriating, since most of the things he complained about were for languages he didn't program in. This would include perl, python, and ruby. Leo, as someone who never used any of those languages, apparently knew better than those of us who've worked on the cores of one or more of them what was good and bad.
Let's be clear. Yes, Leo writes a lot of code. Yes, he goes and implements features. Those are good things under normal circumstances. Unfortunately his code's difficult to get into and not all that great, and it puts off anyone who wants to modify it, pretty much leaving any system he's worked on impenetrable to most anyone else. The features he did implement fresh didn't follow the documentation for those features, and he'd re-implement the same system over and over again rather than working on something new. His interpersonal skills drove a lot of people off the project as well -- I'm far from the first who left Parrot because of Leo. Bluntly, Leo was far more of a hindrance than a help, and I put him in the position to kick the crap out of Parrot, so I've nobody to blame but myself here.
Yes, I know, volunteer project, and the third (or is it fourth) rewrite he did on the garbage collection system's fast! Woo! But... so what? In the mean time dealing with him wasted so much time and energy that exceptions, IO, and events never got dealt with, a dozen or more good people went away permanently, a large chunk of parrot's a bloody mess, and a lot of it is prematurely optimized into near-unfixability.
I shouldn't have made Leo pumpking when I knew he'd driven off other developers, and I should've told him to go away the first time I ignored my perl6-internals mail for a week because I couldn't deal with him. I didn't, and Parrot suffered.
So, what did I learn from all this?
I expect there's more, but there you go, and take it for what you will.
This is one of those things that doesn't really have a place in one of the other parrot sub-categories, so I'll just go on about it here.
How the heck was parrot supposed to do context? Context, here, being "what does my caller expect me to return?" For perl 5 this means either scalar (it expects a single value back) or list (it expects multiple values back) but perl 6 made things a bit more complex. (Which is a separate topic, for someone else to talk about)
This was one of the things that really bugged me, and you'll note the glaring lack of context information in parrot's calling conventions. This was very much on purpose -- Larry'd waffled back and forth some on how much context information should be available to subs, and so I'd put it off for a while. This was one of those things that just extending the perl 5 meaning wasn't enough for.
You'll note it never resurfaced, at least not in the calling conventions.
Why? Because it just wasn't needed.
If you look at how parrot calls subs and returns from subs, you'll of course notice that they're done identically. Same registers filled in, same method of invocation, same everything. That's because the way that parrot uses CPS means that calling subs and returning from subs works identically. Return values are just parameters passed in to the return continuation.
Which means that there's no difference between the expected return values and the expected passed-in values for a sub.
Which means that the return context is just the prototype for the return continuation.
So, if you want to figure out what context you were called in, you just need to fetch out the prototype property from your return continuation and you're set.
No muss, no fuss, no fancy nothing.
Granted, this then raises the question of how to specify a prototype for a sub object, but that's a separate question, and one that needs answering anyway, so the answer to it answers the "how do I specify the return context" question.
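To make the symmetry concrete, here's a tiny Python sketch of the idea. The class and property names are invented for illustration, not Parrot's actual API: a sub "returns" by invoking its return continuation with the return values as arguments, so the caller's expectations can hang off that continuation as its prototype.

```python
# Minimal sketch of continuation-passing calls, where "return" is just
# another call: invoking the return continuation with the return values
# as its arguments. Names are illustrative, not Parrot's real API.

class Continuation:
    def __init__(self, func, prototype):
        self.func = func            # what runs when the continuation is invoked
        self.prototype = prototype  # what the caller expects back

    def invoke(self, *values):
        # Returning from a sub and calling a sub are the same operation.
        return self.func(*values)

def add_pair(a, b, return_cc):
    # The sub discovers its calling context by inspecting the prototype
    # hung off its return continuation.
    if return_cc.prototype == "list":
        return return_cc.invoke(a + b, a - b)
    return return_cc.invoke(a + b)

# A caller that wants a single value back ("scalar" context, in perl terms):
scalar_cc = Continuation(lambda *vals: vals, prototype="scalar")
print(add_pair(3, 4, scalar_cc))   # (7,)

# A caller that wants multiple values back ("list" context):
list_cc = Continuation(lambda *vals: vals, prototype="list")
print(add_pair(3, 4, list_cc))     # (7, -1)
```

The point of the sketch: context never needs a slot in the calling conventions, because it rides along on the continuation that's passed in anyway.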
Back in June I gave a presentation to the Boston ACM. The talk went much longer than I'd planned, and I only managed to get through one of the two sets of slides I had. I promised everyone I'd get an annotated version of the mystery presentation up as soon as I could. Of course, this was in the middle of my machine's "thrashing about and dying" phase, so it's taken a little longer than I'd originally planned.
Finally got it done, though, so if you're interested feel free to snag the annotated PDF of my Parrot Implementations talk. It doesn't cover everything, by any means, but it does talk about some of the interesting things we're doing as part of Parrot's development. (Well, I think they're interesting, at least)
So, I finally did the last draft of the bytecode/assembly level string design for Parrot. It was a mixed bag--the per-string language tag is gone (darn!) but national character sets stay (yay!) with a set of "It's all Unicode no matter what you say" string ops thrown into the mix. Like any other engineering task with multiple conflicting requirements and strong proponents of different schemes, it's safe to say that everyone's unhappy with the result, but I think everyone can make do with what we have.
What ultimately resulted, if you don't feel like going and looking up the post in the archives (I'm offline so I don't have access to a URL), is this.
A 'string', for parrot, is a combination of byte buffer and grapheme buffer. (Graphemes are the smallest unit of text representable. They're usually represented by a single integer, but accented characters and some scripts may represent them with more than one integer) Yes, this is a bad idea, but it's how programs deal with text, so we cope. Anyway, programs may look at these strings byte by byte, integer by integer, or grapheme by grapheme. Each string has an encoding (which is responsible for turning the bytes in the underlying buffer into integer code points) and a character set (which is responsible for giving some meaning to those code points) attached to it. Programs can deal with strings either in their 'native' form or as purely unicode data, and if a string isn't unicode, treating it as unicode will cause parrot to automatically convert it from whatever form it's in to Unicode. (Which makes the "All-Unicode all the time" folks reasonably content)
This duality provides the benefits of delayed (possibly delayed to never) conversion saving CPU time, mmappability of the source text (hey, after all, if it's not Unicode on disk but you never convert it, and are only reading it, why not just map the file into memory and pretend you read it the old-fashioned way?), and the ability to natively manipulate non-Unicode text without having to pretend there are files involved. (Because sometimes you do need to use native character sets without files--if you're generating zip files in-memory, or talking to a database) Plus there's the bonus of not burning conversion time to hoist Latin-n text to Unicode if you really do want to treat it as Latin-n text.
The encoding and character set systems are all pluggable and dynamically loadable as well, so if you don't want to yank in ICU to process your ASCII text, you don't have to. Which is swell for the large number of people who don't want to.
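A rough model of that dual representation, sketched in Python with invented names (Parrot's real strings live in C, with pluggable encoding and character-set handlers):

```python
# Sketch of a dual-representation string: it keeps its native bytes
# (plus an encoding name) and only converts to Unicode the first time
# someone actually treats it as Unicode. Class and method names are
# invented for illustration; this isn't Parrot's real string API.

class ParrotString:
    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw            # untouched byte buffer (could be mmapped)
        self.encoding = encoding  # e.g. "latin-1", "utf-8"
        self._unicode = None      # filled in lazily, possibly never

    def byte_length(self):
        # Byte-by-byte access needs no conversion at all.
        return len(self.raw)

    def as_unicode(self):
        # Treating the string as Unicode triggers a one-time conversion.
        if self._unicode is None:
            self._unicode = self.raw.decode(self.encoding)
        return self._unicode

s = ParrotString(b"caf\xe9", "latin-1")
print(s.byte_length())   # 4 -- no conversion needed
print(s.as_unicode())    # 'café' -- converted on first Unicode use
```

If the program only ever reads the bytes, the decode never happens, which is the whole "possibly delayed to never" win.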
The single most difficult part of this job, by the way, isn't the technical issues. It's the politics. But at least I knew that going in. (Though, honestly, knowing and understanding are two very different things)
It's bound to happen, but it's something that almost nobody working on a new project wants to deal with -- standardization. Or productization, or some other -ization, of which there are far too many. But it's that point at which you need to look at things and decide that things have gotten large enough that it's time to say "This Will Not Change" and be done with it. It's got to be done, of course, if you ever want a project to move past the toy stage.
Parrot's been doing this in fits and starts as we go along, though up until now many of the "permanent" decisions (for some fairly variable definition of permanent) have been more design things than implementation things. Most of the opcodes have been pretty permanent, but that's about it. Most of the rest is firm but not really fixed, at least not officially. Today, though.... today we start making things official.
In this case, we're officially mapping out the basic variable types that parrot will ship with. (The guarantees here are for a normal version of parrot--stripped down versions may have fewer of these) Nothing fancy--basic undef/int/float/string/bool PMC types and their array variants, plus some of the types parrot uses internally (such as the environment PMC and ordered hash we use for namespaces and pads) but they need defining, so... they're defined. Up until now folks have been generally using the Perl* variants, but besides being distasteful to some, those classes do more than a basic type ought, so this'll be good there.
If you're following along with docs, these types are defined in PDD 17, Basic Types.
Sometimes the more things change the more they stay the same. Other times the more they stay the same the more they change.
Anyway, for those of you keeping track at home, yesterday we just officially gave in and declared that all non-load/store operations on PMCs are always and unconditionally dispatched through Parrot's binary-op multimethod dispatch system. Before, you had to actually ask for it; now you get it whether you want it or not.
Multimethod dispatch, as you might remember, is a way of finding which function to run based not just on the function name but also the types of the parameters to the function. So rather than having a single add function, there's a whole list of add functions each with a separate function signature, and when we need to do an add we find the one that's got a signature closest to what we actually have.
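In miniature, the idea looks something like this Python sketch. A real implementation also walks the type hierarchy to find the *closest* match rather than only exact ones, and Parrot's tables are C structures; the names here are purely illustrative.

```python
# Toy multimethod dispatch: instead of one add function, there's a
# table of add functions keyed by the operand types, plus a default
# function as the absolute fallback.

mmd_add = {}

def register_add(left_type, right_type, func):
    mmd_add[(left_type, right_type)] = func

def default_add(a, b):
    # Absolute fallback when no specific signature matches.
    return a + b

def dispatch_add(a, b):
    # Look up the most specific signature; fall back to the default.
    func = mmd_add.get((type(a), type(b)), default_add)
    return func(a, b)

# Give int + str its own method; everything else falls through.
register_add(int, str, lambda a, b: str(a) + b)

print(dispatch_add(2, 3))         # 5, via the default
print(dispatch_add(2, " birds"))  # '2 birds', via the (int, str) entry
```

As long as there's a default function to land on, new signatures (or whole new operator tables) can be bolted on without touching the dispatcher.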
Why, then, go MMD?
Well, to start with, we were pretty close to it already. Perl 5 and perl 6 both define their operator overloading functions as mostly MMD. (Perl 6 does it completely, while perl 5 just does it mostly, kinda, sorta) The base object class in parrot does MMD, and we want to make it so Python and Ruby objects descend from it. (Which is doable) So basically... most of the PMC types parrot deals with do MMD.
The way we were doing it was to have the PMC's vtable function do the dispatch if it wanted it. That had some problems, though, the biggest of which was that it set up a continuation barrier. Because of the way Parrot's implemented, continuations can't cross a parrot->C or C->Parrot boundary--that is, once we leave bytecode and enter C, or call into C from bytecode, we set up a wall that a continuation can't cross. (That's because we can't save and restore the system stack) So... if you overloaded addition in bytecode, you could take a continuation just fine, but as soon as you left the vtable function that continuation became unusable. Hardly a huge deal, but still... annoying. Also, this was a little slow. Not a lot slow, mind, but we dispatched to the vtable function and then to the mmd function, and if we're doing MMD 90% of the time anyway then that vtable dispatch is just wasted overhead.
With MMD as the default we get to change some of that.
First, of course, we cut out the extra overhead in dispatch, since we skip the vtable dispatching altogether.
Second, since we dispatch directly from the op functions to the MMD function, we can check and see if it's a bytecode or C function, and if it's a bytecode function dispatch to it as if it were any other sub, so continuations taken will live longer and will be more usable. (Why you'd take a continuation from within an operator overloading function I don't know, but that's not my call)
Third, we get to shrink down the PMC vtable. Now it's only got get & set functions and some informational and metainformational functions in it. That makes instantiating classes a bit faster, since each class gets its own vtable. (Though if class creation time's a limiting factor in your code then, well, you're gonna scare me)
Fourth, we lift the artificial limit on the number of operators one can do MMD on. Before, you could only really do it on the operations that had functions in the vtable and, while there was a pretty comprehensive list, it was still pretty fixed. Now, well... there's nothing to stop you from adding another table to the list. As long as you've got a default function to call as an absolute fallback, you're fine. This lets us add in the hyper ops as overloadable without having to screw around with expanding the vtables, and that's cool.
There's still a lot of work to be done to make this work out--the vtables have to be chopped down, the ops need to use MMD exclusively, and we need to rejig the PMC preprocessor to take what were vtable functions and turn them into default functions for the PMC type, but... well, this can be done mostly automatically, so it's doable.
This would probably be a good place to plug Eric Kidd's paper Efficient Compression of Generic Function Dispatch Tables, so I will. Useful stuff, that.
So now Parrot's got continuations as the basis for most of its control flow and does MMD for all variable ops... Makes you wonder what's next. (If it turns out we use Lisp as an intermediate language, with annotated Sexprs (encoded in XML) as our AST, you'll know we've crossed over into the Twilight Zone...)
A good question, and hopefully this is a good answer.
This is a response, of sorts, to some of the feedback generated by the Parrot compiler article. (And yep, it's likely that only one of the core parrot guys could write the article, which is why I did--now everyone's got a good chance at writing compilers for parrot :)
The compiler article didn't really get into why my work project's targeting parrot, just that it was doing so. No big surprise, since the opening bits were really meant as a way to draw people into the article as well as give the idea that it's both possible and reasonable to write a compiler for an old, crusty language as a way to transition to something that sucks much less than what you presently have. There are a lot of folks stuck with limited Domain-Specific Languages, or a code base written in some antique dialect of something (usually BASIC or PL/I), or some custom 4GL whose whiz-bang features neither whiz nor bang any more, and while you can just dump the whole wad and move to something new, well... that's expensive, very disruptive, and risky. More than one shop has found that moving a currently working "legacy" system to The Next Great Thing is a big and very expensive step backwards.
Not to say that I don't like moving to something new, but the crusty sysadmin in me wants a solid transition path, good fallback contingency planning, and a Plan B, if not a Plan C, just in case. It's often much better to transition the underlying infrastructure first, then refactor, rewrite, or just plain shoot the code in the nasty source language at your leisure. It's also a lot less work in the short term in many cases, which lets you move over to the new system quickly. Often in these cases the problem isn't the language, no matter how crappy it might be. Instead it's the runtime limitations--memory, database, screen handling, or whatever--that really get in the way, so the old crap language can work just fine, at least for a long time, if you can relieve the underlying runtime pressures.
Anyway, the explanation of the work project.
As I said in the article, our big issue was the database library for this language--standard ISAM DB with file size limits that made sense at the time but are starting to bite us, with no real good way to fix them. We had to move to something else, or we'd ultimately be in a lot of trouble.
The first plan was to write a compiler for DecisionPlus that turned it into Perl code. We'd run the conversion, have a big mass of somewhat ugly perl code, then shoot the original source and be done with it. All the code would then be Perl, we could work with it, and refactor it as we needed to so it didn't suck. (You can hold the snide comments on perl's inherent suckiness, thanks :) We fully expected the result to look somewhat nasty--hell, the language had no subs, the only conditional statement was if/then/else, and control flow was all gosubs and gotos. To labels. All variables were global, there was no scope at all, and, well... ewww. In a big way.
I was brought on because it's rumored I've got reasonably good perl skills, and I've done compiler work, so... I set in.
The initial compiler was written in perl because, well, it's a nice language for text processing if you like it, and it's got a lot of powerful tools (in the form of CPAN) handy. Yeah, I could've done it all in C with a yacc grammar, but it's a lot less effort to get that level of pain by just smacking myself with a hammer. The first cut of the compiler didn't take too long, as these things go. A few months and I had something serviceable that would run against the whole source repository and work. There were still some issues with the generated code, but nothing bad.
Unfortunately... the output code was nasty. Not last act of Oedipus Rex nasty, but still... what I had to do in the perl code to maintain the semantics of the DecisionPlus source resulted in awfully ugly code, even with some reasonable formatting by the compiler. Lots of little subroutines (so gosub/return would work), lots of actual gotos, and because of the little subs lots of scopes all over the place to make refactoring painful. One thing I hadn't counted on was the extent of the spaghetti in the code--since there wasn't any syntax to restrict things, control flow was an insane mess full of bits and pieces done by people writing Clever Code. (Like labels that were sometimes goto-d and sometimes gosub-d, with exit paths decided based on if statements checking global variables)
There was another issue as well--typed data. DecisionPlus has a type system rather more restrictive than perl's, including length-restricted strings and bit-limited integers. And, while these things caused no end of obscure bugs, we had to be bug-for-bug compatible because there was code that used these behaviours to their advantage. To get that with perl meant using tied variables to impose the load/store semantics and overloaded operators to get the binary operations correct. Unfortunately ties and overloads are two places where perl 5 is really, really slow. Taking a look at what'd be needed to make this work, it became pretty clear there'd be a lot of overhead, and that the result would likely perform badly.
So, we gave it a shot, and it became clear that the primary goal, getting an editable perl version of the source, wasn't feasible, and even using perl as a target for the compiler would be sub-optimal. That's when Plan B came in.
Plan B, if you hadn't guessed, was to target Parrot.
Now, I was actually pretty nervous about this -- I was not sure we were ready for prime time with parrot. It seemed like a good idea, but... we weren't sure. Really not sure. I've done the sysadmin thing, and I've been in the "it must never, ever go down" environment, and I know enough to make a sober risk assessment. And, doing so, the answer was... maybe.
Importantly, though, even if things failed, we wouldn't be wasting all our time. We'd still get a more robust and mature compiler, and a better idea of the issues involved with compiling the language, so... we set a very aggressive goal. If I could make it, we'd know parrot was up to the task, and if not, we'd go to Plan C with a better understanding of the problems and a compiler architecture ready to retarget to another back end. Plus we'd have shaken out many of the issues of switching to a new database system (Postgres rather than an ISAM setup) and screen handling system. (I spent a fair amount of time teasing out escape sequences from primitive character function databases, poring over VT220 and xterm escape sequence manuals, and back-translating them to curses functionality. Now that was fun...)
It worked out, of course. We wouldn't be at this point (writing the article and all) if it hadn't, though it was touch and go for a bit. Still, I beat the deadline by a day and two hours, which was cool. And a lot of bugs in parrot were shaken out, and some functionality prompted, because of this, which was good too--always good to have a real live application.
Oh, and Parrot got a mostly working Forth implementation too. So it was a win all around. :)
Or whatever the word for "not pathetically slow" is.
This is a post-mortem on one of the design decisions of parrot, how it's mutated from the beginning, and the reasons why. In this case, it's about the stacks.
To know why something is the way it is, it's important to have a handle on the history. Sometimes, of course, that boils down to "Seemed like a good idea at the time" but those are the uninteresting ones. Just picking the answer out of the air, well... those you expect to change, unless you've got amazingly good instincts, or are darned lucky.
While Parrot was a register-based design from the start (on purpose and with forethought, but that's an explanation for another time) everyone needs stacks--even though most languages are completely lacking in stacks as an exposed fundamental and essential component.1 And one of the things that invariably happens with register designs is you need to save large wads of data to the stack. Either when making a call or having just been called, registers need to be preserved. They're your working set, and presumably the contents are important. (If not, well, why bother in the first place? :)
Parrot's got several register types. This isn't anything particularly new--most processors have at least a split between general-purpose and floating-point registers. Because of GC concerns we split the general purpose registers into integer, PMC, and String as well. I'd toyed with the idea of a flag on each register marking its type so the various routines could use it appropriately, but that seemed both error-prone and slow, since we'd be twiddling flag bits on every store, and checking them on every load and every run of the DOD. (I also worried about the possibility of security breaches on mis-set flags) We've a general-purpose stack and we could just push each register onto the stack as we needed to, but... our stack's typed, and that'd take up a lot of time. Besides, generally you'd just save all of one type of register, rather than one or two. (Since the major use of the stack would be to save temps across function calls) Thus the register backing stacks were born. This was part of the basic initial design.
With each register file having a backing stack, the next question was what to do with it. This is where the trouble we have now began.
Copying data is more expensive than not copying data. Pretty much a given, really, as it's a lot faster to not do something than it is to do something. For this I had an Aha! moment--why not have the top of register stack and the register set be the same place? That is, rather than having a spot where the registers lived for each interpreter, the registers would just be the current top-of-register-stack frame. If you have multiple frames in one chunk of the stack, well... so much is faster! Popping a register frame is just a matter of changing a pointer, and pushing a frame is also just a pointer change. A push that retains the current contents of the registers requires a memory copy, but that's OK--while you can pay the copying cost, you don't have to if you don't want to. The compiler can choose, for efficiency.
That made stack ops pretty darned fast, which was keen. The downside is that it required two pointer indirections and an add to get at a register, but no biggie, right? Well...
This was the point when Daniel Grunblatt (Now he's doing web traffic analysis work--can't hurt to pop over and see if he does what you need) proved me wrong by implementing the first version of Parrot's JIT. Daniel asked the fateful question "Can I make the JITted code specific to an interpreter instance? It'll be faster." I said yes, since most of the code parrot'll be running is single-threaded, one-shot, non-embedded stuff. There'll only ever be one interpreter for the life of the process. With that one assumption, Clever People (like, say, Daniel) will realize that if the registers are in the interpreter structure then the JIT can use absolute addresses to get to them. No math, no pointer indirection, just encode the actual memory address of the register. So Daniel asked if we could make 'em not move.
While I'm not always quick on the uptake, I can recognize a good thing when someone makes it blindingly obvious and smacks me in the head with it. We moved the register file into the interpreter structure, which threw out one pointer indirection on each register access, at the cost of requiring a memcopy to push and pop a register frame to the stack. Given that accessing the registers happens a lot more frequently than pushing and popping, and that most pushes would have a memcopy anyway, it was an obvious win. Sped up the JIT a lot (something like 10% IIRC) and as a bonus sped up the interpreter by about 3%. Whee!
We still had that chunked stack, though, that could have multiple register frames in each chunk. That still made sense, since we wanted to minimize the time required to save off a register frame, and with multiple slots in a chunk most saves were a counter test, memcopy, and counter twiddle. Which is pretty fast.
Enter Continuations. While we'd always planned on them, I didn't understand them, at all. It was a vague, handwavey "Yeah, we'll do those" thing. The folks on the LL1 list set me straight, and in time2 I became a convert. Continuations were everything we needed to wrap up state for sub calls. There's a lot involved in sub calls once you factor in security, loadable oplibs, lexical pads, and lexical global namespaces, and continuations let us wad that all up into one single thing, which was great--fewer ops, and future-proofing too, since we could add more to the context without breaking old code. (It's tough to save something you don't know about if you're forced to explicitly save individual things)
Continuations, though... those require some stack "fun". More specifically, you need to snapshot the stack at the point you take the continuation. Or, in our case, stacks. We could copy them all, but... yech. That'd be slow, and once we made the jump to a full continuation-passing style we really didn't want to pay that copy on every call, no matter how shallow the stacks. Slow, very slow. Instead we decided to use the COW system, and just mark the stack chunks as closed and read-only.
This unfortunately has three side effects. First, it means that each and every write to a stack chunk needs to check the COW status of the chunk. Ick. Second, it means that if the chunk isn't empty (which is likely) we'll need to copy it if we do anything to it--push or pop. Double Ick. Third, it means that most stack chunks will be at least partly empty, which is wasteful--exactly what the original design was meant to avoid. (And we really bodged up the stack chunk COW too, so we were actually busted for quite a while)
That leaves us where we are today. We have a design that was originally very efficient for what it did that changed, because of changing requirements, to one that's slow and wasteful. Luckily we found this out before we got to the Pie-thon, so it can be fixed before pastry flies, which is good. Doubly lucky, we can actually fix it without any bytecode knowing the difference, which is also good. I like behind-the-scenes fixes. Yay, good layers of abstraction.
FWIW, the new scheme, which is being twiddled on now, is to go to a one-frame-per-chunk stack--that is, each push to the stack allocates a new element. The nice thing here is that it totally eliminates the need for COWing the stack--pushes always allocate a new element which is guaranteed writable, while pops always fiddle with the stack pointer in the interpreter struct which is also always guaranteed writable. Skips the COWing altogether, which takes out a full code path in the push/pop, and since we can allocate frames using the PMC arena allocation code (which is also fast, something like 6-8 cycles to allocate a new PMC if the free list isn't empty) we don't have to write nearly so much code.
Stack ops still aren't free, but, well... what is?
1 Which is weird when you think about it. Phenomenally useful for implementation of a language and yet totally unused in the language itself. Go figure.
2 Though not in time for the first edition of Perl 6 Essentials. D'oh!
We may have to cut a 0.1.1 release of Parrot soon, at the rate things are going. We have runtime loadable bytecode (though we need to get docs written) and more IMCC support for objects--specifically method tags for subs (which sets up the C
Freaky. Really. I'm going to have to get cracking and get the method cache and object internals sped up, at which point maybe we will cut a release...
One of the problems with objects is that, like cats, they occasionally leave dead things under the sofa that you need to explicitly clean up. Generally called finalization (or incorrectly called destruction, but that's a separate issue) this cleanup is a handy thing, as otherwise you'd have all sorts of crud building up as your program ran. Finalization's usually used for things that aren't memory--filehandles that need closing, database connections that need closing, handles on external library resources that need some sort of cleaning up, or weak references you want breaking so other dead things can be found.
Like so many other object things, there's the inevitable fight over what the "right" way to do things is. So far as I can tell, there are three ways, to wit:
The Python/Perl 5 way (in that order, since Larry did swipe it from Guido) is to have the finalizer be a method just like any other method: you look for it and you call it if you find it. If you want your parent methods called, well... you'd darned well better redispatch the call or you're in a lot of trouble. Needless to say, people are sometimes in a lot of trouble. Compounded in perl 5 by a sub-optimal redispatch method, though the NEXT module's finally made it possible to do things right, though only in new code. (Perl's SUPER only looks at the parent classes of the current class, so if you're on the left-hand leg of a multiple-inheritance tree when you SUPER you'll never see the right-hand leg. I don't know if Python is similarly broken)
Both schemes, as they're actual object methods, have the interesting property of potentially ending up reparenting the dying object, thus making it not dead any more. This can be something of an issue if the object is half-dead when some finalization code decides that the object isn't really dead, and you have some half-deconstructed object lurching across the system like an extra from one of the Dawn of the Dead movies. (Or a bad freshman English essay) The Frankensteinian possibilities tend to keep people from doing this, though, something I approve of.
Ruby brings a third scheme to the table--rather than a method on the object, you have a finalization closure that gets called when the object dies and bits of the object are passed in. This way you have a means of cleaning up without actual access to the object, so there's no chance of bringing it back to life--if you want it not-dead you'll have to clone the thing rather than resurrect it.
Anyway, three ways to pick up the trash. Each can be dealt with, and each has its drawbacks when taken individually. The real fun comes in when you mix and match, because how then do you satisfy the constraints and expectations of all the classes in an inheritance hierarchy? More importantly, how the heck do you order them?
If someone's got a good answer (besides "don't do that!") I'd love to hear it...
With the first release to support objects out, the question is now "What the heck can you do with the things?" (And no, I don't know why people ask me these sorts of things, since I don't really like objects. Go figure)
That's a good question. Mostly a better question than "What will I ultimately be able to do with the things" since who knows, maybe I'll give up half-way through and say good enough. It's been known to happen before. (With objects in general, not with me and objects specifically, but that's a separate issue for another day. Tomorrow, maybe)
So, we've got a system where we can call methods, though only specific methods with no fallbacks. We can have classes with one or more parents. Each class in a hierarchy can have as many private slots in an object as it wants. We have a namespace system that works OK as long as you only have a single level of namespace. (And it's possible that, by the time you get around to using namespaces, it won't even be an issue, as getting this fixed up right is high on the list 'o things that need doing)
It is, oddly, amazing what you can do with this sort of thing. It's enough to support many statically typed object systems, though you do need to disallow operator overloading. The lack of implicit construction's a bit nasty, but you can actually get around that with the compiler if you so choose. No implicit destruction's more problematic. On the other hand I have at least three separate, potentially conflicting, destruction schemes. (Don't ask, you don't want to know. I'll tell you anyway, but later) Can't make do without destruction, though, not really. Sorta.
Which means that parrot objects are at a state where it's worth digging in in preparation for 0.1.1, which will have construction, destruction, and fallback methods, but probably not an end unto itself. Yet. ;)
If you've been hanging out on perl6-internals you probably know, but if you haven't and actually care, here's a quick rundown of where Parrot's object system stands right now. Note that this is not the final state, just the state we're in and will be for the 0.1.0 release.
At the moment you can:
Namespaces are also a bit dodgy, though not too bad. (Just don't use multi-level namespaces right now) There's no method cache yet, so things'll be a bit slow for right now. And IMCC has no syntax to support objects so you've got to manage it all by hand at the moment, though it's really not too bad. Details of the system are in PDD 15 if you're so inclined, though not everything that's documented in there currently works. (And there are, I'm sure, things that aren't in there that need to be)
With this release I think we've vaulted parrot firmly into the mid-'80s. (Alas not the early '70s, since OO stuff seemed to go firmly backwards for a few decades from the high-water mark that Smalltalk set, but we're getting there) More of the missing stuff, especially AUTOLOAD and operator overloading/tying, should be in 0.1.1, but we'll see where that goes.
1 Or, as .NET calls 'em, properties. (I think. Maybe not, opinions vary) Object slot variables. Class-private "every object of this class has this thing in it" things.
2 Which continues the method search as if the method that was actually invoked didn't exist
3 Which continues the method search as if the class containing the method that was actually invoked were the base class of the object
Or at least mostly done.
Objects, that is. Everything I planned to have them do for the next release is done, so...
Time to stress-test for the roll out for 0.1.0, and enjoy Parrot's crunchy object-y goodness! Enjoy!
And yeah, I know--constructors, destructors, objects that masquerade as other PMC types, AUTOLOAD (or other fallback method providing mechanisms), cross-type inheritance, and multimethod dispatch would all be nice. Hey, you take what you can get, right? :)
Yeah, this stuff's all getting cranky. Deal. :)
At the moment, I'm trying to work on specifying text stuff for Parrot. Not simple, of course, because text is such a massive pain. Right now I'm just trying to sort the various functions on characters and strings into the right spot so they can be properly overridden, thumped, assaulted, and generally beaten about.
If you've been following along, you've no doubt seen the rants about text, so I won't reprise them (much) and instead go for the actual useful bits. As far as I can tell (and this is all welded deep into parrot's string handling), there are three basic parts to this: the encoding (how bytes map to code points), the character set (what each code point is), and the language (what rules apply to the text).
And yeah, there is some overlap between what the character set and the language does. That's part of the problem I'm facing--this stuff's all been invented a dozen or more times and the decisions that were made were all very reasonable but not entirely compatible.
The encoding bit's the least controversial of the layers. Even with the multibyte non-Unicode encodings that do escape-byte stuff there's a pretty straightforward way to map to a 32 bit integer, so that part's easy. (Note that easy and boring have no relation -- putting together the different byte-to-codepoint mapping tables and corresponding code is going to be terribly dull)
The character set's generally non-controversial as long as you don't pay attention to how the individual characters actually look. (If you do, then fights break out and it's not pretty) Unfortunately Unicode adds in the twist of combining characters 1 so it's not quite enough to look at a single code point to get a single character -- in many common cases for me (most of the languages of western Europe) you've the potential of needing two or more code points to represent a single character. I'm not sure if there are cases where it's reasonable to deal with the parts of a character2, but people seem to insist. I dunno, I can't see any circumstances where n and ñ (that second character's an n with a tilde over it, which can be represented as two code points in Unicode) could be in any way equivalent, but what the heck do I know?
The language bit... that's where things get interesting. Not fun, mind, but interesting. Language is where I put the transforms and meaning-association stuff. Case-folding is a matter for language, as are character classification, sorting, and more complex things like word-break determination.
Part of the problem with the language bit is in defaulting -- it arguably ought to be pulled in from the character set, since the language code ought to be independent of the character set but some of the sets are huge (Unicode!) and often you just don't care for characters out of your language. If you've got a string of Chinese text with "llama" thrown in there you're probably going to treat that as a five-character word rather than a four-character one. (Yes, I know, these days in most places even text that's really marked as Spanish treats ll as a two-character sequence rather than one--humor me, I couldn't find an accented character in a non-roman-based character set in the 20 seconds I took to look) Then there's the issue of whether some of these characters ought to be classified one way or another. Even if you have Unicode text, should 一 (ichi, one in Japanese) (which should look like a horizontal bar, assuming it pasted in right, and I got the right character, and you can view it... isn't text fun?) be considered a digit if it's in a string tagged as French? (Heck, should it be in there in a string tagged as Japanese? The number/non-number-word distinction's a bit fuzzy there. Or at least I'm fuzzy on it) Should it even be considered a word character? And yes, you can argue that this is a good reason to put in restrictions on allowable data, but that's not something Parrot can really do, so it needs dealing with.
Parts of the language handling code are also intimately tied to the character set (you can't upcase "a" to "A" if you don't know that the code point you were handed was an "a") so you almost need to have a multiple-dispatch system set up with per-charset language tables and/or code. Fun enough with roman-based alphabets but it gets potentially really fun when you start throwing in all the asian text variants. (I think, ultimately, it'll be relatively simple. Despite the fact that it looks like you could use Shift-JIS (a Japanese character set) to write out a chunk of Chinese text, I'm not sure it'd be considered Chinese, in which case we have a much more restricted set of charset/language pairs. Except for Unicode, which'll sleep with anyone)
Anyway, I think we can do layering stuff enough to hide this. The encoding layer can hand codepoints back and forth, some transcoding, and that's about it. (well, that and some metadata -- lengths and such) Easy enough.
The character set layer can hand you characters, and provide some defaults for the language code to work with. Most of the per-character informational code can live here (is it a letter, is it a digit, is it upper-case, and so on) though the language code potentially ought to get in the way. That'll be simple delegation for the most part.
The language layer is where the transformational and multicharacter fun lives. Case-mangling and word break detection live here (and yes, I know, for some languages word break detection requires an extensive dictionary, complex heuristics, a lunar calendar, and a good random number generator, but...) as do a few other things.
So, for the moment, the list is (and yeah, I'll post it to the internals list):
defaults for language
is_wordbreak (this'll have to take a position)
compare (takes two strings for sorting)
Ah, damn, then there's the question of whether sorting and equivalence testing should take encoding into account or not. (This is for Unicode mainly where you can have composed and decomposed versions of the same character) Though the answer there is "it depends", I'm sure. Bugger.
1 Something I think may be unique to Unicode. Might be wrong about that, though.
2 As opposed to the individual bytes of the encoded form, which are useful to deal with
Because today I finished up the first draft version of the database access RTL for the big work project. Yeah, it's really fragile, and likely to fall over in odd ways when stressed, and the DB schema itself still needs a touch of work (missing UNIQUEs on the primary index) but... it works. I can insert, delete, and read records, as well as going forward and backwards through the DB table in primary key order. I still need to work on partial key matching, and walking forward through a partial key match set, but that's next, and I can probably put it off for a little bit. But hey, it works, and it's kinda cool to watch what's going on in a separate terminal window from the Postgres psql client.
Between this and the curses-based screen RTL code, the runtime for the DecisionPlus translation project is done enough to use, which means now all I have to do is whack the compiler some to use the RTL as it really looks, rather than how I thought it would work when I was doing the last pass. I fully expect quite a few "What the hell was I thinking?" moments. Who knows, if this goes well, maybe I'll nudge the good folks around the office to let me release the required SQL and bytecode for a working Parrot demo, though I'm not sure any demo that starts with "Use this SQL to create a new table in your Postgres 7.4 or higher Postgres database" will be all that popular.
I should note, for those following along at home, that this is all in Parrot. With a stock (well, latest CVS checkout if you're on a non-x86 platform, or on an x86 platform without the JIT1) interpreter no less. While the resulting bytecode file's a bit big (with the ncurses and postgres wrapper libraries, and the language-specific screen and db RTL a test program's 84K of bytecode, but that's off of 91K of source, so I suppose it's not too bad) it works, and the time to fire up parrot, load in the bytecode, connect to the database (running locally), insert 32 records, delete 1, and fetch 4 is 1.9 seconds. And that includes a 1 second sleep as part of the DB connection code. (Postgres has this weird asynchronous connection scheme where you have to make the connect request then poll to see if it's done. Don't ask, it's just one of those things apparently) 0.14 seconds of combined system/user time total, which is definitely Not Bad.
It's all coming along surprisingly well. It's not even impossible that I might make the Jan 9th demo deadline for this. (I'll definitely have the demo version ready to run for NordU, and if I miss the 9th I still ought to have it for the Boston.pm meeting on the 13th) That'd be nice, as there's lunch for the department on the line, and I'd not like to get between everyone and that... :)
1 We had to add in a number of NCI function signatures for Postgres 7.4, which I'm using because it has placeholders, and placeholders make life very much easier. Parrot's JIT can build the function calls dynamically on x86 platforms, but at the moment we can't do that elsewhere, so we have to have a wad of precompiled function signatures built into Parrot as a stopgap, so you need the latest version to make it all work. Yeah, I know, we should use libffi, and we may as a fallback, if we don't just give in and build up the function headers everywhere.
I swear, text will be the death of me. (If you thought it was objects, nope--they're obnoxious, overblown, and the OO Kool-Ade tastes of almonds, but hardly a full-blown nemesis or anything)
While on the one hand I do really like the shape and form of the alphabets and writing (no surprise, I'm a font magpie too), the implications of actually processing text in these languages are painful to think of. That's even ignoring the issues of rendering or OCRing these sorts of languages. (One big screaming example--you'll note on that page that the trailing sigma has a separate character in the Unicode set, but it should be treated as a plain sigma for text searching reasons. And imagine what should happen to the sigma character if you substr a string and the last character happens to be a sigma that was, up until a moment ago, in the middle of the word. Then you concat a space and another word for display....)
Or, rather, where art thou, since I know why cyan is cyan.
For some reason, I can't get ncurses to make anything foreground cyan. Magenta, red, yellow (well, OK, icky brown), green, white, black, blue... no problem. Just not cyan. Which is really strange. Change the definition of the color to anything else and it works. Hell, I can make the background cyan OK. Just not the foreground. (It shows as the immediately previously defined foreground color, or black if there wasn't one) Setting the foreground to cyan and then setting inverse video doesn't work either. More bizarrely, though, black foreground on cyan background in inverse video works! And displays what I want.
It's just really, really bizarre. Happens on both OS X locally on its terminal and the Linux boxen around the office with the GNOME terminal on a local X station, so it isn't my code so far as I can tell. ncurses just won't do foreground cyan.
Have I ever mentioned that computers really suck sometimes?
Update: Argh! It works from C, but not from parrot, but once again only foreground cyan. Ghods, I hate computers some days...
Update2: Turns out there was a bug in IMCC that prevented a literal 6 from being put in register I6 (and 5 into I5, and 7 into I7) that was causing this. Don't ask, it's been fixed.
I decided to do the only sensible thing with Forth strings: dodge the whole damn question. I implemented a p" word instead which puts a parrot string on the stack when the word it's compiled into is executed. With that, and a few other bits of jiggery-pokery, I was good to go. And I do mean good--this code:
: loadcurses p" library/ncurses.pasm " loadpasm ;
: initscr p" ncurses::initscr " findglobal 0 preg invoke resultP ;
The next inevitable step is a full forth-based life program, as that's the inevitable demo. (It's the Programmer Magpie Effect :)
So, I've been thumping away at parrot's forth implementation, and it's been going pretty well. Bunch of control structures are in (do/loop, do/+loop, begin/again, begin/until, if/then), some horribly insecure stuff has been added (' and compile,), and since Forth can live in its own little world I added in cell space for "address" base storage. In this case "address" and "offset into the cell array" are identical, but that's OK -- the data space, at least according to the standard, doesn't have to correspond to anything else, just be accessible. Which it is.
Unfortunately once the cell space comes in, we get into string issues, since the cell area and strings start getting intertwined.
Forth mandates counted strings. And, at least as far as I can figure, it also assumes that each cell in the cell space can hold a single character. Except... for us, it doesn't have to. I can stuff a whole string into a single cell in cell space, and honestly I'd prefer to do that if I can, and take advantage of our built-in string handling code. (We're already counted, and if I don't have to see someone write a UTF-8 decoder in Forth I think I'd count myself lucky) So I've two alternatives. I can leave Forth as it is, with strings being a series of integer characters with a prepended length word (so the string "Cat" would take up 4 cells) or I can make strings atomic things which'd make integration with parrot much nicer but break the forth standard.
Bah. I should do it both ways, just because...
While I ought to detail, in detail, why I don't really hate Unicode, just some of the uses people make of it... instead I've been hacking at the forth implementation in Parrot.
Still integer-only, and it has a single mixed stack (so when ints and strings are put in they'll all coexist on the stack), but Parrot Forth now has working if/then, begin/again, and begin/until control structures. Oh, and when you compile words you really compile words--it generates bytecode on the fly for compiled words.
Woo! (and, I might add, Hoo!)
Now to go wedge in an interface to Parrot's calling conventions so we can write ncurses_life.forth to go with ncurses_life.imc. That's probably the sign of some Apocalypse or other though not, alas, a Perl 6 apocalypse. (When I add in the object extensions to Forth, then expect something big)
Or, "Fifth", I suppose, since it's not quite Forth. As I said earlier, I'm working on Parrot's Forth implementation. It was originally written by Jeff Goff, and the core (what I consider most of the tough bits) is done and has been working for ages, just nobody noticed. The plan (subject to change if it doesn't pan out) is to use Parrot as the core engine for the language the big Work App uses. (The current engine is old and has a number of issues, not the least of which is a really primitive syntax and a simple integrated ISAM database with limits that we're hitting every week or three)
This project's actually coming along nicely--I've a compiler of sorts for the language that'll translate it into perl code, and we'll use that as a fallback plan if need be--and I should be able to start emitting PIR code (Parrot's intermediate representation, the stuff we feed into IMCC) with about a week's more work. Unfortunately that's not nearly good enough to actually do anything, since most of the interesting functions of this language live in its runtime--specifically its screen and database handling.
I've got code that converts current-format databases over to Postgres databases, complete with triggers and views to preserve the ISAM flavor, and Parrot has interface libraries for both ncurses and Postgres. What I don't have is the library code that'll get between the compiled code and raw ncurses and Postgres, to make sure the semantics of the current language are preserved without having to emit great gobs of assembly for each statement compiled. I could write that library code in assembly, and Parrot assembly is awfully nice as assemblies go, but still... Don't think so.
The sensible thing, then, is to grab a language that compiles to parrot and use that. I could use the language I'm writing the compiler for but, let's be honest, if it was good enough to write that sort of library code I wouldn't have to be writing a compiler to retarget the damn thing. (Well, OK, I would, as the database part of the code is still running us into walls, but the language makes COBOL look sophisticated)
Parrot's got a number of partial and full languages that compile to it, but throwing away the gag languages (besides, Befunge doesn't support Parrot's calling conventions) it's down to either a nice compiled Basic or Forth and, for a number of reasons, I chose Forth. It's simple, I like it (I like Basic too, FWIW), and expanding it isn't a big deal for me, unlike with Basic, at least our current Basic implementation. (Which is nicely done, thanks to Clint Pierce, but the code requires more thought than my gnat-like attention span can muster at the moment)
Now, the current Forth, as it stands, is only a partial implementation, with the lack of control flow its biggest issue. I've been throwing new core words into it all day, in between handling the fallout from Parrot's directory restructuring today. It's dead-easy, and with the cool assemble-and-go bits of parrot (no need to even assemble the .pasm file, just feed it straight into parrot and you're good) there aren't even separate compile and run phases. Can't get much easier than that. I snagged a draft copy of the ANS Forth standard (Draft 6, from 1993, so it's not exactly up to date, but I don't have Starting Forth handy) and have been going at it with some glee.
With it working, at least partially, there comes the urge to tinker. Meddle if you will, and alter the essential Forth-ness of the implementation. Having a combined int/float/string/PMC stack, rather than separate stacks, is the first big urge. Having strings as atomic data (rather than addresses and counts on the stack) is a second urge. Adding in OO features is the third. (OO Forth, after all, is less bizarre than an OO assembly) And integrating into Parrot's calling conventions is a fourth. I think... I think I may well do them all.
While I'm at it, I think I may well redo its compilation stage as well. Right now all the words with backing assembly dispatch straight to it, while the user-defined words are kept as a string, as if the user had typed them in, and re-parsed and interpreted each time. That's a clever way to get up and running quickly, but since Parrot can generate bytecode on the fly, well... I think I might build a full-fledged compiler for this stuff. It should also take very little time, be very compact, and be awfully forth-y. We'll see what tomorrow brings.
Well, fun at least.
You may or may not know, but Parrot's got a simple but functional Forth implementation as part of it. Nothing fancy, just the base line scanner and math functions, but the compiler works so you can do things like 10 20 + . and get 30 as you'd expect, or : GIMME_30 10 20 + . ; if you wanted to package it up as a new word.
Anyway, I need a Parrot language a bit higher-level than plain assembly for work, and if anything counts as "a bit higher level than assembly" it's Forth. Heck, the standard doesn't even require floating point numbers. Or integers larger than 16 bits as a baseline, for that matter. (Though 32 bit integers are required to work, so you have to fake 'em if they aren't there) So, since I've been fond of Forth forever, I figured it's time to go extend the thing and add in the missing bits.
Which, it turns out, is (at least for the non-control-flow words) darned easy and really compact. End result is maybe a dozen or so opcode_t cells for most of these things, which, honestly, is just damned cool. Currently the forth implementation unrolls user-defined words to a sequence of primitives and then executes the primitives, but I think I may see about generating bytecode on the fly with the built-in assembly functionality. That'd be damned cool too. :)
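The unroll-to-primitives scheme is simple enough to model in a few lines. A toy sketch in Python (this is not the Parrot implementation, just the shape of the idea): built-in words dispatch to primitives, and defining a word with `:` flattens its body down to a primitive sequence that gets replayed on each call.

```python
# A toy Forth outer interpreter. "." appends to `out` instead of printing,
# so the results are easy to inspect.
stack = []
out = []

prims = {
    "+": lambda: stack.append(stack.pop() + stack.pop()),
    ".": lambda: out.append(stack.pop()),
}
words = {}  # user definitions, stored as flattened primitive sequences


def run(tok):
    if tok in prims:
        prims[tok]()
    else:
        stack.append(int(tok))


def interpret(src):
    toks = src.split()
    i = 0
    while i < len(toks):
        t = toks[i]
        if t == ":":                        # : NAME body ;  defines a word
            name = toks[i + 1]
            end = toks.index(";", i)
            body = []
            for w in toks[i + 2 : end]:     # unroll nested user words too
                body.extend(words.get(w, [w]))
            words[name] = body
            i = end + 1
        elif t in words:
            for w in words[t]:              # replay the unrolled primitives
                run(w)
            i += 1
        else:
            run(t)
            i += 1


interpret("10 20 + .")                       # out is now [30]
interpret(": GIMME_30 10 20 + . ;  GIMME_30")
```

Replaying the flattened sequence works, but every call re-walks the list -- which is exactly where compiling straight to bytecode starts looking attractive.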
Like so many things I go on about, keep in mind that this is from the standpoint of a VM designer/implementor--I understand and do infrastructure. I have precious little experience implementing applications, unless you consider something like Parrot an application. (Though in this realm I suppose it is, of sorts) I don't know how much of this will be useful, but even if all you come away with is a sense of bitter cynicism about text handling then I'll consider this worthwhile.
First piece of advice: Lose the conceit that Unicode Is The One True Answer. It isn't. It isn't even One Of Many True Answers. It is, at best, a partial solution to a problem that you probably don't even have. A useful partial solution (unlike others, not that I'd mention XML here... :) to be sure, but that's all it is. That's all it attempts to be, and it generally succeeds, which is fine.
The problem Unicode shoots for is to build a single character set that can encode text in all the world's languages simultaneously. Not make it easy to manipulate, just to represent. That's it. Do you need to do that? (Really? Really?) Odds are the number of people who do is vanishingly small, and almost all of us are living on the other side of the application divide. At best most people and applications need to handle stuff in the 7-bit ASCII range, possibly the full Latin-1 or Latin-9 set (with its accented characters and whatnot), and maybe a single language-specific set of characters. Unicode is, as a practical matter, unnecessary for most folks. Latin 1 or Latin 9 works for Western Europe, and North and South America; China's served by the GB encodings (Simplified Chinese), Taiwan by Big5 (Traditional), Japan by one of the JIS standards, and Korea by a KOR encoding. A good chunk of India may well be served by Latin-9 as well. That's about 4 billion people, give or take a few, who don't need Unicode in day-to-day computing.
Yeah, I know, "But what about data exchange? What about cross-language data?" What about it? Odds are you read English (as you're reading this) and at best one other language. You wouldn't know what the hell to do with Japanese/Chinese/Korean/Hebrew/Arabic/Cyrillic text if it was handed to you, so what good is being able to represent it? I mean, great, you use Unicode and you're in a position to take a vast number of characters that you have no clue what to do with. Swell. You now have a huge range of text to badly mishandle.
Second piece of advice: Try not to care about how your text is represented. It's likely your program really doesn't have to care.
Really. Think about it. When was the last time you had to actually care that a capital A has an ASCII value of 65? The overwhelming majority of the time you're dealing with character data symbolically or abstractly. You want to know if the string "ABC" is in your string data, or you want to make it uppercase, or you want to sort it. None of that needs to know, or care, that A is 65, and a good chunk of the code that does know is either busted or is (or ought to be) library code rather than application code. The string library code does need to know, but you don't. Well, unless you're writing low-level library code, but I'd bet you aren't.
Yes, I know, there are languages that sort of force this on you if you use the native string (or, in the case of C, "string") types, but even then you usually don't have to know, and often you're better off paying as little attention to the way the base string type is represented as it likely sucks and is often depressingly broken.
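To make the point concrete, here's everyday string handling that never once touches a character value (Python here, but the same holds in most languages with a decent string library):

```python
s = "the quick Brown Fox"

# All symbolic: no code point values anywhere in sight.
assert "Brown" in s                          # substring search
assert s.upper() == "THE QUICK BROWN FOX"    # case mapping

# The library knows the gory casing rules so you don't have to --
# German sharp s uppercases to "SS", and your code never sees a 65.
assert "straße".upper() == "STRASSE"
```

Every one of those operations leans on representation knowledge buried in the library; none of it leaks into the application code.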
I say this as someone who does have to make strings work right. It's a lot of work, sometimes bizarre and unusual work, filled with more edge cases than a Seurat that someone's played connect-the-dots on. Leave as much of the work to the libraries as you can. It won't be enough--it never is--but it'll make things less bad.
Third piece of advice: Don't deal with string data you don't understand. That means that if you don't understand Japanese, don't let Japanese text get in. No, I'm not saying this because I'm elitist or I think you should live in linguistic isolation, but be realistic. If your application is designed to handle, say, personnel data for Western European companies, then you're probably fine with Latin 9, and using it restricts your input such that there's far less invalid data that can get in. If you're writing a Japanese text processing system, then one of the JIS sets is the right one for you.
In those (let's be honest, really rare) cases where you think taking this piece of advice will turn out to be restrictive, well, then take piece #2 and use an abstract character type and restrict it to the set you really need to begin with. You can open it up later if you need.
Fourth piece of advice: Make your assumptions explicit! That means if you're assuming you're getting english (or spanish, or french, or vietnamese) text, then don't just assume it--specify it, attach it to your string data, and have checks for it, even if they're just stubs. That way, when you do open things up, you've a reasonable chance of handling things properly. Or at least less wrong.
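One cheap way to do that is to carry the assumption around with the data and check it at the border. A sketch in Python -- the class and its rules are made up for illustration, and the validators really are just stubs that can grow teeth later:

```python
# Tag text with the language it's assumed to be in, and check the
# assumption at the boundary instead of three modules downstream.
class TaggedText:
    def __init__(self, text, lang):
        self.text = text
        self.lang = lang
        self._check()

    def _check(self):
        # Stub validators: today just "does it fit the claimed repertoire";
        # real per-language rules can slot in here when things open up.
        if self.lang == "en":
            self.text.encode("ascii")        # raises if the assumption is wrong
        elif self.lang == "fr":
            self.text.encode("iso8859-15")   # Latin-9 covers French
        # unknown tags pass through for now


name = TaggedText("Smith", "en")             # fine
works = TaggedText("œuvre", "fr")            # fine -- œ is in Latin-9

try:
    TaggedText("日本語", "en")               # assumption violated: fails here,
except UnicodeEncodeError:                   # loudly, at the border
    pass
```

The payoff is exactly the one above: when you do open things up, the assumption is written down in one place instead of smeared implicitly through the whole program.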
Fifth piece of advice: Learn a radically different foreign language. And its writing system. For folks coming from a European language background, one of the Asian languages is good, and vice versa. Even if you're never at all fluent, competent, or even haltingly unintelligible, you'll at least get some feel for the variation that's available. Believe me, there's nothing quite like moving from English (with its dead-simple character set and trivial word boundaries) to Japanese (where there are tens of thousands of characters, finding word boundaries requires a dictionary and heuristics, and there are still some valid arguments as to where word boundaries even are). I expect Chinese is similar, and I can't speak for any other language.
Sixth piece of advice: Be very cautious and very diplomatic when handling text. The written word is a fundamental embodiment of a culture and a language. You disparage it, or existing means of encoding it, at your peril. After all, China's had written language for well over three thousand years, give or take a bit--who the heck are you to tell them that they're doing it wrong? Telling someone their language makes no sense or their writing system is stupid (yes, I've heard both) is not a good way to make friends, influence people, or make sure your program works on the whole range of data you've so foolishly decided to take.
Seventh, and final, piece of advice: There is no 30 second solution to make a program handle multiple languages or the full range of Unicode right. (Okay, that's not exactly true. Step 1, bend over. Step 2, grab ankles. You can bet that step 4 isn't profit...) Honestly, don't try. If you're only going to make a cursory or half-hearted attempt at handing the full range of text you're going to accept, I can pretty much guarantee that you'll screw it up more than if you didn't try at all, and screw it up a lot more than if you made a real try at getting it right.
Aren't strings fun? Yes, I'm well aware that a good chunk of the practical advice is "If you don't have to, don't." The corollary to that is if you do have to, make sure you know what you're doing and that you know what you don't know you're doing. Understanding your areas of ignorance (areas where you should go find someone who's not ignorant to help you out) will get you much further than you might think.
Update: Yeah, this originally said Latin-1 instead of Latin-9, but it turns out I was wrong there--you can't do French and Finnish in Latin-1. The Eastern European languages don't work with Latin-1 or 9 either, unfortunately, which does sort of argue for Unicode if you need to work across the EU.
Update #2: Latin 9 and Latin 1 are nearly the same--Latin 9 drops some of the less-used characters and replaces them with the characters French and Finnish were missing. Oh, and the Euro sign. The Latin-X sets are all ISO-8859 standard sets. More details here
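The difference is easy to see from any language with codec support. In Python, for instance, iso8859-15 is Latin-9, and the characters Latin-9 added simply don't encode in Latin-1:

```python
# Latin-9 (ISO 8859-15) swaps a handful of little-used Latin-1 characters
# for ones French and Finnish actually need, plus the Euro sign.
for ch in "€œŒŽž":                 # Euro, French oe ligatures, Z-caron
    ch.encode("iso8859-15")        # fine in Latin-9
    try:
        ch.encode("iso8859-1")     # but not representable in Latin-1
    except UnicodeEncodeError:
        pass
    else:
        raise AssertionError(f"{ch!r} unexpectedly fit in Latin-1")
```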
Parrot's had the facilities in it to call native functions for quite some time (months, possibly upwards of a year) but we've really not used it any--it's just not solved any real problems for folks doing parrot development. Well, since I'm looking at parrot as a target for production work, I've started using it. At the moment, as part of the parrot repository, there are interface files for ncurses (base and form lib) and PostgreSQL.
There's an ncurses version of life in the examples/assembly directory as well, if you want to play around with it. (It's in PIR format, so it's a touch tough to decipher by hand, though if you go back some versions in CVS you'll find a more readable version) Useful? Well... no, not really, at least not at the moment. (Though I need the ncurses and forms stuff for work) Cool, though.
The PostgreSQL interface is also really keen, in its own way. (Though I'm already a touch annoyed with the connection scheme for Postgres. Polling. Bleah) It means, with a bit of code--like, say:
.pcc_sub _MAIN prototyped
    .param pmc argv
    .include "postgres.pasm"
    P17 = global "PostgreSQL::PQconnectStart"
    P0 = P17
    S5 = "host=dbhost dbname=sbinstance user=username password=somepassword"
    invoke
retry:
    P18 = P5
    P0 = global "PostgreSQL::PQconnectPoll"
    invoke
    print "status: "
    print I5
    print "\n"
    eq 3, I5, continue
    eq 0, I5, panic
    sleep 1
    branch retry
panic:
    print "Argh! Failed\n"
    end
continue:
    P0 = global "PostgreSQL::PQexec"
    P5 = P18
    S5 = "create table foo (bar int)"
    invoke
    P0 = global "PostgreSQL::PQresultErrorMessage"
    invoke
    print S5
    print "\n"
    end
.end

You can add a new table, foo, to your postgres database. Presumably other things too, I've just not written the PIR to do it. (Though the full PostgreSQL 7.3 C interface is wrapped)
Dunno whether this counts as scary, or really cool. Or, I suppose, both. :)