February 10, 2004
Compilers. Not, it turns out, a difficult thing
So I'm sitting here working on a "write a compiler for parrot" article for O'Reilly, partly because they asked and partly to nail down all the bits of compiler info that I've gathered as I've been working on the compiler for the office.
Compilers are easy.
No, really, they are. Not, mind, that I'm saying that optimizers are easy, as they're not, sorta. (Well, OK, they're actually not all that tough to implement, once someone else has done the hard work of creating the abstract optimizations. That, like any other brain-warping creative act, is damned difficult, though the simplicity of the result often makes it seem less tough than it is. E = mc² seems pretty simple too, though you'd be hard-pressed to find more than a small handful of people who could come up with it if someone hadn't already. But I digress.) But compilers? Nah, no biggie.
Sure, there's some art to it. Creating a grammar that produces a nicely usable result makes it much easier, but that's not too tough to do once you've done one or two and gotten a feel for it. I heartily recommend implementing your own simple compiler before tackling a more complex one. You'll spend less time and effort writing a simple compiler and then a complex one than you would writing just the complex compiler. (I do not kid) Still, if I managed to figure it out (and I hate parsing), you can too. Useful hint -- grammar rules should either produce a simple value, delegate to one of several simple rules, or be funky. It's easier to handle:
parenexpr: '(' expr ')'
simpleval: parenexpr | stringconstant | numericconstant
than it is to handle
simpleval: '(' expr ')' | stringconstant | numericconstant
since in the former case the simpleval rule can unconditionally delegate to one of several simple rules, each of which has just one form, while in the latter case it needs to figure out which form it's got and which piece to delegate to.
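To make the delegation concrete, here's a hand-rolled recursive-descent sketch in Python (a stand-in for Parse::RecDescent, with a tiny invented expr grammar): each simple rule has exactly one form, so simpleval can look at one token and delegate unconditionally.

```python
# Recursive-descent sketch of the parenexpr/simpleval split.
# Tokens are plain strings in a list; we pop from the front.

def parse_simpleval(toks):
    """simpleval: parenexpr | stringconstant | numericconstant"""
    if toks[0] == "(":
        return parse_parenexpr(toks)       # delegate; no form-juggling here
    if toks[0].startswith('"'):
        return ("str", toks.pop(0))
    return ("num", toks.pop(0))            # anything else: numeric constant

def parse_parenexpr(toks):
    """parenexpr: '(' expr ')' -- exactly one form to handle"""
    assert toks.pop(0) == "("
    node = parse_expr(toks)
    assert toks.pop(0) == ")"
    return node

def parse_expr(toks):
    """expr: simpleval ('+' simpleval)* -- stand-in expression rule"""
    node = parse_simpleval(toks)
    while toks and toks[0] == "+":
        toks.pop(0)
        node = ("+", node, parse_simpleval(toks))
    return node
```

The point is structural: because parenexpr has only one form, simpleval never has to untangle which alternative it's in mid-rule.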
And once you've gotten that nice grammar to build you a tree of nodes? Then the compiler just emits a boilerplate header, delegates the processing of each statement node to its handler (which probably delegates to its children's handlers), and then emits a boilerplate footer.
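That header/delegate/footer shape fits in a few lines of Python; the node kinds and the emitted pseudo-assembly below are invented for illustration (real PIR looks different):

```python
# Sketch of "boilerplate header, per-node delegation, boilerplate footer".
# Statement nodes are tuples whose first element names their handler.

def compile_program(statements):
    out = [".sub main"]                # boilerplate header
    for stmt in statements:
        out.extend(emit(stmt))         # delegate each statement node
    out.append(".end")                 # boilerplate footer
    return "\n".join(out)

def emit(node):
    return EMITTERS[node[0]](node)     # dispatch to the node's handler

def emit_print(node):
    return [f"  print {node[1]}"]

def emit_assign(node):
    return [f"  set {node[1]}, {node[2]}"]

EMITTERS = {"print": emit_print, "assign": emit_assign}
```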
I keep waiting for the other shoe to drop. This stuff really can't be that easy, or everyone'd be doing it all the time. I have this nasty feeling that I'm underestimating the amount of work that goes into writing a parser generator like Parse::RecDescent or yacc here. (I do have some idea how much trouble it is to write an engine to target, but honestly Parrot (and .NET, and the JVM) aren't that difficult to do. Sure there's a lot of work there, but not much complexity)
Ah, well, I need to write faster, both so I can finish the article and nail down the grotty bits of grammar and compiler creation before I forget.
Posted by Dan at February 10, 2004 01:32 PM
A few notes -- first, the way you lay out your grammar rules, and the parser generator you use, will influence error detection and recovery. During error recovery the data structures constructed by the parser must be discarded. If it's not done properly, this may lead to various problems (e.g. segfaults, as a couple of bugs recently demonstrated in perl 5.)
Second, don't forget the tokenizer. It does real work too: typically it manages a symbol table, updating it whenever it sees an identifier declared or used, and it opens included files. Perl 5's parser and optree builder are actually simpler than its tokenizer. (But Perl 5 is an edge case.)
Third, writing a compiler generator is not hard. The common algorithms are well documented. Compilers for real languages are more difficult to write than compiler compilers :)
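The tokenizer point is easy to see in miniature. A toy Python lexer (token classes invented for the example) that maintains a symbol table as a side effect of scanning:

```python
import re

# Three token classes: numbers, identifiers, single punctuation chars.
TOKEN_RE = re.compile(r"(\d+)|([A-Za-z_]\w*)|(\S)")

def tokenize(src, symtab):
    toks = []
    for num, ident, punct in TOKEN_RE.findall(src):
        if num:
            toks.append(("NUM", num))
        elif ident:
            # The tokenizer's side job: record every identifier it sees.
            symtab[ident] = symtab.get(ident, 0) + 1
            toks.append(("IDENT", ident))
        else:
            toks.append(("PUNCT", punct))
    return toks
```

A real tokenizer for a language like Perl 5 also has to handle includes, heredocs, and context-dependent lexing, which is where the complexity the commenter mentions comes from.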
Uhm. Ever looked at the list of languages? Everyone _is_ doing it! :-)
Heh. You might think so, but the number of people writing languages is, in general, pretty darn small. On the one hand this is a good thing--how many badly done mutant variants of ALGOL do we need? On the other, folks often go to horrible lengths to work around problems that'd be solved with a domain specific language.
I suppose, when all's said and done, we'll end up with a variant on the old journalism quote: "Inside every programmer is a language waiting to get out, and that's the best place for it". But still, it's often (as I'm finding at work) handy to be able to write a compiler or two.
Speaking from experience using Yacc / lex / java cc for 10+ years.
1. Even parsers are not easy. Yacc will choke on a language in which you need to look ahead more than one token.
2. After you've managed to write a yacc/lex parser that produces an AST, the major challenges are still ahead. (Such as going from your AST to reverse polish notation, for example. And note - this is *not* optimization).
I suppose that when you say "Compilers are easy", you actually mean "Parsing is the easiest part of the compiler ;-)".
If this is what you are saying - then I agree. Sorta.
(Still, many people drop lex as a lexer, and some drop yacc (a bottom-up parser) as well, in favor of top-down parsing)
"And once you've gotten that nice grammar to build you a tree of nodes? Then the compiler just emits a boilerplate header, delegates the processing of each statement node to its handler (which probably delegates to its children's handlers), and then emits a boilerplate footer."
That 'just' is just plain wrong. Think about reverse polish notation, for example.
Well... when I said "compilers are easy" I meant that going from source to PIR for Parrot was easy. That is, I freely admit, a relatively limited domain, as my destination is fixed and a reasonably nice target to compile to. I'm also using a top-down parser at the moment, since I'm using Parse::RecDescent. Going to, say, PostScript or Scheme might be a bit more of a hassle, and a bottom-up parser might emit a tree that's less easy to manage. I can also see multi-token lookahead making life more difficult.
Still, I'm not really sure that's the case--at least if it is, I've not seen evidence of it yet. There's no reason that handling infix, prefix, or postfix source should make much of a difference.
My potentially overwhelming naiveté may well be an indicator of the size of the Big Shoe, waiting to drop when I least expect it. (Certainly wouldn't be the first time.)
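For what it's worth, the reverse-polish worry can be made concrete: once you have a tree, emitting postfix is just a post-order walk. A minimal Python sketch (node shapes invented for the example, `("num", v)` leaves and `("op", left, right)` interior nodes):

```python
def to_rpn(node):
    """Post-order walk: emit children first, then the operator itself."""
    kind = node[0]
    if kind == "num":
        return [node[1]]
    # binary operator node shaped like ("op", left, right)
    return to_rpn(node[1]) + to_rpn(node[2]) + [kind]
```

Prefix output would just move the `[kind]` to the front of the same walk, which is why the source notation shouldn't matter much once the tree exists.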
The (only) challenge in writing compilers (and I've written a few) is the appropriate "balancing" between the several processing layers that make up a compiler.
To prevent Big Shoe effect, may I recommend reading "Compiling for the .NET CLR" by John Gough?
The guy has 20+ years of writing compilers, yet in his book you will find examples of "here we need to guess because there is no way to tell before we try" things.
That is (of course) not because he is stupid. That is because he is smart and experienced. Honestly, it is a really good book about simple compilers.
As to modern 'complex' optimizing compilers - that's of course a totally different story, and yes, messing with the L1 cache is involved too.
I'm sorry if this sounds a bit fuzzy.
I'm looking forward to your article for O'Reilly; I am sure it will make great reading!
No, compilers are easy, I agree with Dan entirely. I'm (hopefully) doing a talk at the next set of Perl conferences that shows how to make them even easier.
They're a lot of work, though. It's quite tedious running through each nonterminal in your parse tree and writing a generator for it. The best thing you can do is treat the code like code. Many times treating the code as a data entry task is the right way to do things, but not here. You will benefit greatly from a role that implements almost-identical behavior in several nodes.
And there are a couple of idioms that really help you out, too. But I have to leave something for my talk, right? :-)
You can actually deal with a lot of the hassle of non-terminals with optimized grammars and delegation--for example, if you've got the rule:
foo: bar | baz | xyzzy
then the rule to handle foo can just delegate to the subnode rule that actually matched. Granted, you can't do that all the time, but it takes a lot of the hassle out of things.
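In handler terms, that looks something like this hypothetical Python sketch: if the parser tags each node with the alternative that matched, foo's handler is pure pass-through.

```python
# foo: bar | baz | xyzzy -- foo's handler just forwards to whichever
# alternative actually matched; only the leaf handlers do real work.

def handle_foo(node):
    return HANDLERS[node[0]](node)

def handle_bar(node):
    return f"bar({node[1]})"

def handle_baz(node):
    return f"baz({node[1]})"

def handle_xyzzy(node):
    return f"xyzzy({node[1]})"

HANDLERS = {"bar": handle_bar, "baz": handle_baz, "xyzzy": handle_xyzzy}
```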
FYI, a couple of URLs on parsing are: