August 28, 2003

A dirty little secret of CS

Compilers are easy.

No, really, they are.

Not simple, mind, but easy. The two aren't the same thing, though people often conflate them. (I'm not saying that optimizing compilers are easy, however. They might be, but as I've not written one yet I can't say for sure. The literature makes me think that they aren't, and I'm willing to trust it.)

Raw parsing is a pain, but tools like yacc and Parse::RecDescent make that darned trivial for most things. Interpreting the parse tree you get back is, once again, really simple. As far as I can tell (and I'm 95% of the way through doing it for a language that I thought was going to be tricky) doing the transform from the tree to source of some sort, be it C, perl, assembly, or whatever, is also trivial in the majority of the cases, with the only trouble coming in for weird corner cases and source/destination semantic mismatches, but most of that can be dealt with by a bit of thought when generating the parse tree. (A good parse tree is 90% of the battle in handling things on the back end)

The runtime, honestly, is a much bigger hassle.

Posted by Dan at 07:00 PM | Comments (4) | TrackBack

August 26, 2003

So close....

Sobig disappointed, with deliveries trending downward. Copy 40,000 showed up at 3:53:54 EDT. I should put together graphs or something.

Posted by Dan at 08:31 AM | Comments (2) | TrackBack

August 25, 2003

sobig, sobigger, sobiggest

At this rate, I'll hit the 40,000 sobigs received mark at 2003/08/26 3:44AM EST.

Have I mentioned this has gone well past ridiculous?

Posted by Dan at 03:23 PM | Comments (3) | TrackBack

Optimizing Parse::RecDescent grammars

For work I have to (amongst other things) write a compiler, something to turn a locally used 4GL (and I use the term loosely) into perl. The language in question's pretty simple, so the parsing and compilation shouldn't be much trouble. The language doesn't even have named subs or multifile modules (it's all gosub/goto style stuff), and there's none of C or perl's "It's a statement and an expression!" stuff. Statements are statements, expressions are expressions, and if you want multiple statements on a line you have to separate them with colons, or spread them across multiple lines using the backslash as a line continuation character. There are no blocks, either--if your then or else clause has multiple statements in it you have to colon-separate them or use the backslash line continuation stuff.
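
To give a rough feel for it, here's an invented fragment in the style just described. It's not actual source from this language, just a guess at the flavor based on the description above:

    . comments start with a period
    let total = 0
    count = count + 1 : gosub 1200
    if count > 10 then total = total + count : \
       goto 2000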

To do this, I reached for the sensible first tool--Damian Conway's oft-celebrated Parse::RecDescent module. I'd considered lex and yacc, but that would require either doing the entire compiler in C (a language I really loathe for text processing) or writing a two-pass thing, with the yacc-built compiler emitting some intermediate tree and a perl program then transforming that tree into perl code. Yacc would be faster, but since this is a one-off thing for each source module it doesn't (at least right now) seem worth it. Besides, whatever grammar I come up with for P::RD will be more or less usable for yacc, so there's not much to lose going this way to start.

Anyway, I'm not a parser guy--parsing languages involves syntax, and the long-standing joke is that I just don't do syntax. Still, you do what you need to, so I sat down with O'Reilly's "lex & yacc" book and the (quite excellent) docs for P::RD for a few evenings to get a feel for grammars, and started in.

Now, P::RD, and recursive descent parsers in general, have a reputation for being slow, or at least slower than their alternatives. I knew that going in, and my first successful run against the largest source file I had handy (250K of source text) seemed to bear that out. I wrote a test driver program that takes the grammar filename and source filenames as parameters, does a little preprocessing (throwing out blank lines and combining lines with continuation characters), parses the file, building a tree automatically using P::RD's autotree directive, and dumps that tree to disk with Data::Dumper. A first pass at parsing this file took 7 minutes 44 seconds of user time. Not good. (The dump, FWIW, takes about 4 seconds of user time.) Time to optimize, and herein lies the tale.
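
For the curious, the driver itself is nothing fancy. A stripped-down sketch of that sort of program (not the actual work code; the start rule name, file handling, and exact preprocessing are all guesses here) looks something like this:

    #!/usr/bin/perl
    # Sketch of a P::RD test driver, not the real thing. The start rule
    # name ('program') and the .tree output suffix are made up.
    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;

    sub slurp {
        my $name = shift;
        open my $fh, '<', $name or die "Can't open $name: $!\n";
        local $/;
        return scalar <$fh>;
    }

    my ($grammar_file, @source_files) = @ARGV;
    die "usage: $0 grammar source ...\n" unless @source_files;

    # The grammar starts with <autotree>, so P::RD builds the parse
    # tree on its own as it matches.
    my $parser = Parse::RecDescent->new(slurp($grammar_file))
        or die "Bad grammar in $grammar_file\n";

    for my $file (@source_files) {
        my $source = slurp($file);
        $source =~ s/\\\n//g;       # splice continuation lines back together
        $source =~ s/^\s*\n//mg;    # throw out blank lines

        my $tree = $parser->program($source)
            or die "Parse of $file failed\n";

        open my $out, '>', "$file.tree" or die "Can't write $file.tree: $!\n";
        print $out Dumper($tree);
        close $out;
    }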

The base grammar for this language was very simple. I went through the manual and did a straightforward translation from the valid syntax to a P::RD grammar. Mostly constant text, and a few alternatives for some of the commands and statements. In those cases where there are alternatives, since P::RD does first-match rather than longest-match, I went with the

    rule:  foo bar baz |
           foo bar
rather than the more terse
    rule:  foo bar maybebaz
maybebaz:  baz | #empty thing
for clarity. Sections are in the order they appear in the language's reference manual, and the commands within each section are listed alphabetically, since that's how the manual has them.

Since nearly 8 minutes of user time was far too long, it was time to optimize the grammar to see what I could see. Since not running a rule is faster than running it, I was shooting to reduce the number of failed rules that had to be tried before we hit the successful one. The assumption throughout is that this monster program, the second-largest in the set, has the closest-to-average frequency distribution of constructs and statements, so tweaking for it would be OK for everything else. (Something that's not been tested yet, but it seems a good assumption.)

First off, the comment rule. It was /\..*$/, and I changed it to '.' /.*$/, since that was a simple thing to try and I figured comments didn't really warrant much time. That, alas, took the runtime up to 7:56. Bad, definitely bad.
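
For reference, the two versions look roughly like this (the rule name is a guess at what the grammar calls it); presumably the split form loses because it's two token matches per comment instead of one:

    comment: /\..*$/          # original: a single regex
    comment: '.' /.*$/        # split version: a literal plus a regex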

The next thing to try was some reordering of the section rules, to see if putting the most commonly seen section first would help. That helped, though less than you might expect, taking the runtime down to 7:55. Not much help.

Next came a pass through the grammar to look for more reordering, at which point a bug turned up--I was using the same rule name twice, for different things. I hadn't caught that since I'm not doing anything with the output yet. (The current assumption is that if the source parses properly then the output is sufficiently OK to make it to the compilation part, which isn't written yet.) This may or may not have been generating bogus output, but fixing it took the runtime down to 7:36. Better, but still horrid.

The single most common command in the source is assignment, so I moved the assignment rule to the head of the command list. That took the runtime down to 6:54. Woohoo!

In this language there's no "assignment" statement as such; there's really the let statement (let foo = bar), except that the let is optional. The rule was an alternation of let variable '=' expr and variable '=' expr. Splitting that into two rules, assignment and let-assignment, cut the runtime down to 6:14. (I told you there's a lot of assignment going on...)
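
In grammar terms the change was roughly this, with both new rules hung off the command list (the rule names here are illustrative, not necessarily what the real grammar uses):

    # before: one rule, trying and failing the 'let' form for every plain assignment
    assignment: 'let' variable '=' expr |
                variable '=' expr

    # after: two separate rules
    assignment:    variable '=' expr
    letassignment: 'let' variable '=' expr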

Still, that's slow. Bleah. Since this language is pretty simple, it was possible to do a quick transform of the source, whip it through sort, cut, grep, and perl, and come up with a frequency count of all the keywords in the program. (Yes, it could've all been done in perl, but the pipeline took less time to throw together.) The frequency order was nothing like alphabetical, so the alpha listing of the commands was suboptimal. Moving the 10 most frequently used things (assignment was #1, followed by if, goto, and gosub) to the head of the command list took the runtime down to 5:17. Almost a full minute off the previous time--not bad at all.
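
A pure-perl version of that frequency count would look something like this. It's a sketch, not the actual pipeline; it assumes statements are colon-separated and that a leading word followed by '=' means an assignment:

    #!/usr/bin/perl
    # Rough keyword frequency count, e.g.: perl keyword_freq.pl big_program.src
    use strict;
    use warnings;

    my %count;
    while (<>) {
        s/\\\s*$//;                       # drop continuation backslashes
        next if /^\s*\./;                 # skip comment lines
        for my $stmt (split /:/) {
            next unless $stmt =~ /^\s*([A-Za-z]\w*)/;
            my $word = lc $1;
            # a bare "variable = expr" has no leading keyword; lump those
            # (and the optional-let form) in as assignment
            $word = 'assignment' if $word eq 'let' or $stmt =~ /^\s*[A-Za-z]\w*\s*=/;
            $count{$word}++;
        }
    }
    printf "%6d  %s\n", $count{$_}, $_
        for sort { $count{$b} <=> $count{$a} } keys %count;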

But not good enough. I like fast, and 5 minutes is just not fast. So, taking the optimization advice in the P::RD docs and FAQ, I turned some of my nice clear rules into less nice rules that failed less. (See those two rules at the beginning of this post? I went from the clear version to the terse one.) A good pass on this took the runtime down to 4:40. Terse and less maintainable for a 10% speedup--works for me, especially since the language grammar's been fixed for a decade or so.

In that simplification, I didn't address the 'ifelse' rule. As in most other languages, you can write either "if expr then statement else statement" or "if expr then statement" and drop the else. The rule was:

   ifelse: 'if' expr 'then' statement 'else' statement |
           'if' expr 'then' statement
Changing that to
   ifelse: 'if' expr 'then' statement elseclause

elseclause: 'else' statement | # Or nothing, really

cut the parse time down to 2:55. We do a lot of else-less ifs, apparently. Yay, much faster.

Can we go faster still? Yes, why yes we can. The last big alternation rule left in the grammar was the multistatement rule. It was

   statement: command ':' statement |
              command
That made sense, but changing it to:
   statement: command colonstatement

colonstatement: ':' statement | # or nothing

instead, skipping the expensive alternation, cut runtime down to 1:23. So we ultimately went from 463 seconds to 83 seconds, cutting 82% off the time to parse with some relatively simple fixups. (And, for the record, putting the rules back in alphabetical order, sections as they appear in the manual, raises the runtime to 1:49, which is a pretty big jump)

What does this tell us?

Well, first, failure is expensive. I know, I know, Duh!, but it bears repeating. Failure shows up both in rules that actually fail and in rules that succeed, but only after one or more alternative versions have failed first. Scrutinize those rules! Make them as cheap as possible, since odds are (unless you have a really skewed frequency distribution) that there will be far more failures than successes. Order the rules so you've got the best chance of hitting the correct rule first, up to a point. (If you have a rule that succeeds, say, 10% of the time but is horribly expensive to run, you may find it faster to put more-likely-to-fail, but far cheaper, rules before it.) Factor out common prefixes, like the "if expr then statement" part of the ifelse rule, which is valid for both the if/then and if/then/else forms.

Don't accept lousy performance without a fight! I spent two days fiddling, and cut what was potentially 4.5 hours of user time per pass over the source tree down to 48 minutes of user time. Or, if you prefer wall time, 9 hours down to 1.5 hours, figuring about 50% CPU utilization. It only takes the equivalent of one and a half full passes over the source tree (and, as I don't have the actual compiler part written, I expect far more passes than that during development) to make up for the time I took to fix the grammar. (Like it or not, 8 hours of time in the office does not equal 8 hours of full-on work, alas.)

Not to mention the reduction in complaints from folks when a machine they use is bogged down for days on end. This is running on a shared machine, and it's nice to be stingy, er, polite with shared resources. (Besides, the compiler phase will, no doubt, eat up quite a lot of time itself...)

Posted by Dan at 12:04 PM | Comments (10) | TrackBack

August 24, 2003

WTHI posts for the near future

I've been pretty swamped the past few weeks, as the parrot folks will attest, what with a job change and all. (For the record, I'm working for a company named Yarde Metals in Southington CT, a scant 12 minute commute away. WebEvent, my previous employer, is going strong, and if you need a web-based calendaring system I'd recommend taking a look at them) Anyway, I'm trying to get my schedule back in order again. Between Parrot and the (long-neglected) Cocoa with perl book I'm kinda swamped.

I do have a few WTHI posts in progress--the two current ones are on first-class functions and character sets/character encoding. If anyone's got other things they're curious about, post a reply here and I'll see what I can do. (I do have the previous list around somewhere, so if it's on there I'll probably get to it at some point)

Posted by Dan at 03:20 PM | Comments (2) | TrackBack

Math of Sobig.f

I'd say it was the aftermath of sobig.f, but the damn thing is still hitting at a touch more than three a minute. I expect the thing'll flare up on Monday once people start getting back into work and powering back on infected machines and getting back from vacation and reading infected e-mail.

This hasn't entirely been a waste, though. It's the last straw to get me to upgrade my infrastructure at home, so I'm off with $300 in hand to find a new x86 box with sufficient disk space and power to be my home server. Given that the reference machine is a 300MHz Celeron with 128M of RAM and 17G of disk (across three drives), that shouldn't be tough. I'm not going to bother trying to upgrade the current one--there's no room for new drives, and I know better than to try a full personality wipe and recreate on this box. Much better to just get a new one, set it up, copy things over, and swap IP addresses.

That means I get to finally dump sendmail for something else (probably qmail, mainly because Ask uses it for the perl.org servers and I can leech handy config settings off him :) and get in place the blackhole-with-backoff system I've been fiddling with to temporarily block the spam-sending machines. (Though I may well block them at the mail server level rather than at the routing level, dunno yet)

If nothing else I need more disk space for the mail spool. I got very lucky this time, but when sobig.g or its moral equivalent hits, and it will, I think I'm going to need a couple of gig free to handle the first day's onslaught of crap.

Posted by Dan at 12:13 PM | Comments (2) | TrackBack

August 21, 2003

Sobig.f was so big...

How big was it? I've gotten nearly half a gigabyte of it and bounces from it in the past two days, in 16617 messages. (About 10% of the incoming traffic from this damn thing is in bounce messages from well-meaning mail filters, traffic I could really do without, thanks) There's probably more, but there's a limit to the number of elements I feel like throwing into the filter regexes for my procmail log.

I'm really thinking a short-term autogenerated blackhole route manager tied into my mail filters is in order--if someone sends one of these damn viruses they get a 1-day timeout. (No packets for you, mister!) Given that I was seeing the same IP address delivering 5 or 6 of these damn things at a time, I think that might well have cut down on the noise.

I'm getting closer to wanting ISPs to generally block outbound port 25 except to the local mailserver, even though that'll be damned inconvenient to me. I will pay for a static IP and low-grade business line if need be. (I've already got the static IP, and I will, reluctantly, pay more for outbound port 25 capabilities if it means that my DSL provider, and all the others, generally block 25...)

This probably means a wave of spam from the zombie machines is coming too. You can't imagine how thrilled that prospect makes me... :(

Update: I apparently added things up too soon--it looks like procmail doesn't record the size of the entire message when it's sent to /dev/null, so the 12.5K messages that were directly binned only showed header sizes in the log, not full message sizes. Looks like the total byte count is off by about a factor of five...

Posted by Dan at 02:36 PM | Comments (1) | TrackBack

August 20, 2003

Time for more tunnelling fun

At least as soon as I figure this out. I've managed to get my 2.2.somethingorother linux box to NAT the boxes behind it so they can all get out to the 'net. Which is a nice thing, as web proxies only go so far. Now, though, I've got another issue to deal with, and that's allowing machines behind the NAT to act as servers to systems elsewhere on the intarnet.

I'm assuming that the way to do this is to take an unused port on the linux box and transparently gateway it to the SSH port on the machines behind it, but I'm not sure how exactly to do that and maintain all the appropriate links and whatnot. I'm sure there's some simple ipchains voodoo to do this, but beats me what it is.

Posted by Dan at 03:29 PM | Comments (2) | TrackBack

August 19, 2003

More rounds of nonsense

Well, the blaster worm and its variants (including the "helpful" one that patches the hole for you) are making their rounds, though I think at this point the thing's dying down, which is good. One interesting side effect may well be a massive upswing in the number of computers infected with viruses. I've gotten more than a hundred "your virus has been infected" bounces so far this morning, and I'm finding it hard to believe that so many postmasters, but so few people, have shiny fresh virus signatures. Seems more likely that the "reinstall windows" recommendations have been taken, but folks either forgot to reinstall the virus scanner or haven't grabbed the multimegabyte signature update for it yet.

Not that reinstalling windows in this case is a bad thing--given that the hole that blaster exploits has apparently been exploited by other, rather more nefarious and silent, programs for at least two weeks before blaster showed up, a full reinstall is likely a good thing. (Though there is always the issue of figuring out how to get a machine that's remotely exploitable patched when you have to be on the net to patch it...)

I'd have some hope that people will update themselves and get this taken care of, but I've looked at my logs--if folks haven't cleaned up after Nimda, Code Red, and Slammer yet, why the heck should I expect them to clean up after this one?

Posted by Dan at 08:08 AM | Comments (0) | TrackBack

August 12, 2003

IP masquerading made easy

Turns out to be even more dead-simple than I thought. (And I don't even have to reconfigure any ethernet addresses anywhere, just the gateway address the DHCP server hands out)

This made it easy, which is cool. Two commands, a single line in the dhcpd config file, and a dhcpd restart. Sweet. The hardest part was figuring out the netmask.

Posted by Dan at 07:25 PM | Comments (0) | TrackBack

August 10, 2003

The end of an era

And a drop in my geek quotient. After nearly 20 months, someone finally noticed that the DSL line a previous employer had installed for me wasn't supposed to be up. (I suspect it came out in the bankruptcy proceedings, but I don't know for sure. I wasn't a contact for it, nor was I actually involved with the thing in any way) And so, we're now a single DSL household. Not a big deal, as I've been anticipating this for, well, 20 months and none of the house services hang off that line as anything other than a failover, but it does mean that I need to finally get around to installing IP masquerading on the linux server so the house clients can get out for things other than HTTP. (I'm at Starbucks right now so I can do some final sync ups with the work CVS repository)

This does mean that anyone trying to get to glastig is going to be out of luck until I can either arrange for an alternate way in (probably bouncing through tuatha) or I get a multi-IP pack from SNET. (I'm currently down to a single IP address)

On the up side, I now have two compatible SDSL boxes. I probably ought to see if some of our friends are in my local exchange and see what getting a dry pair would cost...

Posted by Dan at 10:57 AM | Comments (0) | TrackBack

August 04, 2003

Surreality

A few weeks ago, in a small spate of impulse music buying (damn you, iTMS! :) I tracked down and ordered a copy of Luther Wright & The Wrongs' "Rebuilding the Wall", a Canadian bluegrass reworking of Pink Floyd's The Wall. It's... odd. Definitely very odd. And somehow appropriate.

Posted by Dan at 11:02 AM | Comments (2) | TrackBack