November 15, 2007

Configuration data

Configuration systems are a collection of data, a tree of rules, and a set of templates. (Templates are arguably just funny rules, but it's convenient to treat template instantiators and rules as separate things.) Which rules get fired off, and when, depends in large part on the data. Let's, then, consider the data.

For a configure system we probably don't care much about what type an individual data element is. int, float, boolean, string, whatever -- ultimately these things will get turned to strings and splatted into templates, so adopting a dynamic-language-style morph-on-demand basic data type is a reasonable thing. So we'll do that.

There are things we do care about aside from the actual value of a data element, things we use to decide what rules need firing off. For example, we care if the element:

  1. Has been initialized
  2. Has changed
  3. Has a default value

All of these things are needed to help decide which rules to fire off.
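As a rough sketch of what such a data element could look like, here is one possible shape in Python. The class name, field names, and the choice to let the engine decide "changed" by comparing old and new values are all assumptions made for illustration; nothing here is a settled design.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DataElement:
    """One configuration value plus the bookkeeping the solver cares about."""
    name: str
    value: Any = None
    initialized: bool = False   # has any value been assigned yet?
    changed: bool = False       # has the value changed since it was loaded?
    is_default: bool = False    # did the value come from seed data?

    def set(self, value: Any, *, default: bool = False) -> None:
        """Assign a value. Assumption: seed values are not marked as changed,
        and otherwise 'changed' is decided by comparing old and new values."""
        if not default:
            self.changed = (not self.initialized) or value != self.value
        self.value = value
        self.initialized = True
        self.is_default = default

    def as_text(self) -> str:
        """Morph to a string on demand, ready to be splatted into a template."""
        if not self.initialized:
            raise ValueError(f"{self.name} has no value yet")
        if isinstance(self.value, bool):
            return "1" if self.value else "0"
        return str(self.value)


int_type = DataElement("int_type")
int_type.set("int", default=True)   # seed value: safe, not marked as changed
int_type.set("long")                # a probe found something better
```

After the second call, `int_type.changed` is true and `int_type.is_default` is false, which is exactly the information a solver would need to decide whether dependent rules should fire.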

#1 is straightforward enough. If a data element has no value, then we can't use it yet. That means when we're ordering rules to see which we're going to run, we need to run the ones that set the value before we run the ones that use it.
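Ordering rules so that setters run before users is a plain topological sort. The sketch below assumes each rule declares what it needs and what it sets; the rule names and the dictionary layout are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each lists the data elements it needs and the ones it sets.
rules = {
    "write_config_h": {"needs": ["int_size", "cc"], "sets": []},
    "probe_int_size": {"needs": ["cc"], "sets": ["int_size"]},
    "probe_compiler": {"needs": [], "sets": ["cc"]},
}


def order_rules(rules):
    """Return rule names ordered so every value is set before it is used."""
    producer = {}
    for name, rule in rules.items():
        for element in rule["sets"]:
            producer[element] = name
    # Map each rule to the set of rules that must run before it. Elements
    # with no producing rule are assumed to come from seed data.
    graph = {
        name: {producer[e] for e in rule["needs"] if e in producer}
        for name, rule in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


run_order = order_rules(rules)
# -> ['probe_compiler', 'probe_int_size', 'write_config_h']
```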

#2 is also straightforward. If a rule depends on X, and X changes, then we probably need to re-run the rule. The exception is that we don't need to run the rule if it doesn't actually do anything we need. For example, if rule Foo depends on X and provides Y, and we don't need Y, then there's no point in firing the rule off, assuming we don't allow side-effects. (Which we don't, but I'll get to that another time.) Whether a data element has changed can also be an interesting question -- who decides? Does the solver engine decide, by comparing the old and new values? Or does a rule affirmatively declare that it's returning X and yes, in fact, X should be considered changed? (And what about the cases where it returns X, with a changed value, but says it hasn't changed? We'll get to that another time too.)
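The Foo/X/Y case can be sketched as a small selection function: fire a rule only if one of its inputs changed and one of its outputs is actually needed. The function name and data layout are assumptions, and this picks only the first wave of rules; a real solver would repeat the step as fired rules change further values.

```python
def rules_to_fire(rules, changed, wanted):
    """Pick rules whose inputs changed and whose outputs somebody needs.

    rules:   name -> (set of inputs, set of outputs)
    changed: names of data elements whose values changed
    wanted:  names of data elements the final templates need
    """
    # Work backwards from the wanted values to find every useful element.
    needed = set(wanted)
    grew = True
    while grew:
        grew = False
        for inputs, outputs in rules.values():
            if outputs & needed and not inputs <= needed:
                needed |= inputs
                grew = True
    return sorted(
        name
        for name, (inputs, outputs) in rules.items()
        if inputs & changed and outputs & needed
    )


rules = {
    "Foo": ({"X"}, {"Y"}),   # depends on X, provides Y
    "Bar": ({"X"}, {"Z"}),   # depends on X, provides Z
}

fired = rules_to_fire(rules, changed={"X"}, wanted={"Z"})
# -> ['Bar']; Foo is skipped because nobody needs Y
```

Skipping Foo is only safe under the no-side-effects assumption: if Foo did something beyond producing Y, not running it would change the outcome.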

#3 is somewhat less straightforward. Configure systems generally don't start from nothing when they run -- there's seed data that has adequate but not necessarily optimal values that you can prime the system with. For example, you may not know what the best C integer data type to use is, but it's certainly safe to default to "int". The system may probe for a better value, perhaps long or int64 or something, but in its probing it can use int until it finds a better answer. Or you may, by default, assume you can't fork (a safe assumption) but you can always come back and probe to see if you can.

The utility here is that you can ship safe defaults, enough to get the system in a state to go find better answers. Sure, maybe you've filled in the defaults with lowest-common-denominator answers from the C89 and ANSI operating system standards, but that's enough to get you in a position to figure out what the actual useful answers are.

Alternately you could mark values loaded up from the default cache as changed, but... maybe not. There are actually cases where you don't want to mark a default value as changed.

Consider, for example, the person who's actually packaging up things for distribution. We've kind of hand-waved up until now on how these 'trivial' shell scripts, batch files, or default config.h (or whatever) headers and such get created, but this is a good time to un-wave the hands.

The packager could run the build on each OS you want to support and collect up all the results. But... ewww. That means the packager needs to be minimally fluent with each system, and have access to each system, and if you lose access (or lose the person with fluency) you have a system fall off the list. Not good.

On the other hand, the packager has a perfectly fine configuration system handy. If there's also a set of default values for each system (which someone else can build, can be handed down from maintainer to maintainer, or set to something really simple) then the packaging can just load up the defaults, generate files, and then go on to the next system. If you treat the defaults as unchanged then, if your rules are correct, you won't need to actually run any probes and you're good.
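The packager's loop might look something like the sketch below: load the defaults for each system, instantiate the templates, and move on, with no probes involved. The template text, the system names, and the default tables are all made up for illustration.

```python
from string import Template

# Hypothetical template for a default config.h.
CONFIG_H = Template("#define INT_TYPE $int_type\n#define HAS_FORK $has_fork\n")

# Hypothetical default tables, handed down from maintainer to maintainer.
DEFAULTS = {
    "generic": {"int_type": "int", "has_fork": "0"},
    "linux": {"int_type": "long", "has_fork": "1"},
}


def package(system):
    """Generate config.h text for one system from its defaults alone."""
    values = {**DEFAULTS["generic"], **DEFAULTS.get(system, {})}
    try:
        return CONFIG_H.substitute(values)
    except KeyError as error:
        # A missing value is the one case where a probe would be needed,
        # which for the packager signals a missing default.
        raise RuntimeError(f"no default for {error} on {system}") from error


generated = {system: package(system) for system in ("linux", "aix")}
```

A system with no table of its own (here "aix") simply falls back to the generic lowest-common-denominator values.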

When the person doing the final build gets the package, they run the generated simple shell scripts, which build the configurator with the defaults. The configurator then loads in the default values, starts probing, and rebuilds itself. Same templates, same rules; the only difference is that the packager doesn't do the probe and rebuild step, since they don't have to. (Or, rather, if they do it's a sign there's a missing default.)

None of this is all that interesting, but it's handy to have as a base when looking at the more interesting bits. Which we'll do later.

Posted by Dan at 06:57 PM

November 07, 2007

Choose your (runtime code instantiation) poison

Continuing on from the previous post, let's consider some of the functionality of a configure system. You will, inevitably, want to instantiate some rules at runtime. You don't really need to do this -- a code generator the release manager runs could generate all the code explicitly from templates -- but let's be honest; you're going to want to.

You'll want it for cases where you have a template rule that needs to have a separate instantiation for each file in a directory. You'll want it because when you're designing your rule language you're going to want to run a rule for each element in a list or array, which means you really need separate instantiations of the rule for each element. You'll want it because it's a shiny thing and programmers just can't resist Teh Shiny. (Admit it, you know that's true.)

There are four different ways to handle this.

You could consider each rule as a closure, close over the environment, which includes the current array entry or filename or whatever.

You could have a way to clone an existing rule and make changes at clone time, basically a rule factory.

You could consider embedding the configuration system's compiler in the configuration engine and just treat the code as text, and recompile things on the fly, like perl's string eval.

You could consider a partial compilation system with placeholders, and basically implement Lisp macros to dynamically generate rules on the fly.

Each of the options has its own strengths and weaknesses.

The closure option is the least attractive of the four options. What you're generally closing over -- what the rule depends on, or produces, or the files a template operates on -- is really more metadata than data. And yes, you can close over that, but it feels kind of awkward. Which isn't to say that closing over the environment is a bad thing, even in a functional language with no side effects (which I hadn't mentioned, but there you go), but it's really not the right thing here. The only real upside is that this should work, but it's hacking an existing concept in a way that's not particularly true to the concept.
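To make the awkwardness concrete, here is a minimal sketch of the closure approach in Python. The function names and the `cc -c` command string are illustrative assumptions only.

```python
def compile_rule_for(filename):
    """Return a rule that has closed over one particular filename."""
    def rule():
        # filename is captured from the enclosing scope.
        return f"cc -c {filename}"
    return rule


rules = [compile_rule_for(name) for name in ("main.c", "util.c")]
commands = [rule() for rule in rules]
# -> ['cc -c main.c', 'cc -c util.c']
```

It works, but the fact that each rule depends on a particular file is buried inside the closure. A solver that wants to order rules by what they depend on and produce has to be told that metadata some other way.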

The rule factory's more interesting. It's a very limited sort of thing, specialized to do one thing and presumably do it reasonably well. That's not bad, if what you want is what it does, and in this case it is. I'm not all that fond of limited solutions like this, though, as they often feel like someone just stopped thinking and made the best of what they had thought about.
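A rule factory could be as small as the following sketch, which clones an existing rule and changes a few fields at clone time. The `Rule` fields and the `clone_rule` helper are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    name: str
    needs: tuple      # data elements or files the rule depends on
    provides: tuple   # what the rule produces
    command: str


compile_template = Rule(
    "compile", needs=("cc",), provides=(), command="{cc} -c {source}"
)


def clone_rule(template, source):
    """Clone a rule, changing only the per-file fields at clone time."""
    obj = source.rsplit(".", 1)[0] + ".o"
    return replace(
        template,
        name=f"{template.name}:{source}",
        needs=template.needs + (source,),
        provides=(obj,),
        command=template.command.replace("{source}", source),
    )


rules = [clone_rule(compile_template, s) for s in ("main.c", "util.c")]
```

Unlike the closure version, the per-file dependencies stay visible as ordinary data on each cloned rule, so a solver can read them directly. The limitation is that the factory can only vary what it was built to vary.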

The compiler option's a common one, and it definitely has its advantages. The big downside is that new rules mean doing text mangling and then compiling the result -- we've got to have the compiler handy, which makes things more complex. Plus this kind of text filtering can be problematic, as we've seen in perl. (Though as we've established, one thing that a configuration processor is going to have to be good at is template processing.) On the other hand, having the compiler handy means that you can provide the rule system as text, so if something goes wrong the clever end user can fiddle with it to make it right, without penalizing the end user who doesn't want to be clever.
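Here is a sketch of the treat-code-as-text approach, using Python's built-in `compile` and `exec` as a stand-in for perl's string eval. The rule text and helper names are invented for the example.

```python
from string import Template

# Rule source kept as text; $ident and $source are filled in per file.
RULE_SOURCE = Template(
    "def rule_$ident(values):\n"
    "    return values['cc'] + ' -c ' + $source\n"
)


def instantiate_rule(filename):
    """Fill in the rule text, compile it, and return the function and text."""
    ident = "".join(c if c.isalnum() else "_" for c in filename)
    # repr() quotes the filename so odd characters cannot break the code;
    # this is the fragile part of any text-substitution scheme.
    text = RULE_SOURCE.substitute(ident=ident, source=repr(filename))
    namespace = {}
    exec(compile(text, f"<rule {filename}>", "exec"), namespace)
    return namespace[f"rule_{ident}"], text


rule, text = instantiate_rule("main.c")
command = rule({"cc": "gcc"})
# -> 'gcc -c main.c'
```

Because the generated rule exists as readable text, a clever end user could edit it before it is compiled, which is the upside described above.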

Then there's the Lisp macro system, though it's mis-named, or at least deceptive these days, since most people think of macros in terms of text, or recorded little action things in a spreadsheet. Basically you manipulate the compiled code, copying, adding and potentially removing chunks of executable as you string it together into a final form. Very much like the rule factory, actually, except without stopping thinking too soon. (The rule factory is, in fact, a subset of the macro system.) Creating a new rule is just a matter of copying the code for an old rule and making some judicious substitutions.
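Python has no Lisp macros, but its `ast` module gives a rough feel for the idea: parse a rule once, then copy the parsed tree and substitute placeholders rather than mangling text. The placeholder names `RULE_NAME` and `SOURCE` and the helper names are assumptions for this sketch.

```python
import ast
import copy

# Parsed once; RULE_NAME and SOURCE are placeholders in the tree.
TEMPLATE_TREE = ast.parse(
    "def RULE_NAME(values):\n"
    "    return values['cc'] + ' -c ' + SOURCE\n"
)


class Substitute(ast.NodeTransformer):
    """Replace the placeholders in a copied rule tree."""

    def __init__(self, rule_name, source):
        self.rule_name = rule_name
        self.source = source

    def visit_FunctionDef(self, node):
        self.generic_visit(node)          # also rewrite the function body
        if node.name == "RULE_NAME":
            node.name = self.rule_name
        return node

    def visit_Name(self, node):
        if node.id == "SOURCE":
            # Swap the placeholder for a constant node holding the filename.
            return ast.copy_location(ast.Constant(value=self.source), node)
        return node


def instantiate_rule(rule_name, source):
    """Copy the template tree, substitute, and compile the result."""
    tree = Substitute(rule_name, source).visit(copy.deepcopy(TEMPLATE_TREE))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, f"<rule {rule_name}>", "exec"), namespace)
    return namespace[rule_name]


rule = instantiate_rule("compile_main", "main.c")
command = rule({"cc": "gcc"})
# -> 'gcc -c main.c'
```

The filename never passes through source text, so there is no quoting problem to get wrong, and the template is parsed only once no matter how many rules are stamped out of it.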

Personally, I like the macro system form best. It doesn't have to be completely exposed in the actual configuration language, merely implemented that way. (Much the same way as compiling Fortran into object code that does all its control flow via CPS -- the fact that the underlying implementation involves continuations is irrelevant (kinda) to the compiled language, which doesn't have them. It just means you can potentially do Clever Things with less pain.)

It also just feels like the right thing to do. If the config system is potentially going to be dynamically generating dozens, or hundreds, of rules on the fly and pitching them at the solver engine, something a bit less haphazard than text substitution and recompilation seems in order.

Posted by Dan at 10:38 PM

November 04, 2007

Configurafication

So, on the near-anniversary of the last post, I've been thinking about configuration systems. What they are, what they do, and how you make 'em work.

They're one of those things most people never think about, or if they do it's at best as a user. You probably ran the autoconf-generated shell script, or did whatever else the package you installed needed. It did some strange magic, built some files, and you were done. Easy, right?

Well, okay, maybe not so much. If you've ever poked around inside the guts of an autoconf-generated script you're probably aware there's more concentrated evil in there than you'll find outside the source for Clippy. It's just nasty.

That's not really a surprise; it is a multi-unix shell script designed to probe the environment of a system it knows so little about it can't even really count on a properly functioning shell. And yeah, I know, it's a lot better than it was, but still... Ewww. Just because the insides are justified (and possibly obligatory) doesn't make it any less Lovecraftian. (So I'm not knocking autoconf, it just scares me. A lot)

As a developer, configuration systems are a pain, for a few reasons. The first is portability -- by their very nature, configuration systems go look for things that aren't on the system you're developing on. (Because if they were you wouldn't need to go look to see what they are) Writing portable code is annoying, because it requires making good assumptions from the beginning, and that's tough. Especially because, if you're at all comfortable with the system you're working with, it means making assumptions that are different than your default ones.

The second big pain with config systems is that it's a hassle to track down bugs, again because it involves systems different than your own. How can you tell that the code you built on your 32-bit x86 linux system doesn't work when configured and run on a 64 bit AIX box with more than 4G of memory if you don't actually have a 64 bit AIX box with more than 4G of memory?

And the final big annoying thing about configuration systems: like test frameworks, they're probably completely different from the app you're actually writing. That is, there's likely little or no overlap between the mindset needed to do the thing you like to do (writing the game or chat client or whatever it is) and the thing you need to do (write the configuration system). Unless you're writing a configuration system, in which case it gets all meta.

Then there's the whole not knowing what you actually need to go look for. Most of the annoyingly quirky things (like the PDP-11's wacky 32-bit integer format, which is neither big nor little endian, but rather middle endian) you had to deal with in the past have thankfully died enough that you don't have to care, but if you've never had to deal with the vagaries of how shared objects are built and work on a half dozen systems you'd never even think to go probe for them.

All of which is why people generally reach for autoconf, and you really can't blame them. But autoconf is evil, and for a complex system it's inadequate as well, since there's a damn sight more than just a makefile and config.h that you need to build up when you're configuring.

So anyway, configuration systems. What the heck are they, anyway?

If you think about it, there are four bits to configuration systems:

  1. Rules
  2. Probes
  3. Template instantiation
  4. Seed data

Seed data gives you sane, functional (though possibly only barely functional) defaults. Probes gather information from the environment to override the defaults, rules decide what probes need to be triggered and what values are produced, and instantiated templates are what you end up with. Possibly in an iterative way, since a lot of environment probing involves instantiating templates (that is, little C programs) which are compiled and possibly run.
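The flow from seed data through probes to an instantiated template can be sketched in a few lines. The value names, the template text, and the probe (which asks the running Python interpreter instead of compiling a little C program) are all stand-ins chosen for illustration.

```python
import os
from string import Template

# Seed data: safe, possibly only barely functional defaults.
seed = {"int_type": "int", "has_fork": "0"}


def probe_has_fork():
    """Stand-in probe: check the running interpreter for fork support."""
    return "1" if hasattr(os, "fork") else "0"


# Hypothetical probe table: value name -> function that finds a better answer.
probes = {"has_fork": probe_has_fork}


def configure(seed, probes, template_text):
    """Start from the seed data, let probes override it, then instantiate."""
    values = dict(seed)
    for name, probe in probes.items():
        values[name] = probe()
    return Template(template_text).substitute(values), values


text, values = configure(
    seed, probes, "#define INT_TYPE $int_type\n#define HAS_FORK $has_fork\n"
)
```

Values with no probe (here `int_type`) simply keep their seed value, which is what lets the system produce usable output before it knows much of anything.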

It's all very dependency based; to instantiate a template you need to have all the input values, which different rules produce, and some template instantiations depend on other instantiations, which depend on yet other instantiations, and so on. Not all that much different than what make does, only with built-in actions (really built in, into the executable, rather than predefined ways to invoke programs) and generic dependencies rather than time-based file dependencies.

Or, if you like, the mutant bastard child of Prolog and Template::Toolkit.

More interestingly, if you're going to build a configuration system, you can actually manage to do so from scratch, with a bare-bones shell script or batch file, a C compiler, and the linker, which is a pleasant change from the past, where you couldn't. What I mean by this is that, if you assume a C89 standard C compiler (which, given that it's been 18 years since that standard was made seems safe), you can manage to get everything you need to probe the environment.

Think about it. What do you need to build our configuration system? You need to read in and parse rules, data, and templates, you need to do dependency ordering, you need to instantiate templates, you need to spawn off subprocesses to check individual bits and pieces, and you need to read in the results of those probes. That is, you need fopen, fread, fwrite, fclose, system, and a boatload of templates. Everything else is built into the configuration program.
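The write-a-file, spawn-it, read-the-answer cycle looks roughly like this. The sketch is in Python rather than C89, and the probe is a tiny Python script run with the current interpreter instead of a compiled C program, purely so the example is self-contained; `run_probe` is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile


def run_probe(source_text):
    """Write a probe program to disk, run it, and return what it printed."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "probe.py")
        with open(path, "w") as handle:      # the fopen / fwrite / fclose step
            handle.write(source_text)
        result = subprocess.run(             # the system() step
            [sys.executable, path],
            capture_output=True,
            text=True,
            check=False,
        )
    if result.returncode != 0:
        return None                          # probe failed: keep the default
    return result.stdout.strip()             # the read-the-results step


# Example probe: how wide is a pointer on this machine?
pointer_bits = run_probe("import struct\nprint(struct.calcsize('P') * 8)\n")
```

A failed probe returns `None` rather than raising, on the assumption that the caller falls back to the seed value when a probe can't produce an answer.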

All of those things are guaranteed in C89. (And yeah, I know, there may be systems that don't handle them all right, but at this point it's pretty safe to assume they all do.) Putting the code together so it compiles and links with close to no knowledge of the system is pretty straightforward too: a series of cc commands followed by ld for unix and its like. For Windows you use whatever the most common compiler is, or you use some environment variable to pick one (and have a prompt at the beginning of the batch file for the C compiler and linker -- you can be pretty sure that if someone's running a .bat file it's Windows, after all).

Yes, that does mean that maybe you're not using the compiler and linker options the user really wants, but that's fine, because once you've built yourself you just go ask and do it again. Or provide a way to set some environment variables or command line parameters if that's your preference. Worst case there is you need to compile everything twice, but you're probably going to do that anyway, since a full-fledged configuration engine needs to be able to do system-specific stuff, which means probes and rebuilding. The good bit about all that is that you don't need anything to configure your configuration system. Which is good, because otherwise you end up chasing your own tail, and that's annoying. (not to mention hell on your back)

And, of course, once the engine is already built, anything else that uses the engine for configuration already has a prebuilt set of values that can be pretty easily filled in to skip most of the probes. (Or you can package the config system with whatever needs it and just build it, since a half dozen or so extra compiles and links aren't going to noticeably increase the length of time it takes to probe the system for settings)

There's more, but this is enough for now.

(This is all Jim Keenan's fault, I should mention that, in case anyone's curious)

Posted by Dan at 04:03 PM