July 14, 2005

WWIT: Making the destination exist

Here's one that caused much consternation at times. I ruled that the destination of an assignment must always exist. That is, code that looks like:

add Pdest, Pleft, Pright

needs to have a real Pdest -- that is, it's an in parameter, not an out parameter.

Some people hate this. "Why can't my add function create the destination?" they cry.

Well, because in many cases the destination already exists. Creating a new PMC unconditionally means generating a lot of temporary garbage. More temps means more garbage for the garbage collector, more temps needed from the allocator, and more time spent in bookkeeping with these temps.

Consider, if you will, these two scenarios:

a = b + c

and

d = e + f + g

where we're only interested in the "e + f" part.

In the first case, the destination already exists. (We can't just arbitrarily whack destination name/variable bindings, because the destination might be active data. That warrants a What The Heck Is post of its own. Active data has been the bane of many optimization strategies, but it's also phenomenally useful, and more importantly an integral part of Perl so there's no ignoring it) There's no need to create a temp for the "b + c" addition, as we already have a spot to stick the result -- the PMC for a. There's a possibility in the first case that the destination PMC is of an inappropriate type, in which case a temp PMC needs to be created and handed to the destination PMC for ultimate assignment/value snagging.

In the second case, a destination doesn't exist. Since this is known at compile time, though, the compiler can just emit code to create a temp of the appropriate type and we're fine there, or emit a generic Undef if the compiler doesn't know. (Depending on the language it usually will know the appropriate type to emit)

So we've got three possibilities -- destination exists and is OK, destination exists and isn't OK, and destination doesn't exist. In the second and third cases we have to create an intermediate PMC, and in the first case we don't. If, on the other hand, we had the add op unconditionally create PMCs, then the first case creates a temp PMC which is then discarded. More garbage than we need.

Things get more interesting with extended expressions, like:

a = b + c + d + e + f

If add creates a temp, then we have created 4 temp PMCs. (for the b+c, that result + d, that result + e, and that result + f) On the other hand, if we make the destination exist, we need exactly two temporary PMCs. In fact, we need only two temp PMCs no matter how complex the expression is, as long as there are no function or method calls in it. That's a lot less garbage, especially if you're in a loop.

You may be thinking "two temp PMCs? How, exactly?" That's easy. There's a temp needed for the b+c part, and a second for the result + d. The first temp can be reused again to add e in, then the destination to add in f.
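If you'd rather count than take my word for it, here's a minimal sketch in Python. It is a toy model, not Parrot code: the PMC class, add_creating, add_into and the two chain functions are names made up for illustration, and the only thing being modeled is how many PMC objects get allocated.

```python
class PMC:
    """Toy stand-in for a Parrot PMC; counts every allocation."""
    allocations = 0

    def __init__(self, value=0):
        PMC.allocations += 1
        self.value = value


def add_creating(left, right):
    """Style 1: the add op allocates its own result."""
    return PMC(left.value + right.value)


def add_into(dest, left, right):
    """Style 2: the destination must already exist."""
    dest.value = left.value + right.value


def chain_creating(operands):
    """Evaluate operands[0] + operands[1] + ... with allocating adds."""
    acc = operands[0]
    for operand in operands[1:]:
        acc = add_creating(acc, operand)
    return acc


def chain_into(dest, operands):
    """Same sum, written into dest, alternating between at most two temps."""
    if len(operands) == 1:
        dest.value = operands[0].value
        return
    temps = []
    acc = operands[0]
    last = len(operands) - 1
    for i, operand in enumerate(operands[1:], start=1):
        if i == last:
            target = dest              # final add lands in the real destination
        else:
            slot = (i - 1) % 2         # ping-pong between temp 0 and temp 1
            if slot == len(temps):
                temps.append(PMC())    # happens at most twice
            target = temps[slot]
        add_into(target, acc, operand)
        acc = target


def count_allocations():
    """Return (allocating adds, destination-exists adds) for a = b+c+d+e+f."""
    operands = [PMC(n) for n in (1, 2, 3, 4, 5)]   # b, c, d, e, f
    a = PMC()

    PMC.allocations = 0
    created = chain_creating(operands)
    with_creating = PMC.allocations

    PMC.allocations = 0
    chain_into(a, operands)
    with_existing = PMC.allocations

    assert created.value == a.value == 15
    return with_creating, with_existing
```

Calling count_allocations() gives 4 allocations for the first style and 2 for the second, and the second number stays at 2 however long the operand list gets.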

"What about continuations" I hear you cry. (Though, I suppose, not necessarily about this specifically) "They could happen anywhere!" No, they can't. They can only happen explicitly, or inside functions or method calls. Parrot specifically forbids continuations to escape from within vtable functions so there's no way our expression there can be resumed in the middle, which means that the compiler's free to reuse its temps. Consider, for example, the following:

foreach a (@foo) {
    b = b + c * a
}

Where we're iterating through the @foo array. No function or method calls, which means there's no way that continuations could be taken. If there are a thousand elements in @foo, if we have our math ops create temps, we've created two thousand temps. If we have the destination exist, we have to create... two. Even if we're Really Clever and explicitly free up the created temps, that's still two thousand calls to the allocator and two thousand calls to the free routine, instead of two to each.

So there you go. If the destination has to exist it means that, worst case, performance is no worse than if binary ops create their own PMCs, and in some cases significantly (orders of magnitude) better.

Posted by Dan at 05:24 PM

July 06, 2005

WWIT: Universal bytecode

From the very start, we declared that we'd have a mostly-universal bytecode format. That is, assuming you built parrot to use 32 bit integers for opcodes, bytecode you built on one machine would run anywhere. Not necessarily without translation, but parrot would provide that translation automatically.
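To make "translation" a bit more concrete: the most obvious thing that differs from machine to machine is byte order. Here's a minimal sketch in Python, assuming a made-up file layout (one marker byte followed by 32-bit signed opcodes). This is not Parrot's real bytecode file format, and pack_ops and load_ops are names invented for the example.

```python
import struct

# Hypothetical layout: 1 marker byte (b"L" or b"B"), then 32-bit signed opcodes.
_PREFIX = {b"L": "<", b"B": ">"}


def pack_ops(ops, marker):
    """Write opcodes in the byte order named by marker (b"L" or b"B")."""
    return marker + struct.pack(f"{_PREFIX[marker]}{len(ops)}i", *ops)


def load_ops(blob):
    """Read opcodes written on any machine; struct does the byte swapping."""
    marker, body = blob[:1], blob[1:]
    count = len(body) // 4
    return list(struct.unpack(f"{_PREFIX[marker]}{count}i", body))
```

The writer records how it wrote the file and the reader adapts, so whoever built the bytecode never has to care where it ends up running. A real format would also have to deal with things like float representation and word size, which this sketch ignores.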

Why? Simple. Binary distributions and multi-platform shared installs.

Now, before you get your freak on over the lack of source, a traditional perl community hot-button issue, remember that parrot's a multi-language interpreter, which means you might not have the compiler for the language in question. Just because you've got parrot with the perl and python compiler modules doesn't mean you've got the ruby, prolog, C#, BASIC, and Intercal modules installed, so you're kind of out of luck there even if you do have the source. (And the source can be embedded in the bytecode as metadata)

There are also times when binary installs are better, even with complete internal distributions -- you don't have to worry nearly so much that Joe from Accounting will use his "mad programming skillz" to helpfully fix the bugs in the app you're deploying. (You know those bugs -- the ones keeping Joe from trashing the database and destroying everything done since the last backup)

And, of course, combined with a linker, being able to do universal bytecode means you can link your program into one big file with all the bytecode for all the libraries built in and distribute it so people only need a base parrot install and nothing else. (Or you can then run it through the bytecode->executable converter to get a single-file executable) You are, of course, completely responsible for the social implications of that (including the legal bits) but we only do the technical bits, social things are your problem.

The multi-platform shared install is important as well, and how much is something you don't tend to notice until you've had to manage a shared install of some application that's used on multiple operating systems and hardware platforms. (Though this is becoming less common as the various hardware platforms and operating systems die out) That is, you've got a shared app install on some NFS mounted volume somewhere and all the systems on the network use the shared install.

Now, of course for this to work you need to build your main system (which in this case would be parrot) on all the different platforms, which is a pain. It also means that all binaries need to be built on all platforms, which is a really big pain when upgrade time comes if you've got modules that have binaries.

Parrot's NCI system should make modules with a C component less common, but it's still handy to do compilation to bytecode, and universal bytecode means you get to do this once and deploy it portably. This is useful in cases where compilation from source is slow (if, say, you've got a language with a strong optimizer, slow compiler, or one that triggers degenerate behaviour in parrot somewhere) or where the compiler module itself is platform dependent but the output bytecode isn't.

Anyway, the more you can share across platforms without having to do anything at all special, the easier it is to pass things around and make everyone's life easier. That's a good thing, so far as I'm concerned.

Posted by Dan at 02:53 PM

June 18, 2005

WWIT: Calling conventions

If you've looked you might have noticed that Parrot's calling conventions are somewhat... heavyweight. Not particularly bad as these things go (they're actually very similar to the conventions you see on systems with lots of registers such as the Alpha or PPC) but still, heavier than folks used to only stack-based systems are used to.

As a recap for those not intimately familiar with parrot's calling conventions (as they stood a while ago at least -- things may have changed) the first eleven of each type of argument (PMC, String, Integer, and Float) go into registers 5-15 of the appropriate register type. The counts of parameters in each register type go into integer registers 1-4, Int register 0 gets a true/false value noting whether this is a prototyped call or not (meaning that non-PMC parameters are being passed in basically), P0 gets the sub PMC being invoked, P1 holds the return continuation (this can be filled in automatically for some invocation ops), P2 holds the object the method's being invoked on (if this is a method call), P3 holds an array with any extra PMC parameters if there are more than 11, and S0 holds the name of the sub you're calling (since subs may have multiple names).

Seems complex, doesn't it?

Let's think for a moment before we go any further. When calling a function, what do you need to have? Of course, you need the parameters. You need to have a place to return to. There has to be some indication of how many parameters you're passing in. (At least with perl-like languages, where the parameter list is generally variable-length) You need some handle on the thing you're calling into. Per introspection requirements perl imposes, you need to know the name of the function you're calling, since a function may have several names you need to know which name you're using when making the call, and if it's a method call you need the name of the method you're calling so you can look it up. If you're calling a method on an object you need the object. (And you thought this was going to be simple...)

The only required elements for a sub call are the count of PMC parameters, the prototyped indicator (which you would, in this case, set to unprototyped), the sub PMC, and the sub name. The parameters themselves aren't required since you don't actually have to have any. The return continuation can be autogenerated for you if you so choose, so it's not on the list.

So. Sub name, Sub PMC, prototype indicator, and parameter count. Not exactly onerous, and unfortunately required. No way around that. The biggest expense you're going to have is shuffling some pointers and constants around. (And, while I admit I resent burning time, it's hard to get too worked up about four platform natural sized integer moves per sub call, one of which, the sub PMC, can potentially be skipped if you fetch it out of the global store into the right spot in the first place)

The extras are just that -- extras. If you choose to do a prototyped call you need to fill in the counts for the other arg types. If you choose to not take advantage of automatic return continuation creation you need to create one and stick it in the right spot. If you've got way too many parameters, you need to put them into the overflow array. That's it, though.
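Laid out as code, the setup for an unprototyped call looks something like the sketch below. It's Python standing in for what a compiler would emit, and several things in it are assumptions rather than facts about Parrot: the register file size of 32, the particular integer register used for the PMC count (the recap above only says the counts live in I1-I4), the choice to count only the parameters that landed in registers, and the function name itself.

```python
NUM_REGS = 32                  # assumption: 32 registers of each type
FIRST_ARG, LAST_ARG = 5, 15    # argument registers, per the recap
PMC_COUNT_REG = 3              # assumption: which of I1-I4 holds the PMC count


def setup_unprototyped_call(sub, name, pmc_args, invocant=None):
    """Fill a fresh set of register files for an unprototyped call."""
    regs = {kind: [None] * NUM_REGS for kind in "ISPN"}
    max_in_regs = LAST_ARG - FIRST_ARG + 1           # eleven slots
    in_regs, overflow = pmc_args[:max_in_regs], pmc_args[max_in_regs:]

    regs["I"][0] = 0                        # false: not a prototyped call
    regs["I"][PMC_COUNT_REG] = len(in_regs) # assumption: count of in-register params
    regs["P"][0] = sub                      # the sub PMC being invoked
    regs["P"][2] = invocant                 # only meaningful for method calls
    regs["S"][0] = name                     # the name the sub is being called by
    for offset, arg in enumerate(in_regs):
        regs["P"][FIRST_ARG + offset] = arg
    if overflow:
        regs["P"][3] = list(overflow)       # extras go in the overflow array
    # P1 (return continuation) is left empty here, on the assumption that
    # the invoke op fills it in automatically.
    return regs
```

With no parameters at all, that's four register stores: the flag, the count, the sub PMC and the name. Everything else is only paid for when you use it.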

The first thing anyone does when they look at this is want to start chopping things out. The problem is that there's really nothing to cut out. You can't chop out the object for method calls, that's kinda needed. You can't chop out the PMC for the sub being called, since you need a place to go. You can't skip using PMCs for subs for a number of reasons, which warrant their own topic, so I'll put that in a separate WWIT entry. You can skip the parameter count if you have functions with fixed parameter signatures (which we don't) or if you use a container that keeps count for you, which just pushes the cost off somewhere else (and ultimately makes calling more expensive, since you then need to move parameters out of the container and into registers). You could skip the whole prototyped thing, but in that case you either always use parameter counts or lose the ability to have non-PMC parameters. You can't chop the sub name out, since then you can't properly introspect up the stack to find the caller names (as any particular sub PMC could have multiple names). You can't chop out the return continuation since you need a place to return to when you're done. You can't chop out... well, we've run out of things to consider chopping out, and the best we've managed is to potentially change how the actual parameters are passed, but that doesn't make things cheaper or easier, it just shifts the cost and adds a little extra overhead.

Aren't engineering trade-offs fun?

Oh, and you can't even count on the sub you're calling being singly or multiply dispatched, so you have to leave the dispatching entirely up to the sub/method PMC being invoked. The HLL compilers can't emit code that assumes one or the other dispatching method. ('Specially since the method may change from invocation to invocation of a subroutine, as code elsewhere screws around with the definition of a sub)

Posted by Dan at 09:42 PM

June 13, 2005

WWIT: Fast interpretation

Parrot is, as interpreters go, pretty damn fast. Not as fast as it could possibly be, but a lot faster than many interpreters at what it does, and it can be faster still. (Heck, you can enable optimizations when building Parrot and get a good boost -- they're off right now since it's a pain to pull a core file for an optimized binary into a debugger and do anything useful with it) A lot of thought went into some of the low-level design specifically to support fast interpretation.

There are a couple of reasons.

The first, and probably biggest (though ultimately not the most important) is that I thought that building a cross-platform JIT was untenable. That turned out not to be the case, at least partly. Building a framework to allow this isn't as big a deal as I thought. That doesn't mean you get a JIT everywhere, though. (You want to write a Cray JIT?) Getting a functional JIT on a platform and maintaining it is definitely a big undertaking and, like any other all-volunteer project, Parrot's engineering resources are limited and somewhat unreliable. Getting an interpreter working relatively portably was a better allocation of those resources, leaving the JIT an add-on.

The second, and more important by far, reason is one of resource usage. The rest of this entry's about that.

Perl, Python, Ruby, and PHP are used heavily in server-based environments, and if you've ever done that you know they can be... slow. Oh, not all the time, and there are ways around the slowdown, but... slow. Slower by a factor of 200 in some cases. (Though if your code's that much slower it's normally a sign that you're really not playing to your language's strengths, but sometimes you can't do that) Needless to say, you'd prefer not to have that hit -- you want to go faster. I mean, who doesn't?

The normal answer to that is "JIT the code". It's a pretty good answer in a lot of cases. Except... except it's amazingly resource-heavy. First there's the CPU and transient memory cost of JITting the code in the first place. There are things you can do to make JITting cheaper (Parrot does some of those) but still... it's a non-trivial cost. Second, JITting the code turns what could be a shared resource (the bytecode) into a non-shared one. That's a very non-trivial cost. Yes, in a single-user system it makes no difference. In a multi-user system it makes a huge difference.

As a for example, $WORK_PROJECT has a parrot executable that's around 2M of bytecode. Firing it up without the JIT takes 0.06 seconds, and consumes about 10M of memory per-process. (15M total, but 5M is shared) Firing it up with the JIT takes 9 seconds and consumes 100M of memory per-process. (106M total, with 6M shared) On our current production machine we have 234 separate instances of this program running.

Needless to say, there's no way in hell we can possibly use the JIT. The server'd need somewhere around 23G of memory on it just for this one application alone. (Not to mention the memory needed for the other 400 programs being run, as when I last checked there were a bit over 600 total parrot-able programs running) The only way to make this feasible is to interpret the bytecode. (And is another reason for us to have directly executable bytecode, since it means we can mmap in the bytecode files and share that 2M file across those 200+ processes, and fire up Really Fast since it's already in memory and we don't have to bother reading in all 2M of the file anyway, just fault in the bits we need) Note that this isn't exclusively a parrot issue by any means -- any system with on-the-fly JITting (including the JVM and .NET) can suffer from it, though there are things you can do to alleviate it. (Such as caching the JITted code to disk or having special OS support for sharing the code, which general-purpose user-level programs just can't count on being able to do)
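For anyone who wants to check the arithmetic, here it is spelled out in Python, using only the figures quoted above. The function name is made up, and the model is deliberately crude: private memory is paid once per process and the shared part is paid once for the whole machine.

```python
INSTANCES = 234   # processes on the production machine


def total_mb(private_mb, shared_mb, instances=INSTANCES):
    """Private memory is paid per process; shared pages are paid once."""
    return private_mb * instances + shared_mb


interpreted_mb = total_mb(private_mb=10, shared_mb=5)   # no JIT
jitted_mb = total_mb(private_mb=100, shared_mb=6)       # with JIT
```

That works out to 2345M interpreted against 23406M with the JIT, which is where the "somewhere around 23G" comes from.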

(Alternately you could use this as an argument for generating executables. That'd be reasonable too, assuming your program is amenable to compilation that way, which it might not be)

Then you've got the issue of enforcing runtime quotas and other sundry security issues. Some of these need to be done as part of regular interpretation (that is, you need to wedge yourself in between ops) and if you don't interpret quickly, well... you're doomed. It still isn't blazingly fast, as there's a cost involved you can't get around, but you can at least minimize that cost.
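What "wedging yourself in between ops" means is easiest to see in a toy runloop. The sketch below is Python and has nothing to do with Parrot's actual runloops; the two-op instruction set, the run function and the QuotaExceeded exception are all invented for the example.

```python
class QuotaExceeded(Exception):
    """Raised when a program uses up its op budget."""


def run(program, max_ops, num_regs=4):
    """Interpret (opname, *operands) tuples, checking the quota between ops."""
    regs = [0] * num_regs
    executed = 0
    for opname, *args in program:
        if executed >= max_ops:            # the check wedged in between ops
            raise QuotaExceeded(f"stopped after {executed} ops")
        if opname == "set":                # set dest, constant
            dest, value = args
            regs[dest] = value
        elif opname == "add":              # add dest, left, right
            dest, left, right = args
            regs[dest] = regs[left] + regs[right]
        else:
            raise ValueError(f"unknown op {opname!r}")
        executed += 1
    return regs
```

The check costs one comparison per op, and that cost never goes away. The faster the dispatch around it is, the less the total hurts.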

So what does fast interpretation get you? It gets you a performant portable engine, it gets you some significant resource savings, and it allows for fast security. Not a bad thing, overall. And good reasons to not be JIT-blind.

Posted by Dan at 03:46 PM

June 11, 2005

WWIT: All those opcodes

This is one that comes up with some frequency -- why the hell does Parrot have all those opcodes. It's wasteful!

Bullshit. What it is is fast.

Opcode functions have two points. The first is to provide basic functionality. The second is to provide a fast function call interface. We'll take the second point first.

Parrot's got three official ways to call a function, plus one unofficial one. The first is with the basic parrot function call. You fill in the registers with your parameters, set the parameter counts, and invoke the sub PMC. Not slow, but not horribly fast either. People tend to gripe about that, but it is, bluntly, the lowest-overhead general purpose solution that we could get. Perl puts a lot of requirements on function calls, as do the other dynamic languages, and providing that information takes a little work. That's fine, it's not that big a deal. Languages also are under no obligation to respect the calling conventions for any function or subroutine that's not exposed as a parrot-callable function. That is, if you're writing a Java compiler, say, and don't like parrot's overhead then... don't respect our calling conventions. Or, more reasonably, internally use your own, whichever's best, and then provide versions with shim wrappers that do respect the conventions for other languages to call.

The second way to call is as a method. That's got all the overhead of a function call plus the search for the actual thing being called. This is not a problem, that's what you want with method calls -- it's what you asked for, after all. Given the dynamism inherent in perl, python, ruby and friends, there's no way around it.

The unofficial way is with the bsr/ret pair. Only suitable for internal functions, it's still quite fast and damned useful. There's no reason your compilers shouldn't use this internally where needed -- it's handy.

The fourth way is the opcode function. This is the absolute lowest-overhead way to invoke a function. Unfortunately those functions (right now at least) have to be written in C, but that's just fine. The important thing here is that they are very, very fast to invoke, since there's essentially no overhead. No putting things in registers, no setting up calling conventions, nothing -- they're just invoked.

In fact, one thing that was planned was that modules could provide things that looked like functions but were actually exported ops (using the loadable opcode library system). That is, your code did a foo(bar), but the compiler knew that foo was an op and emitted:

foo P16

or wherever the bar happened to be. Moreover, module authors could slowly migrate their code from an HLL, to C code via NCI, to opcode functions. In some cases compilers could actually generate opcode functions from the source, though that does require the compiler in question to be able to generate C code. (But, conveniently, parrot can generate C code from bytecode...) When you think about it, and you really should, there's no difference between an opcode function and a regular HLL function with a compile-time fixed signature (except for the difficulty in generating them from bytecode).
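Here's the idea in miniature, as a Python sketch. None of this is Parrot's real loadable opcode machinery: the Interp class, load_oplib, run_op and the foo op are hypothetical names, and the 32-register file is an assumption. The point is only that an op is a plain function handed the interpreter and its operands, with no calling-convention setup in front of it.

```python
class Interp:
    """Toy interpreter: a PMC register file plus a table of op functions."""

    def __init__(self):
        self.P = [None] * 32   # PMC registers (size is an assumption)
        self.ops = {}

    def load_oplib(self, oplib):
        """Model of loading an opcode library: op name -> plain function."""
        self.ops.update(oplib)

    def run_op(self, name, *operands):
        # No parameter registers to fill, no counts, no return continuation.
        return self.ops[name](self, *operands)


def foo(interp, reg):
    """Hypothetical module-provided op: upper-cases the string in a P register."""
    interp.P[reg] = interp.P[reg].upper()


interp = Interp()
interp.load_oplib({"foo": foo})
interp.P[16] = "bar"
interp.run_op("foo", 16)       # the moral equivalent of: foo P16
```

After that last line P16 holds "BAR", and the only overhead was one table lookup and one function call.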

Just to be real clear, since it's important, opcodes are just library functions with very low call overhead. That's it. Nothing fancier than that. They're not massively special internal anything. They're just functions that are really cheap to call. Cutting down the number of opcode functions is not sensible -- it's foolish. Any library function that could reasonably have a fixed number of arguments and not need the calling conventions (and not need to be overridden) should be an opcode function.

Concentrate on the function part. Not the opcode part.

Posted by Dan at 06:43 PM

June 10, 2005

WWIT: Generating executables

One of the things parrot is supposed to be able to do, and currently does (albeit with on and off breakage) is generate standalone executables for parrot programs. That is, you feed in source or bytecode and you get back something that you can run the linker against to give you a standalone executable, something you can run without having parrot installed anywhere.

Interestingly, this is a somewhat controversial decision. I'll get to that in a minute.

The upsides to doing this are several:

  1. Distribution is easier
  2. No versioning problems
  3. Execution's faster
  4. Fewer resources used in multiuser situations

And of course the downsides:

  1. You can get a lot of big executables with a lot of overlap
  2. Some of the dynamic features (on the fly compilation, for example) are somewhat problematic
  3. Bugfix upgrades don't happen easily

Now, it's very important to keep in mind that generating executables is not a universal solution. That is, there are times when it's the right thing to do, times when it's the wrong thing to do, and times when it's kind of a wash.

Building executables has never been a general purpose solution, and in most cases the correct thing to do is to either run a program from the source, or run it from the compiled bytecode. (and there are plusses and minuses to each of those) However...

The problem with all the 'scripting' languages is packaging and distribution. Not just in the commercial sense, which is what a lot of people think of (and which causes a lot of the knee-jerk reactions against it, I think), but in the general sense. If I have a program I want to distribute to multiple users, it's a big pain to make sure that everything my program needs is available, especially if I've followed reasonably good programming practice and actually split my code up into different files. In that case I've all the source to my program that needs to be distributed, and a list of modules that the destination system has to have installed, along with their prerequisites, along with possibly the correct version (or any version) of the driving language.

This isn't just a problem with people looking to distribute programs written in perl commercially, or looking to distribute them in a freeware/shareware setting. It happens institutionally, a lot. You may have ten, or a hundred, or ten thousand desktops that you need to distribute a program out to. The logistics of making sure everything is correct on all those desktops is a massive pain in the ass, made even more complex by the possibility that you've got multiple conflicting requirements across multiple apps. (That is, you've got one app that must have perl 5.6.1, another that has to have perl 5.8.x but not 5.8.0, a third that requires one particular version of GD, and a fourth that doesn't care but has been tested and certified with one particular set of modules and you can't, either by corporate policy or industry regulation, use anything else)

That's when the whole "just install [perl|python|ruby] and the requisite modules" scheme really sucks. A lot. Pushing out large distributions with lots of files is a big pain, and pushing out several copies is even worse. Then there's the issue of upgrading all those desktops without actually breaking things. Ick.

This is where building standalone executables is a big win. Yeah, the resulting file may be 15M, but it's entirely self-contained. No worries that upgrading some random module will break things, no need to push out distributions with a half-zillion files, and if you want to hand your app to Aunt Tillie (or random Windows or Mac users) you've got just a single file. No muss, no fuss, no worries.

Yes, it does mean that end users can't upgrade individual modules to get bugfixes. Yes, it does mean the executables are big. Yes, it does mean there may be licensing issues. Yes, it does mean that pulling the source out may be problematic. Those are all reasons it's not a good universal solution, not a reason to not provide the facility for times it is. (That people have felt the need to roll their own distribution mechanisms to address this problem in the current incarnations of the languages is an indication that it is a real problem that needs addressing)

Like many other problems that there were multiple implementations for (like, say, events) Parrot provides a solution as part of the base system so folks can use their time reinventing other wheels more productively.

Posted by Dan at 12:27 PM