June 16, 2003

Interpreting Python Bytecode

Well, I spent the weekend digging into some of the pending Parrot work, of which there is far too much that I bottleneck. (Important project management tip for folks doing project management--identify those places that one person is a bottleneck for, and dive for those areas as quickly as you can. Places that anyone, or other people, can work on can't get done until you do. And Delegate, dammit!) Anyway, one of those places that's pending is the python bytecode stuff. Yeah, I know, it's not as high a priority as, say, objects or exceptions, but given that Larry's not specified either yet it makes sense to tackle bits that don't block on him. Besides, I really need a handle on Python's object system before I finish Parrot's, since I don't want to be building two different object systems if I don't have to.

The reason for this, of course, is the up and coming Great Python Pie-throwing Contest, coming soon to an OSCON 2004 near you! Or near Portland, Oregon, at least, since I have no idea where you are.

So, after an hour or so of digging through with google and the python.org search facilities, I located some reference docs for python's bytecode. (Link so I can find them again later) Oddly, in the 2.1 maint docs, but not in the 2.2 docs anywhere. (Dunno whether it was an oversight or on purpose) The docs are, as you'd generally expect, fairly terse, as this stuff tends to be, since it hangs on knowledge of a whole lot of other stuff--if you know the other stuff you only need a terse description in the docs, and if you don't then no amount of documentation is likely to help.

Reading through assembly documentation (which bytecode is, really) is always an interesting undertaking. All assemblies have quirks, oddities, and other assorted things that make you go "Hmmm...." built into them, as you find things that aren't done the way you might think. They're often a result of other limits (many older CPU assembly languages were downright odd because of the need to save transistors), the personality of the designer, or hasty decisions, or dead-ends accumulated over time, and finding those is fun.

Python, of course, is no different. The bytecode has a bunch of standard stuff--stack rotates, math, and binary operations--but it's got some other oddities. The four-item stack rotate, for example, is kind of interesting, and a bit unusual. (Generally you have top-two swap and top three rotate, and if more's needed there'll be a general-purpose "rotate N" op) PRINT_NEWLINE, which prints a newline, is also kind of interesting, as having an op dedicated to printing just a newline's not common. (This isn't normally a time-critical operation, but it does have the advantage of being able to emit a platform-native newline, without having to translate a linefeed character or \n escape sequence, so it's a niftyish idea)

Then you get to the places where it became obvious that a limit was hit. The python bytecode apparently mandates 16 bit parameters. That's OK for most things, but the EXTENDED_ARG op, which holds bits 16-31 of the argument for the following op (making it a 32 bit argument split into two pieces) shows those places where it wasn't quite enough. (I'm wondering whether I can safely merge the two args together when I generate parrot bytecode--I can see Really Tricky Code branching to the op following the EXTENDED_ARG op in some cases, either to have a variety of top-16 bits or to take it as either a 16 or 32 bit op, but I'd bet that's not going to happen, as the Python folks tend to violently disagree with such things, which works for me)

You can also see how deeply various data types are welded into the interpreter. The tuple (Basically a fixed-length list of items) and the hash (sorry, dictionary) is pretty fundamental to python, while objects are... well, objects are an afterthought. At least that's the way they look from under the hood. The interpreter does support them, but there's barely more than what Perl has under here. I'm not a Python Guy, so I can't speak for the language support for objects, which may or may not be more significant. (Whether it is depends on how much the person explaining them likes Python) You certainly don't need anything fancy under the hood to support a fancy model--heck, it's all bits fed to the CPU when you get to the end of it--so I don't much care. A simple model actually makes it easier on me.

And, of course, there's the hackish stuff. CMP_OP springs to mind--this is an opcode that takes the name of the comparison operation as a parameter, then looks that comparison name up in a table somewhere and does it. That's... interesting. On the face it seems rather excessively indirect, but as I don't have the python sources at hand I can't say for sure that it's silly. I suppose there could be an extendable comparison table facility to allow the addition of new operators, but that doesn't seem too Pythonish. There's also the LOAD_CLOSURE op which takes a single integer parameter that indexes into two name tables simultaneously--if the offset is within the first table (the co_cellvars table) it's that name, and if it isn't it's an offset into the second (co_freevars) name table (after subtracting the number of items in the first name table) and it looks like anything else that references the cell and free variable store does this too. I suppose if nothing else it'll keep you from adding in new cells to things at runtime...

Anyway, looking at this all as a whole, it should be pretty trivial to get parrot to run bython bytecode, at least once the bytecode's preprocessed to turn it into parrot bytecode. At the moment the big issue is how much work should be done on all of it. Parrot, of course, is a register machine, while Python is a stack machine. It'd be dead-simple to have a bunch of python-specific ops that work just on the stack, but that's not the way to speed, honestly. It's even possible (though not particularly likely) that the python interpreter would actually be faster in that case, though honestly I doubt it. Might do it that way anyway, just to see, though. Handling the quirky bits shouldn't be a problem at all.

The bigger question is one of integration. The big reason I'm doing this is to get a chance to hurl a pie at Guido (well, probably Guido) which I've found to be rather motivating, but I'd rather this not end up as a one-off hack. (Ignoring the stack analysis code that'll result from it, though that's hardly trivial) What I want is for this to develop into a real, integrated python bytecode translator and interpreter, so Parrot can execute Python bytecode alongside other Parrot languages. Building a Python compiler to target Parrot would be the best way, and I suppose someone'll do that, but there's a limit to the motivating power of pie. Besides, there's a fair amount of bytecode around, so it's certainly a sensible thing to do.

The timeline, FWIW, is that at the Python Lightning talks at OSCON 2003 Guido and I will formally announce the details of the contest and the rules. (Which still need to be worked out, though we've a rough idea of what we're doing) Sometime by late December 2003 the bytecode for the test program will be generated, and at OSCON 2004 we'll have the actual face-off, followed by pie at 10 paces. (Mmmmm, pie!) While I expect it won't be a hugely exciting thing, I'm sure the video of the pie will haunt someone for years to come. Which should be fun.

(Yeah, yeah, I know... "Them's fightin' words!" Well, duh! Fight's already started :)

Posted by Dan at June 16, 2003 02:32 PM | TrackBack (0)
Comments

You can find the bytecode docs for 2.2 at http://www.python.org/doc/current/lib/bytecodes.html . 2.3 is also there; just look in the docs for the dis module.

Posted by: Brett at June 16, 2003 11:25 PM

Examples of some subtle differences between Perl and Python, w.r.t. object orientation:

Perl:
join('',foo)
('1'+'2') == (3)

Python:
''.join(foo)
('1'+'2') == ('12')

Commentary:

In Python, join is a method on str. In Perl, join is simply a function. This is not a fundamental difference, it just points out that Python happens to use object orientation in the builtin language library more often than Perl does.

If Perl, '+' is a numeric function, and will coerce its arguments appropriately. In Python, '+' is a convenience for calling a method named __add__. Int and Str implement quite different semantics for this method.

Posted by: Sam Ruby at June 17, 2003 07:13 AM

I'm not particularly worried about the string vs number issues with + as that's something for the python class vtables to handle. (Though these days I find I really dislike overloading + to mean both addition and concatenation, but that's just me. The lack of symmetry for - also grates) There'll definitely be some issues getting the class libraries over, but this'll be the least of them, I expect.

It is interesting to see what the syntax does that the underlying engine doesn't really enforce, as well as what just isn't supported, but a lot of that's really academic--I just need to make sure the low-level semantics of the translated code matches. :)

Posted by: Dan at June 17, 2003 01:45 PM

One of the very few things I like about BASIC is its inclusion of a separate string-concatenation operator, &. As with Perl, it is free to automatically coerce its arguments to strings because its unambiguous. On the other hand, in Microsoft VB they also overload +, which seems like a good way to sow confusion.

Orwell and presumably Haskell have a separate operator for concatenation, ++, which works between sequences (lists) and strings (which are of course merely sequences of characters):

"1" ++ "2" returns "12"
"1" + "2" and 1 ++ 2 are both forbidden.

From an elegance point of view, I prefer having a separate concatenation operator, for reasons already mentioned. On the other hand, in practical programming, the Python approach works fine and does not cause too many surprises.

Posted by: Damian Cugley at June 20, 2003 06:10 AM

I just took ten paces. _I_ would miss at that distance. (Its a pie, not a ball!) I suggest lots of practice, a shorter distance, a carefully weighted pie, or a novel technique (underhand? like a frisbee?). Don't take chances!

Posted by: RJ at June 22, 2003 02:14 AM
And, of course, there's the hackish stuff. CMP_OP springs to mind--this is an opcode that takes the name of the comparison operation as a parameter, then looks that comparison name up in a table somewhere and does it.

If I remember my Concepts of Programming Languages class, that is how PL0 does it. Not that I'm suggesting this is a good thing now, but it was interesting having to write a PL0 compiler and interpreter for that class. There was only something like seven opcodes, if I remember. Very odd language.

Posted by: michael at June 22, 2003 03:28 PM

The first version of a Perl module I wrote for reading Python bytecode is available here:

http://www.gregorpurdy.com/gregor/sw/Python-Bytecode-SAX/

Posted by: Python::Bytecode::SAX at June 23, 2003 01:17 PM

Oh, I fully expect to miss at 10 paces. I didn't realize how far it was when I made the challenge, but now... Well, I'll give it a good try at least. If nothing else, missing won't be bad--one of us (well, OK, Guido :) will have to deal with losing, getting pied on top of it is just adding insult to injury.

Besides, it's always possible that someone eminently more pie-able will be closer. You never know, and who am I to pass up a good opportunity for full-contract pastry?

Posted by: Dan at June 23, 2003 02:11 PM

The lack of symmetry for - also grates

You just need to go back and write more DCL...

Posted by: Brad at June 26, 2003 05:24 PM