February 15, 2006

Functions and subs in a threaded world

Here's an interesting question I've been pondering.

What if function and subroutine calls all spawned off their own threads and ran independently?

That is, what if the code:

foo = bar(1, 2, 3)
baz = xyzzy(3, 2, 1)

actually spawned off a thread to run the bar function, and a thread to run the xyzzy function? And both ran independently until they were complete?

There's a sequencing problem, of course. If the functions take a while, then this code:

a = foo()
b = bar()
c = baz(a,b)

could be rather problematic if we call baz before foo or bar have finished. On the other hand, if we know this can happen then we can design the VM or language to have tentative variables -- that is, variables that will get a value at some point, but don't yet have one, and any attempt to read from them pauses a thread until the value arrives. We'd get free sequencing that way, which is interesting and useful.
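This "tentative variable" idea is essentially what the literature calls a future or promise: a placeholder that blocks readers until the value shows up. A quick Python sketch of the same sequencing, using `concurrent.futures` as a stand-in (the `foo`/`bar`/`baz` bodies here are made up for illustration, not anything Tornado defines):

```python
from concurrent.futures import ThreadPoolExecutor

def foo():
    return 2

def bar():
    return 3

def baz(a, b):
    return a + b

with ThreadPoolExecutor() as pool:
    # Each call returns immediately with a future -- a tentative value.
    a = pool.submit(foo)
    b = pool.submit(bar)
    # Reading a tentative value (.result()) pauses this thread until the
    # value arrives, so the sequencing falls out for free.
    c = baz(a.result(), b.result())

print(c)  # 5
```

Note that `baz` never has to know its arguments came from threads; the blocking read is the whole synchronization story.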

A second problem is one of global data. If any function call could spawn a new thread, and we could have a lot of threads running at once, we can't guarantee that there's any sort of synchronized access to global information -- we've got this big pot of asynchrony, after all. The easy answer there is just no global data, but that's a touch restrictive, even for me in these limited contexts. (Just barely, though)

Strongly discouraging mutable global data is a help here. If the global data is essentially static, then sequencing is irrelevant -- doesn't matter what order you read it in, the stuff's not going to change. It is very tempting to design Tornado's global data system such that you can read global data, or add a new piece of global data, but you can't change global data once it's been created. I really, really want to do this, since it'd make life ever so much easier if the only global 'changeable' things were sequence points, and I think there's a big difference between a global thing you coordinate activity on (basically a kind of POSIX condition variable) and a generic shareable wad of data.
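One way to enforce that read-or-add-but-never-change rule is a write-once global table; a minimal Python sketch (the `WriteOnceGlobals` name and API are made up for illustration, not Tornado's actual design):

```python
import threading

class WriteOnceGlobals:
    """Global store: you can add a key or read one, never change one."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, name, value):
        with self._lock:
            if name in self._data:
                # Rebinding an existing global is the one forbidden operation.
                raise KeyError("global %r already set" % name)
            self._data[name] = value

    def get(self, name):
        with self._lock:
            return self._data[name]

g = WriteOnceGlobals()
g.add("block_size", 4096)
print(g.get("block_size"))  # 4096
```

Since values can never be rebound, readers that arrive after creation can't observe a change, which is exactly why sequencing stops mattering.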

There's a good reason for doing this thread-per-function-call thing, at least in special circumstances. Remember how Tornado's supposed to handle great wads of data in addition to heavy threading? Well, ponder this (and ignore the horrid language syntax):

Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar
Declare Function Scalar xyzzy(Scalar)
bar = xyzzy(foo):iterate

What I'm trying for, awkwardly, is to note we've got a function xyzzy which takes a scalar value and returns a scalar value. We've called it with a vector, and stuffed the return into a vector. That'd look wrong except for that :iterate thing there. That's supposed to indicate that, since the parameter's a vector, we actually call the function once for each element of the vector and aggregate the results. If we spawn a thread per function call... well, that could be interesting there. Yeah, on a single-core, single-CPU system it's probably not much of a win, but... in a system with more than one CPU core handy, that could be really interesting. (Like if you've an SMP system, or one of the dual-core systems that are out, or better yet a multi-chip multi-core system -- one of those dual-CPU quad-core desktop systems people are making noises about for 2007, say)
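As a sketch of what :iterate would do under the hood, here's the same shape in Python -- one (potential) thread per element, results aggregated back into a vector. The thread pool and the xyzzy body are stand-ins for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def xyzzy(x):          # Scalar -> Scalar
    return x * 10

foo = [1, 2, 3]
with ThreadPoolExecutor() as pool:
    # ":iterate" -- call xyzzy once per element, each call free to run
    # on its own thread, and gather the results in order.
    bar = list(pool.map(xyzzy, foo))

print(bar)  # [10, 20, 30]
```

The caller never sees the threads; it just hands in a vector and gets a vector back, which is the whole point.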

This is where things get interesting, definitely. There's more fun to be had, too, if you want fancy -- like, for example:

Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar = [1, 2, 3]
Declare Variable Vector baz
Declare Variable Vector plugh
Declare Function Scalar xyzzy(Scalar, Scalar)
baz = xyzzy(foo, bar):iterate
plugh = xyzzy(foo, bar):permute

The first call to xyzzy would iterate through the two vectors in lockstep -- the first call would get (1,1), the second (2,2), the third (3,3). The second call, the one marked permute, would instead call xyzzy nine times, once for each pairing of elements from the two vectors -- the full cross product. Yeah, permutation explodes out the thread set pretty badly with large vectors, but of course these are all virtual threads, since we don't have to actually spawn them, just make it look like we did.
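Here's a Python sketch of the difference between the two modes (again, the pool and xyzzy's body are illustrative assumptions, not Tornado's actual machinery):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def xyzzy(a, b):       # (Scalar, Scalar) -> Scalar
    return a * 10 + b

foo = [1, 2, 3]
bar = [1, 2, 3]
with ThreadPoolExecutor() as pool:
    # :iterate -- lockstep pairs: (1,1), (2,2), (3,3)
    baz = list(pool.map(xyzzy, foo, bar))
    # :permute -- all nine pairings (the cross product)
    futures = [pool.submit(xyzzy, a, b) for a, b in product(foo, bar)]
    plugh = [f.result() for f in futures]

print(baz)    # [11, 22, 33]
print(plugh)  # [11, 12, 13, 21, 22, 23, 31, 32, 33]
```

Three calls versus nine here, but with 1000-element vectors it's 1000 versus a million -- which is why those threads had better stay virtual.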

Permutation does make a good case for going up to full-blown tensors rather than just vectors, since that permutation example could reasonably return a 3x3 matrix rather than a 9-element vector, but for right now I'm still dodging full tensors.
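For the 3x3 case, regrouping the flat nine-element :permute result into a matrix is just index arithmetic; a hypothetical helper (nothing Tornado-specific, just a sketch of the reshaping):

```python
def permute_to_matrix(flat, rows, cols):
    # Regroup a flat :permute result into a rows x cols matrix:
    # row r holds the results for the r-th element of the first vector.
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

flat = [11, 12, 13, 21, 22, 23, 31, 32, 33]
matrix = permute_to_matrix(flat, 3, 3)
print(matrix)  # [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
```

Which is to say the tensor shape is recoverable after the fact, so dodging full tensors for now doesn't lose any information.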

Now, while this is an Interesting Thing with plain integers as the scalars inside our vectors, it gets actually exciting if the scalars can be large blobs of data. If, for example, you had a crypto system written this way, each scalar in your vector could be a block of data to be encrypted or decrypted. Or if you had a rendering system written in it, each scalar could be an image map for an object to be rendered, or a chunk of text to be searched for some particular substring (or that has some regex done across it, or something). Again, on a uniprocessor single-core system, not so big a deal. On a system with multiple cores... a bigger deal.

Like I said, I wouldn't do all my general-purpose programming in this environment, since really static global data is a pain for a lot of larger applications. As a core for a subset of operations, though...

Posted by Dan at 12:21 PM | Comments (14) | TrackBack

February 14, 2006

Auntie Em, Auntie Em!

And yeah, I understand the multiple layers of humor in the title.

The VM stuff I've been thinking about's now started to coalesce, and as such I've started doing serious design work and coding. (Yes, I know, design then code, but boilerplate and general infrastructure's pretty straightforward) The engine's reasonably special-purpose, but that's fine, I knew that going in.

It's now got a name too -- the engine'll be called Tornado, since we're taking large chunks of data, whirling them around in mostly similar ways, and later flinging them out. Causing a potentially large amount of havoc in the process, of course. I do have a direct application for the engine right now, so there's reason to bang this out, and when it's ready it'll be available both as a standalone embeddable library and as a perl module. If I'm feeling enthusiastic I may make a ruby library too, but we'll see how that goes.

This engine will share a few features with Parrot, though honestly not that many. Some of the techniques in the parrot build are really useful (like the separate files with a specialized mini-language for opcode functions) and I'll use those. Architecturally there are some significant differences in the problem space, and since Tornado's space is much smaller and only partially overlapping with parrot, some of the compromises that went into parrot I just don't have to make. That's kinda cool, actually. Tornado programs will also be smaller on the whole (Dunno about you, but I sure don't want to be writing whole apps in an environment geared very much towards massively threaded processing of mostly vector data) which means we can make some cheating tradeoffs for speed and safety too.

More details'll go up soon as I hash out some of the still-fuzzy bits, and the subversion repository'll likely get opened up for anonymous checkout if I can't get a timely CPAN module release.

Posted by Dan at 04:49 PM | Comments (2) | TrackBack

February 06, 2006

The depressingly recurrent rant

This is one of those things you'd think that nobody'd have to rant about, certainly not multiple times, but alas it is.

My ISP does what I assume is some minimal amount of network security stuff. I'm not sure how much in general as I'm in their static-IP pool, so I get to do things that other folks may not (like run mail and webservers) but they do at least some, and that's nice.

As part of this they apparently occasionally run port scans. This is not a big deal -- for most people it means that they'll get a little thumping, and if they're doing something they ought not or, more likely, are infected with some piece of listening malware they'll be found out. That's fine. My firewall machine's set up right and I don't have this sort of crap running.

And for added security, I've got snort running, set up to automatically black-hole hosts that try Evil Things. Like, say, port scans. No big, right? Means I occasionally black-hole my ISP's scanning machine.

My ISP scans from their primary DNS host.

Yeah, that's right: when they launch a scan I lose either my primary or my backup DNS server, which tends to screw all sorts of things up. It only bites the server itself, though -- the PPPoE software likes to kill my DNS settings and substitute its own in resolv.conf, while all the house machines default to the DNS server running on my server box. So it's only the clients on the server box itself that tend to get hosed, stuck using the now-inaccessible DNS host.

Jeez, people, do not run port scans on a machine that does other actual useful things. It's stupid.

Posted by Dan at 08:37 AM | Comments (0) | TrackBack