February 15, 2006

Functions and subs in a threaded world

Here's an interesting question I've been pondering.

What if function and subroutine calls all spawned off their own threads and ran independently?

That is, what if the code:

foo = bar(1, 2, 3)
baz = xyzzy(3, 2, 1)

actually spawned off a thread to run the bar function, and a thread to run the xyzzy function? And both ran independently until they were complete?

There's a sequencing problem, of course. If the functions take a while, then this code:

a = foo()
b = bar()
c = baz(a,b)

could be rather problematic if we call baz before foo or bar have finished. On the other hand, if we know this can happen then we can design the VM or language to have tentative variables -- that is, variables that will get a value at some point, but don't yet have one, and any attempt to read from them pauses a thread until the value arrives. We'd get free sequencing that way, which is interesting and useful.
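
Here's a rough sketch of the idea in Python, with concurrent.futures standing in for the tentative-variable machinery (foo, bar, and baz are just placeholders):

from concurrent.futures import ThreadPoolExecutor

def foo(): return 1            # placeholder long-running functions
def bar(): return 2
def baz(a, b): return a + b

pool = ThreadPoolExecutor()

# Each call returns immediately with a tentative value (a Future).
a = pool.submit(foo)
b = pool.submit(bar)

# Reading a tentative value blocks until it arrives, so the call
# to baz is automatically sequenced after foo and bar finish.
c = baz(a.result(), b.result())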

A second problem is one of global data. If any function call could spawn a new thread, and we could have a lot of threads running at once, we can't guarantee that there's any sort of synchronized access to global information -- we've got this big pot of asynchrony, after all. The easy answer there is just no global data, but that's a touch restrictive, even for me in these limited contexts. (Just barely, though)

Strongly discouraging mutable global data is a help here. If the global data is essentially static, then sequencing is irrelevant -- it doesn't matter what order you read it in, since the stuff's not going to change. It is very tempting to design Tornado's global data system such that you can read global data, or add a new piece of global data, but you can't change global data once it's been created. I really, really want to do this, since it'd make life ever so much easier if the only global 'changeable' things were sequence points, and I think there's a big difference between a global thing you coordinate activity on (basically a kind of POSIX condition variable) and a generic shareable wad of data.
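
A minimal sketch of that write-once global store, in Python (the class and method names are mine, not anything Tornado actually has):

import threading

class WriteOnceGlobals:
    # Globals can be read or added, but never changed once created.
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, name, value):
        with self._lock:
            if name in self._data:
                raise KeyError("global '%s' already exists" % name)
            self._data[name] = value

    def read(self, name):
        # Once a value exists it can never change, so reads don't
        # need any coordination with other threads.
        return self._data[name]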

There's a good reason for doing this thread-per-function-call thing, at least in special circumstances. Remember how Tornado's supposed to handle great wads of data in addition to heavy threading? Well, ponder this (ignore the horrid language syntax):

Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar
Declare Function Scalar xyzzy(Scalar)
bar = xyzzy(foo):iterate

What I'm trying for, awkwardly, is to note we've got a function xyzzy which takes a scalar value and returns a scalar value. We've called it with a vector, and stuffed the return into a vector. That'd look wrong except for that :iterate thing there. That's supposed to indicate that, since the parameter's a vector, we actually call the function once for each element of the vector and aggregate the results. If we spawn a thread per function call... well, that could be interesting there. Yeah, in a single-core, single-CPU system it's probably not much of a win, but... in a system with more than one CPU core handy, that could be really interesting. (Like if you've an SMP system, or one of the dual-core systems that are out, or better yet a multi-chip multi-core system -- one of those dual-CPU quad-core desktop systems people are making noises about for 2007, say)
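
In Python terms, that :iterate call might look something like this -- a sketch only, with a thread pool standing in for the thread-per-element spawning:

from concurrent.futures import ThreadPoolExecutor

def xyzzy(x):          # scalar in, scalar out
    return x * 10

foo = [1, 2, 3]

# bar = xyzzy(foo):iterate -- one (logical) thread per element,
# with the results gathered back into a vector, in order.
with ThreadPoolExecutor() as pool:
    bar = list(pool.map(xyzzy, foo))     # [10, 20, 30]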

This is where things get interesting, definitely. There's more fun to be had, too, if you want to get fancy -- like, for example:

Declare Variable Vector foo = [1, 2, 3]
Declare Variable Vector bar = [1, 2, 3]
Declare Variable Vector baz
Declare Variable Vector plugh
Declare Function Scalar xyzzy(Scalar, Scalar)
baz = xyzzy(foo, bar):iterate
plugh = xyzzy(foo, bar):permute

The first call to xyzzy would iterate through the two vectors -- first call would have (1,1), the second (2,2), the third (3,3). The second call, the one marked permute, would instead call xyzzy nine times, with all the different permutations of the elements of the two vectors. Yeah, permutation explodes out the thread set pretty badly with large vectors, but of course these are all virtual threads, since we don't have to actually spawn them, just make it look like we did.
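
The difference between the two calls, sketched in Python (zip for :iterate, itertools.product for :permute):

from itertools import product

def xyzzy(a, b):
    return a * b

foo = [1, 2, 3]
bar = [1, 2, 3]

# baz = xyzzy(foo, bar):iterate -- pairwise, three calls
baz = [xyzzy(a, b) for a, b in zip(foo, bar)]
# baz == [1, 4, 9]

# plugh = xyzzy(foo, bar):permute -- every pairing, nine calls
plugh = [xyzzy(a, b) for a, b in product(foo, bar)]
# plugh == [1, 2, 3, 2, 4, 6, 3, 6, 9]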

Permutation does make a good case for going up to full-blown tensors rather than just vectors, since that permutation example could reasonably return a 3x3 matrix rather than a 9-element vector, but for right now I'm still dodging full tensors.

Now, while this is an Interesting Thing with plain integers as the scalars inside our vectors, it gets actually exciting if the scalars can be large blobs of data. If, for example, you had a crypto system written this way, each scalar in your vector could be a block of data to be encrypted or decrypted. Or if you had a rendering system written in it, each scalar could be an image map for an object to be rendered, or a chunk of text to be searched for some particular substring (or that has some regex done across it, or something). Again, on a uniprocessor single-core system, not so big a deal. On a system with multiple cores... a bigger deal.

Like I said, I wouldn't do all my general-purpose programming in this environment, since really static global data is a pain for a lot of larger applications. As a core for a subset of operations, though...

Posted by Dan at February 15, 2006 12:21 PM
Comments

Once again, too interesting for me to not comment.

Why not get rid of scalar variables altogether and instead treat everything as a vector? For example, the following are semantically the same; they both assign a one-element vector to bar:

Declare Variable bar = [1]
Declare Variable bar = 1

Also

Declare Variable bar = [1, 2, 3, 4]
Declare Variable bar = [1, 2] ++ 3 ++ 4

I'm using "++" here as a vector concat operator.

So basically, the distinction between vector and scalar is dropped completely; all scalars are really just single-element vectors. I think that can simplify the syntax and language model considerably, but probably at the expense of being harder to optimize.

Posted by: Damien Katz at February 15, 2006 01:34 PM

Have you looked at the Occam language?

It allows you to launch two pieces of code in parallel "threads" very simply, e.g.:
PAR
doFoo
doBaz

Communication between threads happens with channels, which are a basic data type in Occam.
I guess where you use these vectors, you could simply have one producer write the elements of the vector into a channel, and some consumer read them in from the channel.
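
Roughly, in Python, with queue.Queue standing in for an Occam channel (the end-of-stream sentinel is just one way to do it):

import threading, queue

channel = queue.Queue()
DONE = object()        # sentinel marking the end of the stream

def producer():
    for element in [1, 2, 3]:     # the "vector"
        channel.put(element)
    channel.put(DONE)

def consumer():
    while True:
        item = channel.get()
        if item is DONE:
            break
        print("got", item)

threading.Thread(target=producer).start()
consumer()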

Posted by: murphee at February 15, 2006 03:43 PM

There are some good reasons to have scalars and vectors as separate things. The individual items will still be typed -- integer, float, bigint, string, binary blob -- and as such need to have some measure of independent identity. (I think. Probably, at least)

Also, I don't necessarily want to stop at vectors. As someone pointed out earlier, going the full tensor route wouldn't be a bad thing, even if I'm going to skip it for now for ease of implementation.

There's a difference in behaviour between scalars and single-element vectors as well. In the iterate case above, with multiple vectors of different lengths, a scalar would get duplicated for each call while a one-element vector would get padded with something. (This case isn't illustrated, but it should've been)
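
To make the distinction concrete, in rough Python terms (the padding value here is a guess at what 'something' would be):

foo = [1, 2, 3]

# Iterating with a scalar: the scalar is duplicated for each call.
scalar = 9
calls_scalar = list(zip(foo, [scalar] * len(foo)))
# (1, 9), (2, 9), (3, 9)

# Iterating with a one-element vector: it gets padded out instead.
short = [9]
padded = short + [None] * (len(foo) - len(short))
calls_padded = list(zip(foo, padded))
# (1, 9), (2, None), (3, None)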

Still, it's an interesting idea. Not going to do it, but it does merit some consideration.

Posted by: Dan at February 15, 2006 05:10 PM

I know about Occam and took a look at it earlier. It doesn't quite do what I want, and I remember the syntax showing its age. (Though I'm far more interested in the semantics, of course. Syntax is mutable, and there's no reason not to have multiple languages) Occam's also about flow of data between processing nodes while those nodes still run -- basically a stream thing. I'm looking more at a message passing system or a fire-and-forget (mostly) system.

Occam's definitely interesting, but Erlang's a bit more interesting, at least for what I'm looking to do.

Posted by: Dan at February 15, 2006 05:41 PM

The second call, the one marked permute, would instead call xyzzy six times

Typo? Should the "six" be "nine"? Also, if it does what I think it does, then it's a product; permutations are about reordering a set. Especially if you're going to expand to tensors some day.

Posted by: Fergal Daly at February 15, 2006 05:52 PM

Yeah, the six was a typo, it should be nine. I was using permute in the sense I'm used to -- all the possible combinations of the things in the two vectors. It does mean looking at the vectors as sets rather than as one-dimensional tensors. Hrm.

I'm gonna end up confusing everyone (including myself) mixing my metaphors if I'm not careful.

Posted by: Dan at February 15, 2006 06:08 PM

The "all variables are vectors" idea comes from my own work on Fabric ( http://damienkatz.net/2005/12/a_brief_introdu.html ), which is based on Notes Formula Language (I wrote the engine that's currently in Lotus Notes). The language is list based, everything is a list, which is mostly the same as a vector and most infix operators work pairwise on the lists (same as your iterate). The way it works is Fabric is when doing pairwise operations if one list is shorter than the other then the last element of the shorter list is reused for to complete the operation. It works pretty well in practice. So for example, following your syntax:

Declare Variable foo = [1, 2, 3]
Declare Variable bar = [1, 2, 3]
Declare Variable baz = [1, 2]
Declare Variable res
Declare Function xyzzy(Var, Var)

res = foo + bar
// same as:
// res = [1, 2, 3] + [1, 2, 3]

res = foo + baz
// same as:
// res = [1, 2, 3] + [1, 2, 2]

res = foo + 1
// same as:
// res = [1, 2, 3] + [1, 1, 1]

xyzzy(foo, bar)
// same as:
// xyzzy([1,2,3], [1,2,3])

xyzzy(foo, bar):iterate
// same as:
// xyzzy(1, 1)
// xyzzy(2, 2)
// xyzzy(3, 3)

xyzzy(foo, baz):iterate
// same as:
// xyzzy(1, 1)
// xyzzy(2, 2)
// xyzzy(3, 2)

xyzzy(foo, 1):iterate
// same as:
// xyzzy(1, 1)
// xyzzy(2, 1)
// xyzzy(3, 1)

The permuted cases are kind of obvious. Anyway, just some ideas to get you thinking; whether or not this fits with your vision is another matter.

Posted by: Damien Katz at February 15, 2006 09:07 PM

What you're describing looks like futures or promises (with higher-order functions for the iterate/permute example). One problem you might find is how to restrain resource consumption. Naïvely letting each thread have an equal timeshare can have surprising effects, especially with recursive functions. Assigning a certain weight to each thread *and its children threads* is probably one of the easiest solutions.
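
One naive way to sketch that restraint in Python -- a fixed pool of thread tokens, falling back to inline execution when they run out, as a crude stand-in for the per-thread weights:

import threading
from concurrent.futures import Future, ThreadPoolExecutor

budget = threading.BoundedSemaphore(8)   # total thread budget
pool = ThreadPoolExecutor()

def spawn(fn, *args):
    # Spawn a real thread only while there's budget left; otherwise
    # run inline, so recursion can't explode the thread count.
    if budget.acquire(blocking=False):
        def run():
            try:
                return fn(*args)
            finally:
                budget.release()
        return pool.submit(run)
    f = Future()
    f.set_result(fn(*args))
    return f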

Posted by: pkhuong at February 15, 2006 10:58 PM

Tentative variables are dataflow programming. Look at Prograph/LabVIEW.

Also, can you post a blog entry/reply here on how you learnt Erlang? I am interested in the sources you have tried and, more importantly, the programs you have written to learn the language.

Posted by: mansu at February 16, 2006 04:56 AM

This is an area of language design that has a large existing body of research. Generally, the approach is called 'lenient' evaluation (as in, between strict and lazy). Lots of PL researchers consider it the best choice for implicitly parallel languages.

With lenient evaluation schemes you can usually statically determine the interdependencies and schedule the order of evaluation at compile time.

I think an interesting approach would be to determine how many concurrent threads the machine can handle efficiently (~1 per core seems a good heuristic) and schedule the evaluations in that many threads at run time. To do it efficiently you may need to do considerable analysis at compile time.

Posted by: Alan at February 16, 2006 08:39 AM

Restricting the number of OS threads to the number of CPUs is definitely a good idea. However, restricting the number of logical threads to the number of CPUs would make the host hardware radically change the semantics of one's programs. For example, take POR(Omega, T), where POR is a parallel Or and Omega does not terminate: it would still terminate and return the right value if both child processes were executed, but not if only one were.
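
To make that POR example concrete, a quick Python sketch (omega just spins forever; a real system would also need some way to kill the losing branch, which this sketch doesn't have):

from concurrent.futures import (FIRST_COMPLETED,
                                ThreadPoolExecutor, wait)

def omega():           # never terminates
    while True:
        pass

def t():
    return True

# With two real threads the T branch completes and POR returns
# True; pinned to a single thread evaluating omega first, we'd
# never get an answer at all.
pool = ThreadPoolExecutor(max_workers=2)
done, _ = wait([pool.submit(omega), pool.submit(t)],
               return_when=FIRST_COMPLETED)
print(next(iter(done)).result())   # True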

But yeah, some sort of static way to find when single-threaded lenient evaluation would have the same effect would probably pay off a lot. (If parallel evaluation is built into the language, I expect the problem to be keeping the number of extraneous threads down rather than ensuring we have enough work for each core ;)

Posted by: pkhuong at February 16, 2006 10:44 AM

I'm planning on doing the concurrency and coordination in the VM, rather than in the language. This is partly so it's guaranteed to be there for any language that wants to use it, partly because it means I can do Clever Things like tag subs as being parallelizable or not without recompiling the code that calls them, and partly because I'm a lot more comfortable doing Clever Things in the VM design than I am in compiler design.

The lenient execution's really more of a side-effect of the thread spawning than anything else -- I hadn't really considered it all that interesting. There are some nifty possibilities for efficiency in there, though, that I hadn't thought about. If I delay actual execution of a sub until its result is used, then some code would be completely skipped in certain circumstances, which'd certainly be a performance win. On the other hand, there's the issue of side-effects, but the design of the system strongly discourages side-effects, since they're such a pain to deal with when concurrency is involved.
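
Something like this, maybe -- a throwaway Python sketch, with names that are mine alone:

class Lazy:
    # Don't run the sub at all until (unless) its result is read.
    def __init__(self, fn, *args):
        self._fn, self._args = fn, args
        self._done = False

    def result(self):
        if not self._done:
            self._value = self._fn(*self._args)
            self._done = True
        return self._value

def expensive_sub(n):      # placeholder for some costly sub
    return n * n

x = Lazy(expensive_sub, 42)    # nothing has run yet
# If x.result() is never called, expensive_sub is skipped entirely.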

I may have to think about that some, though given I've a reasonably short timeline here to get it up and running, it's something that may well just have to wait until later. Interesting, though.

Posted by: Dan at February 16, 2006 12:17 PM

You should look up the research on futures. What you are proposing is basically to wrap every expression in a future and use the future infrastructure to manage synchronization.

Posted by: Jay McCarthy at February 18, 2006 02:17 PM

Have you come across Cilk (a C derivative used for writing chess programs, etc.) and the related FFTW project? They essentially do create a lightweight thread for function calls and resync automatically on return. Looked impressive, though I never actually dug into the code myself. They have some papers that might be helpful to you.

Posted by: Anon at February 21, 2006 08:53 AM