January 23, 2006

Threading primitives for a new machine

As I said earlier, I'm working on the design of an embeddable VM geared towards heavy multithreading and vector math operations. Not exactly commonly trodden ground, which is part of what makes it interesting. What I'm poking around with right now is what the system needs to provide to make the sorts of problems I'm interested in here easy to solve. (Or as easy as it gets when you're dealing with potentially massive multithreaded systems.)

The requirements we have for threading are:

  1. We may apply the same algorithm to many blocks of data simultaneously
  2. Each block processed may need to provide data to process the next block, or receive data from the previous block to continue
  3. Many threads may need to synchronize to a clock signal of some sort
  4. Individual threads may need to send data to other single threads
  5. Threads may need to be paused or killed

So basically we may spawn off a dozen or more threads all running the same code, pass in a chunk of data to each thread, and each thread will return a processed version of that chunk to the controlling thread. Those threads may pass the data forward or backward, to the thread processing the next or previous chunk of data. All threads in a group may need to synchronize to one spot. Threads have to be reasonably safely killable or pausable. And we need to be able to send or receive messages between arbitrary threads. This all adds up to an interesting architecture.

From that list it's pretty clear we need a few things, and we're going to end up with some restrictions that we'll chafe at. Such is life -- nothing's free. Anyway, the things we're going to get are:

Ordered Thread Pools

In this case I'm referring to a set of threads, all of which are running the same code, differing only in the parameters they're passed. Each thread in the pool knows where it was spawned in relation to the rest of the threads in the pool, so we can find the next and previous thread in the pool to send data to if we must.

The thread pool may or may not be dynamically mapped to OS threads -- if thirty or forty threads should be spawned, that doesn't mean we'll actually fire off thirty or forty OS threads. We may only fire off ten or twenty (or two or three) and just make it look like there are a huge number, and we may fire off threads dynamically as we need them -- for example, waiting to start a thread to receive data until the sending thread has actually sent the data. Not necessarily on a one-by-one basis, either: if we have forty chunks of data to process, we may only start threads to process chunks one through four, and hold off on starting the thread for chunk five until either chunk one is done being processed or something sends data to the thread processing chunk five.

There are interesting arguments to be made to having the pool be N-dimensionally ordered, rather than 1-dimensionally ordered. That is, rather than just having a next and previous thread (or Nth forward/Nth backward thread) you have a 2 dimensional grid of threads, or a three dimensional cube of threads, which would allow for some interesting simulation capabilities. For now, though, I think we'll pass on that. Maybe later.
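To make the 1-dimensional case concrete, here's a minimal sketch in Go -- the Worker type, the channel wiring, and the stand-in "processing" are all made up for the example, not part of the VM's actual design. Each worker knows its position in the pool and holds channels to its neighbors, so data can be handed forward or backward:

```go
package main

import (
	"fmt"
	"sync"
)

// Worker is a hypothetical pool member: it knows its index and has a
// channel toward each neighbor in pool order.
type Worker struct {
	index int
	next  chan int // data flowing toward the next chunk's worker
	prev  chan int // data flowing from the previous chunk's worker
}

func main() {
	const poolSize = 4

	// one link between each pair of adjacent workers
	links := make([]chan int, poolSize+1)
	for i := range links {
		links[i] = make(chan int, 1)
	}

	workers := make([]*Worker, poolSize)
	for i := 0; i < poolSize; i++ {
		workers[i] = &Worker{index: i, next: links[i+1], prev: links[i]}
	}

	var wg sync.WaitGroup
	for _, w := range workers {
		wg.Add(1)
		go func(w *Worker) {
			defer wg.Done()
			// every worker runs the same code, differing only in its parameters
			chunk := w.index * 100 // stand-in for "this worker's block of data"
			processed := chunk + 1 // stand-in for the real algorithm
			if w.index < poolSize-1 {
				w.next <- processed // hand a result forward to the next worker
			}
			if w.index > 0 {
				fromPrev := <-w.prev // wait on data from the previous worker
				fmt.Printf("worker %d got %d from worker %d\n", w.index, fromPrev, w.index-1)
			}
		}(w)
	}
	wg.Wait()
}
```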

Messages

Threads are going to have to sling data between them. For this to work we need messages of some sort. The fastest and safest, though most annoying, way to do this is with read-only messages slung between the threads. That is, rather than some sort of shared data that gets accessed by multiple threads, the data sent in a message is read-only, unalterable by either the sender or receiver. Keeping everything read-only (to the point where we may make a copy on send, though there are time issues there if the data being sent is large) makes life simpler, requiring much less synchronization.
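A minimal sketch of the copy-on-send variant, again in Go and again with invented names (Message, Send); the point is just that once a message is sent, neither side can alter what the other sees:

```go
package main

import "fmt"

// Message carries a payload that's treated as read-only once sent.
type Message struct {
	From    int
	Payload []byte
}

// Send copies the payload so later changes by the sender can't be seen
// by the receiver -- the "copy on send" approach mentioned above.
func Send(to chan Message, from int, payload []byte) {
	copied := make([]byte, len(payload))
	copy(copied, payload)
	to <- Message{From: from, Payload: copied}
}

func main() {
	port := make(chan Message, 1)
	data := []byte("chunk 1 results")

	Send(port, 0, data)
	data[0] = 'X' // sender mutates its buffer after sending...

	msg := <-port
	fmt.Printf("receiver still sees: %q\n", msg.Payload) // ...receiver's copy is unchanged
}
```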

Message Ports

If we're sending data between threads we have to have some place to send the data to, and for that we need message ports. Or sockets, or whatever you want to call them. I like message ports, but I started doing IPC on the Amiga and I'm the one at the keyboard, so you get to bear with me. A message port is a place messages get sent to. Each thread has one, each thread pool has one, and threads can potentially create more if they want. Ports have names so they can be looked up by name if need be, since the defaults a thread can find (its pool port, its own port, and the port of threads forward and backward in the pool list) may not suffice for all activities.

Threads can wait on multiple message ports at once, and all inbound messages are tagged with the thread they came from and the port they were sent to, as well as information for replying if need be.
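Here's a rough sketch of named ports, lookup by name, and waiting on more than one port at once, with each inbound message tagged with its sender and destination port. The registry, the port names, and the Message fields are just illustrative -- none of these names are final:

```go
package main

import (
	"fmt"
	"sync"
)

// Message is tagged with the thread it came from and the port it was sent to.
type Message struct {
	FromThread int
	ToPort     string
	Body       string
}

// Registry maps port names to ports so they can be looked up by name.
type Registry struct {
	mu    sync.Mutex
	ports map[string]chan Message
}

func (r *Registry) Create(name string) chan Message {
	r.mu.Lock()
	defer r.mu.Unlock()
	p := make(chan Message, 8)
	r.ports[name] = p
	return p
}

// Lookup finds a port by name, for when the defaults (own port, pool port,
// neighbor ports) aren't enough.
func (r *Registry) Lookup(name string) (chan Message, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.ports[name]
	return p, ok
}

func main() {
	reg := &Registry{ports: map[string]chan Message{}}
	own := reg.Create("thread-3")
	pool := reg.Create("pool-A")

	if p, ok := reg.Lookup("pool-A"); ok {
		p <- Message{FromThread: 7, ToPort: "pool-A", Body: "sync point reached"}
	}
	own <- Message{FromThread: 2, ToPort: "thread-3", Body: "here's your chunk"}

	// waiting on multiple message ports at once
	for i := 0; i < 2; i++ {
		select {
		case m := <-own:
			fmt.Printf("from thread %d on %s: %s\n", m.FromThread, m.ToPort, m.Body)
		case m := <-pool:
			fmt.Printf("from thread %d on %s: %s\n", m.FromThread, m.ToPort, m.Body)
		}
	}
}
```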

Optional message replies and acknowledgement

When you're slinging messages around you need to decide how they're handled. Are they always asynchronous? Do you require that the sender is notified? Do you allow the receiver to reply to the message? And do you leave this sort of thing up to the application code or provide the primitives?

Well, for something like this it makes sense to wedge it in at the lowest possible level, so it can be done efficiently. That means there's a send, send-noted, and send-and-wait, as well as a receive, receive-acknowledge, message-acknowledge, and message-reply.
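A sketch of how plain send versus send-and-wait might look, using a per-message reply channel. It's in Go, the function names echo the primitives above but the signatures are guesses, and send-noted and the separate acknowledge variants are left out to keep it short:

```go
package main

import "fmt"

// Message carries an optional reply channel back to a waiting sender.
type Message struct {
	Body  string
	reply chan string
}

// Send is fire-and-forget: no acknowledgement expected.
func Send(port chan *Message, body string) {
	port <- &Message{Body: body}
}

// SendAndWait blocks until the receiver acknowledges or replies.
func SendAndWait(port chan *Message, body string) string {
	m := &Message{Body: body, reply: make(chan string, 1)}
	port <- m
	return <-m.reply
}

// Reply acknowledges a message, waking a blocked SendAndWait sender.
// Plain sends have no reply channel, so the acknowledgement is dropped.
func (m *Message) Reply(answer string) {
	if m.reply != nil {
		m.reply <- answer
	}
}

func main() {
	port := make(chan *Message, 4)

	go func() {
		for m := range port {
			fmt.Println("received:", m.Body)
			m.Reply("processed " + m.Body) // the message-reply/acknowledge step
		}
	}()

	Send(port, "async chunk")                      // plain send
	answer := SendAndWait(port, "important chunk") // send-and-wait
	fmt.Println("sender got back:", answer)
}
```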

No shared data

Or almost none. Shared data requires synchronized access, which is expensive, has potential deadlock problems, and gets in the way of being able to kill a thread. Every single shared resource is a potential place a thread can leave things in an inconsistent state if killed, and every lock on a shared object is a place that you can deadlock if a thread is paused. Every shared resource is a spot where programmers can potentially screw up and deadlock things as well, and that's no good either.

What little shared data there is will be read-only and can't be locked. Accessing the data gets you either a copy or a read-only version -- altering global data will either be impossible or will replace the shared data, leaving any thread that has already fetched the data with the old version. (Shared data is handled by reference rather than by value, with a shared update swapping in a new reference.)

Handling shared data this way makes it easier to kill threads -- since a thread never actually has a lock on any shared data there are fewer things to get messed up if we kill off a thread, and fewer potential deadlock spots if we pause a thread.
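A minimal sketch of the replace-the-reference behavior, modeled here with an atomic pointer swap in Go (the Config type is just a stand-in for whatever the shared data actually is):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Config stands in for some piece of shared, read-only data.
type Config struct {
	Threshold int
	Label     string
}

// shared holds the current read-only version; updates swap in a new one.
var shared atomic.Pointer[Config]

func main() {
	shared.Store(&Config{Threshold: 10, Label: "v1"})

	// A thread fetches the data: a read-only reference, never a lock.
	mine := shared.Load()

	// A global update replaces the shared reference with a new version...
	shared.Store(&Config{Threshold: 20, Label: "v2"})

	// ...but the earlier reader is left with the old version, unlocked and intact.
	fmt.Println("earlier reader sees:", mine.Label)         // v1
	fmt.Println("new readers see:   ", shared.Load().Label) // v2
}
```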

Restricted external access

This is in part because of the embeddable nature of the VM and in part to help with the pause/kill feature. Killing a thread's only a problem when it's holding a lock or has a global resource in an inconsistent state -- if a thread's not doing either of those it can be safely disposed of. Where most of the problems come in when killing a thread (besides the fact that people have too damn much global data in their applications) is in the runtime libraries -- a lot of calls into them fiddle with global data. You can protect against inopportune nuking by wrapping those calls with a lock that'll block thread death, but that gets expensive pretty fast with a lot of calls.

If, on the other hand, your VM makes very few calls into the runtime library and associated support libraries, you can reasonably protect against inopportune death and kill threads off OK. Yes, this means no C runtime library calls or OS calls from within the virtual machine except in a very few specific spots, and it means that each thread will have to manage its own memory pool (so memory allocation's thread-local unless a thread runs out of memory), but that's fine -- we are a relatively specialized VM after all. Those few calls to the outside world we need to make (like for IO) can be delegated to the embedding application and wrapped with locks so a thread in the middle of a callout to the application won't be killable.
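Here's an illustrative sketch of the "callouts protected by a lock that blocks thread death" idea, in Go with invented names: the thread takes a shield lock around any call out to the embedding application, and whoever wants to kill it has to take the same lock first, so death can never land mid-callout.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// VMThread is a hypothetical VM thread with a shield lock held during callouts.
type VMThread struct {
	shield sync.Mutex // held during callouts; the killer must acquire it too
	dead   bool
}

// Callout runs an embedder-supplied function; the thread can't be killed
// while it's inside.
func (t *VMThread) Callout(f func()) {
	t.shield.Lock()
	defer t.shield.Unlock()
	if t.dead {
		return // already killed; skip the external call
	}
	f()
}

// Kill waits until the thread is outside any callout, then marks it dead.
func (t *VMThread) Kill() {
	t.shield.Lock()
	defer t.shield.Unlock()
	t.dead = true
}

func main() {
	t := &VMThread{}
	done := make(chan struct{})

	go func() {
		t.Callout(func() {
			fmt.Println("doing IO via the embedding application...")
			time.Sleep(50 * time.Millisecond)
		})
		close(done)
	}()

	t.Kill() // either blocks until the callout finishes, or lands before it starts
	<-done
	fmt.Println("thread killed safely; dead =", t.dead)
}
```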

Pause and nuke threads

A lot of the above things make pausing and killing threads easier. We won't necessarily guarantee immediate death or pausing, unless we've got OS support for that sort of thing, but if the above restrictions (no shared data, callouts to places where we'd be vulnerable protected by locks preventing death, external bookkeeping to manage communication bits) are all in place it means we can pretty much have at the threads with an axe and not have to worry, which is a good thing.
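And a sketch of the cooperative version of pause and kill, assuming the signals are only acted on at safe points between chunks of work rather than delivered asynchronously (the signal names and control channel are made up for the example):

```go
package main

import (
	"fmt"
	"time"
)

type signal int

const (
	pauseSig signal = iota
	resumeSig
	killSig
)

func worker(control chan signal, done chan struct{}) {
	defer close(done)
	for {
		// safe point: no locks held, no shared state in flight
		select {
		case s := <-control:
			if s == killSig {
				fmt.Println("killed cleanly at a safe point")
				return
			}
			if s == pauseSig {
				fmt.Println("paused")
				// sit here until resumed or killed
				for s = <-control; s == pauseSig; s = <-control {
				}
				if s == killSig {
					return
				}
				fmt.Println("resumed")
			}
		default:
		}
		time.Sleep(10 * time.Millisecond) // stand-in for processing one chunk
	}
}

func main() {
	control := make(chan signal)
	done := make(chan struct{})
	go worker(control, done)

	control <- pauseSig
	control <- resumeSig
	control <- killSig
	<-done
}
```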

The wrapup bit

Yeah, I know, before anyone says anything, building your own threading system's one of those things you almost never want to do, and 99% of the time that you think you do you're wrong. In this case, though, I've some specialized needs that make it a reasonable thing to do, and while the restrictions needed to make it work out are annoying for general-purpose programming, for special-purpose stuff, well...

It'll be interesting to see how this plays out once it's done and fiddled with. While I certainly wouldn't want to live within these restrictions when writing, say, a database or word processing app, for sub-applications (some graphical work, simulations, game AI, and crypto stuff) where part of the application would benefit from massive multithreading with mostly isolated threads it ought to be interesting to work with.

Posted by Dan at January 23, 2006 04:59 PM
Comments

This new VM wouldn't be for the Cell, would it? It sounds perfectly designed for code running on the SPEs, leaving a more general VM (or even just C/C++ code) to run on the PPE.

Posted by: pete at January 23, 2006 06:16 PM

Nope, not for the cell -- I don't have any hardware that uses it. (I can just see trying to convince the accountant that the PS3 really was a business expense... :)

The VM fits some problems I'm working on, and ought to make things easier even on regular desktop hardware. Ought to be interesting on some of the new multicore chips, though.

Posted by: Dan at January 23, 2006 07:03 PM

Not sure how it fits your requirements, but I just noticed the other day that the Inferno O/S has been open source for a year or two. It has a Communicating Sequential Processes (CSP) model, with cheap threads and transparent distributed processing. Runs either as a full O/S on the bare metal, or in a virtual machine on a host O/S.

It might have some interesting ideas, if nothing else.

Posted by: Phil Rand at January 24, 2006 04:55 PM

I don't know that either of these links will actually be helpful, but they both seem like they might be looking at parts of the problem you are dealing with.

(Termite) http://bc.tech.coop/blog/060122.html
(Vlerq) http://www.equi4.com/vlerq.html

Posted by: Connor Berry at January 27, 2006 04:25 PM

"Shared data requires synchronized access, which is expensive, has potential deadlock problems, and gets in the way of being able to kill a thread."

A lot of work has been done recently on lock-free shared data structures and software transactional memory that eliminate deadlock and deal with the expense of locking somewhat. At the moment I'm reading:

http://www.cl.cam.ac.uk/users/kaf24/papers/phd.pdf

Also get to hear some lectures from this guy in a few weeks time...should be interesting. Anyway, I post this in case you are kinda wanting shared data structures but feel they aren't practical; of course, if you're happy you really don't want them or lock free mechanisms don't solve your issues, that's fine too. :-)

Nice to see you writing again.

Jonathan

Posted by: Jonathan Worthington at January 28, 2006 05:08 PM