On Aug 11, 2007, at 12:35 PM, Brian Hurt wrote:
> You guys might also want to take a look at the Cilk programming
> language, and how it manages threads. If you know C, learning Cilk
> is about 2 hours of work, as it's C with half a dozen extra
> keywords and a few new concepts. I'd love to see Cilk - C +
> Haskell as a programming language.
It was called pH, and we (meaning Alejandro Caro and myself)
implemented it back in the mid/late 90's using Lennart Augustsson's
hbcc front end (which he hacked to add a bunch of pH-specific
syntax). Arvind and Nikhil wrote a textbook on pH programming.
There are still two problems: one is that laziness means you can't
actually prove you need something until very close to the time you
actually want it. By the time I know that I'm adding f x to g y,
it's probably too late to usefully run them in parallel (unless
they're both *very* large). We used eager evaluation in pH---to the
point that we actually gave up the ability to manipulate infinite
lazy data structures. In NDP they've done much the same thing, first
instrumenting the program to see that the eagerness they introduce
won't disrupt execution. Even the "par" annotation has this feel: we
are telling the implementation that it's OK to do some computation
even if it isn't yet obvious that we'll need the results.
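In GHC the "par" annotation looks like the following sketch (parSum and
fib are my own illustrative names, not anything from pH or NDP):

```haskell
import GHC.Conc (par, pseq)

-- A minimal sketch of the "par" annotation.  `par` tells the runtime
-- it is OK to evaluate `a` speculatively, before we have proven we
-- need it; `pseq` makes the current thread work on `b` first, so the
-- speculative work has a chance to run on another capability.
parSum :: Int -> Int -> Int
parSum x y = a `par` (b `pseq` (a + b))
  where
    a = fib x
    b = fib y

-- A deliberately expensive function, so the spark has something to do.
fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)
```

Note that `par` is only a hint: without -threaded and multiple
capabilities the spark is simply evaluated on the main thread.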
The second problem is controlling the overhead. More on this below.
> The key idea of Cilk is that it's easier to deparallelize than it
> is to parallelize, especially automatically. So the idea is that
> the program is written incredibly parallel, with huge numbers of
> microthreads, which are (on average) very cheap to spawn. The
> runtime then deparallelizes the microthreads, running multiple
> microthreads sequentially within a single real thread (a worker
> thread). Microthreads that live their entire life within a single
> real thread are cheap to spawn (as in "not much more expensive than
> a normal function call" cheap).
The problem here is that while Cilk spawns are incredibly cheap,
they're still more than a simple procedure call (2-10x as expensive
if my fading memory serves me rightly). Let's imagine we have a
nice, parallelizable computation that we've expressed using recursive
subdivision (the Cilk folks like to use matrix multiplication as an
example here). Near the leaves of that computation we still spend
the majority of our time paying the overhead of spawning. So we end
up actually imposing a depth bound, and writing two versions of our
computation---the one that spawns, which we use for coarse-grained
computations, and the one that doesn't, which we use when computation
gets fine-grained. It makes a really big difference in practice.
The programmer is free to use this trick in any programming
language. But we haven't yet figured out a way to *avoid* the need
to do so. This continues to worry me to this day, because making the
right choices is black magic and specific to a particular combination
of algorithm and machine.
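In Haskell the two-version trick might look like this sketch (the
names and the cutoff value are mine; picking the right threshold is
exactly the machine- and algorithm-specific black magic I mean):

```haskell
import GHC.Conc (par, pseq)

-- Hand-tuned, machine-specific depth bound below which we stop
-- creating parallel work.
cutoff :: Int
cutoff = 18

-- Coarse-grained version: creates parallel work while the problem is
-- large, then switches to the sequential version so the leaves pay no
-- spawning overhead.
pfib :: Int -> Int
pfib n
  | n < cutoff = sfib n              -- fine-grained: ordinary calls
  | otherwise  = a `par` (b `pseq` (a + b))
  where
    a = pfib (n - 1)
    b = pfib (n - 2)

-- Sequential version, used near the leaves.
sfib :: Int -> Int
sfib n | n < 2     = n
       | otherwise = sfib (n - 1) + sfib (n - 2)
```

The duplication of pfib and sfib is precisely the annoyance: the two
functions differ only in whether they create work.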
That said, there is some reason for optimism: the overhead of
creating work in Cilk is comparable to the overhead of creating a
thunk in Haskell.
> The problem that Cilk runs into is that it's, well, C. It doesn't
> deal with contention well at all: a single microthread blocking
> blocks the whole worker thread, meaning, among other things, that
> you can have "false deadlocks", where one microthread blocks on
> another microthread in the same real thread, and thus acts like
> it's deadlocked even though it really isn't.
This is actually a fundamental problem with the threading model:
there is no guarantee of fairness using work stealing, so if you do
something that requires fair scheduling you get into a lot of trouble
fast. It's not fair to blame C for this. You have to be very
careful to define the interaction between fair IO-style threads and
unfair get-my-work-done threads.
> You have greatly increased the likelihood of race conditions as well
> (mutable data and multithreading just don't mix). Plus you have
> all the normal fun you have with C bugs: wild pointers, buffer
> overruns, and so on.
This, however, *is* C's fault. :-)
More on pH: we got our programs to scale, but had trouble going past
8 threads. We found ourselves limited by a non-parallel GC (my
fault, but labor-intensive to get right) and the absence of
parallelism in the underlying algorithms. For the latter problem
there simply is no magic bullet.
Haskell-Cafe mailing list