On Jan 5, 2009, at 10:09 AM, Brian W. Barrett wrote:
> I think this sounds reasonable, if (and only if) MPI_Accumulate is
> properly handled. The interface for calling the op functions was
> broken in some fairly obvious way for accumulate when I was writing
> the one-sided code. I think I had to call some supposedly internal
> bits of the interface to make accumulate work. I can't remember
> what they are now, but I do remember it being a problem.
Coolio; I'll look into it.
> Of course, unless it makes mpi_allreduce on one double-sized
> floating point number using sum go faster, I'm not entirely sure a
> change is helpful ;).
From my (admittedly limited) understanding, since there are memory
registration and/or copy in/out issues with GPUs, the operation has to
be "big enough" and/or already located in GPU memory for the GPU to
outperform the CPU. It is my assumption that the component-ized CUDA/
OpenCL/whatever code will need to decide at run-time whether it should
perform the operation itself or pass it back to a fallback
[probably CPU-based] implementation, analogous to how "tuned" picks
the right coll algorithm.
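
To make that run-time decision a bit more concrete, here is a minimal
sketch in plain C. Everything in it is hypothetical: the names
(dispatch_sum, gpu_sum_stub, cpu_sum), the GPU_MIN_COUNT threshold, and
the "return nonzero to decline" convention are made up for illustration
and are not actual OMPI interfaces. The "GPU" path is just a stand-in
loop so the example builds without CUDA/OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: made-up cutoff below which the accelerator path declines.
 * A real component would have to measure or tune this value. */
#define GPU_MIN_COUNT 4096u

/* Counts how often the stand-in accelerator path accepted the work. */
int gpu_calls = 0;

/* CPU fallback: inout[i] = in[i] + inout[i]. Always succeeds. */
int cpu_sum(const double *in, double *inout, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
    return 0;
}

/* Stand-in for an accelerator-backed sum (hypothetical).
 * Returns 0 if it performed the operation, nonzero if it declined
 * because the operation is too small to be worth offloading. */
int gpu_sum_stub(const double *in, double *inout, size_t count)
{
    if (count < GPU_MIN_COUNT) {
        return -1;  /* decline; caller should fall back */
    }
    ++gpu_calls;
    /* A real component would copy/register buffers and launch a kernel
     * here; this loop only mimics the result. */
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
    return 0;
}

/* Try the accelerator path first; if it declines, use the CPU. */
int dispatch_sum(const double *in, double *inout, size_t count)
{
    assert(count == 0 || (in != NULL && inout != NULL));
    if (gpu_sum_stub(in, inout, count) == 0) {
        return 0;
    }
    return cpu_sum(in, inout, count);
}
```

With this sketch, dispatch_sum() on a single double goes straight to
cpu_sum(), while a buffer of GPU_MIN_COUNT or more elements is handled
by the stand-in accelerator path. Whether the buffer already lives in
GPU memory is not modeled here.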
I'm told that there's some researchy middleware working on exactly
this kind of problem (determining if a given operation is suitable to
run on the GPU or the main CPU). So in a best-case scenario, OMPI can
just link against and use that middleware rather than implementing all
the logic in the component itself. We'll see how it plays out.
My goal is to give these guys the infrastructure that they need in
OMPI to play with these kinds of concepts and see what they can
accomplish in terms of real performance. FWIW: a few SC08 attendees
thought that they could avoid writing much CUDA/CL/whatever code if
MPI_REDUCE did the work for them (particularly if paired with the
proposed MPI_REDUCE_LOCAL function, https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/24).
[shrug] We'll see!