On the call today, no one had any objections to bringing this stuff to
the trunk. The v1.2.9 and v1.3.0 releases have a higher priority, so I'll
bring this stuff over to the trunk when those two releases are done.
On Jan 10, 2009, at 2:21 PM, Jeff Squyres wrote:
> FWIW, I've finished a first cut of this stuff. I'll provide an
> overview on next Tuesday's teleconf.
> I didn't "fix" MPI_REPLACE yet (it does seem to be a different
> issue; I mainly extended what was already there), but I've done most
> of the rest of the work:
> - Created a new op framework that was inspired by the coll framework.
> - Similar to the "coll" framework, the op framework supports:
> - Mixing-n-matching op modules on a single MPI_Op
> - "Stacking" op modules (e.g., choose at invocation time whether
> a module will use its back-end hardware, or whether it should fall
> back to a different module's implementation)
> - Unlike the coll framework, all the "basic" functions are in the op
> base and are pre-loaded onto the MPI_Op during selection as the 0th
> priority (so you can stack them naturally -- base functions even
> have a [bogus] module, so you can RETAIN them just like any other
> module) -- there's no "basic" component or set of modules. (A rough
> sketch of this stacking idea follows the list below.)
> - Created an "example" op component that has a few sample routines
> and shows a bunch of different OMPI concepts, both in the op
> framework and utilizing other parts of the OMPI code base (hopefully
> helpful to newbie OMPI component authors).
> ==> NOTE: The example op is currently fairly chatty with
> opal_output() so that you can see that it is being used.
> I'll .ompi_ignore it (or something) when it is brought into the
> trunk so that the example component isn't active in production runs.
> - Created wiki pages describing autogen, how to create a framework,
> and how to create a component (hopefully helpful to newbie OMPI
> component authors).
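> To make the stacking idea described above more concrete, here is a
> minimal, purely hypothetical sketch -- every name in it is made up
> for this mail and is *not* the actual op framework API:
>
>   /* Hypothetical stackable op modules (illustrative names only) */
>   #include <stdio.h>
>
>   typedef struct op_module {
>       void (*reduce)(const double *in, double *inout, int count,
>                      struct op_module *module);
>       struct op_module *fallback;   /* next module down the stack */
>   } op_module_t;
>
>   /* 0th-priority "base" implementation: plain CPU loop */
>   static void base_sum(const double *in, double *inout, int count,
>                        op_module_t *module)
>   {
>       (void) module;
>       for (int i = 0; i < count; ++i) {
>           inout[i] += in[i];
>       }
>   }
>
>   /* Hardware module: at invocation time, either use its back-end
>      hardware or hand the work to the next module in the stack */
>   static void hw_sum(const double *in, double *inout, int count,
>                      op_module_t *module)
>   {
>       if (count < 4096) {           /* made-up threshold */
>           module->fallback->reduce(in, inout, count,
>                                    module->fallback);
>           return;
>       }
>       /* ... offload the reduction to the accelerator here ... */
>   }
>
>   int main(void)
>   {
>       op_module_t base = { base_sum, NULL };
>       op_module_t hw   = { hw_sum, &base };
>       double a[4] = { 1, 1, 1, 1 };
>       double b[4] = { 2, 2, 2, 2 };
>
>       hw.reduce(a, b, 4, &hw);      /* small: falls back to base_sum */
>       printf("b[0] = %g\n", b[0]);  /* prints 3 */
>       return 0;
>   }
>
> The real framework obviously also deals with MPI datatypes, module
> retain/release, and per-MPI_Op selection, but that's the general
> shape of it.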
> I think that the second phase of this work will be the various
> hardware vendors contributing their components to Open MPI (e.g.,
> CUDA, OpenCL, IBM Cell, etc.).
> If this all proves worthwhile, I think a third phase of this work
> could be optimizing the top-level reduction calls based on which
> nodes have hardware acceleration and which do not (e.g., if
> accelerators are not available on all nodes, that may change the
> collection/reduction communication pattern).
> On Jan 5, 2009, at 10:21 AM, Jeff Squyres wrote:
>> On Jan 5, 2009, at 10:09 AM, Brian W. Barrett wrote:
>>> I think this sounds reasonable, if (and only if) MPI_Accumulate is
>>> properly handled. The interface for calling the op functions was
>>> broken in some fairly obvious way for accumulate when I was
>>> writing the one-sided code. I think I had to call some supposedly
>>> internal bits of the interface to make accumulate work. I can't
>>> remember what they are now, but I do remember it being a problem.
>> Coolio; I'll look into it.
>>> Of course, unless it makes mpi_allreduce on one double-sized
>>> floating point number using sum go faster, I'm not entirely sure a
>>> change is helpful ;).
>> From my (admittedly limited) understanding, since there are memory
>> registration and/or copy in/out issues with GPUs, the operation has
>> to be "big enough" and/or already located in GPU memory for the GPU
>> to outperform the CPU. It is my assumption that the component-ized
>> CUDA/OpenCL/whatever code will need to decide at run time whether it
>> should perform the operation itself or pass it back to a fallback
>> [probably CPU-based] implementation, analogous to how "tuned" picks
>> the right coll algorithm.
>> I'm told that there's some researchy middleware working on exactly
>> this kind of problem (determining if a given operation is suitable
>> to run on the GPU or the main CPU). So in a best-case scenario,
>> OMPI can just link against and use that middleware rather than
>> implementing all the logic in the component itself. We'll see how
>> it plays out.
>> My goal is to give these guys the infrastructure that they need in
>> OMPI to play with these kinds of concepts and see what they can
>> accomplish in terms of real performance. FWIW: a few SC08
>> attendees thought that they could avoid writing much CUDA/CL/
>> whatever code if MPI_REDUCE did the work for them (particularly if
>> paired with the proposed MPI_REDUCE_LOCAL function,
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/24). [shrug]
>> We'll see!
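>>
>> The appeal is that the application-side code could then stay as
>> simple as the sketch below (again, MPI_REDUCE_LOCAL is only a
>> proposal right now; this assumes the signature from the ticket
>> above):
>>
>>   /* Let the library's op machinery -- CPU or accelerator, whatever
>>      gets selected -- do the local element-wise reduction */
>>   #include <mpi.h>
>>
>>   void combine_partials(double *partial, double *accum, int n)
>>   {
>>       MPI_Reduce_local(partial, accum, n, MPI_DOUBLE, MPI_SUM);
>>   }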