Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] RFC: Component-izing MPI_Op
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-13 11:58:39

On the call today, no one had any objections to bringing this stuff to
the trunk. v1.2.9 and v1.3.0 releases have a higher priority, so I'll
bring this stuff over to the trunk when those two releases are done
(hopefully tomorrow!).

On Jan 10, 2009, at 2:21 PM, Jeff Squyres wrote:

> FWIW, I've finished a first cut of this stuff. I'll provide an
> overview on next Tuesday's teleconf.
> I didn't "fix" MPI_REPLACE yet (it does seem to be a different
> issue; I mainly extended what was already there) but I've done most
> of the rest of the work:
> - Created a new op framework that was inspired by the coll framework.
> - Similar to the "coll" framework, the op framework supports:
>   - Mixing-n-matching op modules on a single MPI_Op
>   - "Stacking" op modules (e.g., choose at invocation time whether
>     a module will use its back-end hardware, or whether it should fall
>     back to a different module's implementation)
> - Unlike the coll framework, all the "basic" functions are in the op
> base and are pre-loaded onto the MPI_Op during selection as the 0th
> priority (so you can stack them naturally -- base functions even
> have a [bogus] module, so you can RETAIN them just like any other
> module) -- there's no "basic" component or set of modules.
> - Created an "example" op component that has a few sample routines
> and shows a bunch of different OMPI concepts, both in the op
> framework and utilizing other parts of the OMPI code base (hopefully
> helpful to newbie OMPI component authors).
> ==> NOTE: The example op is currently fairly chatty with
> opal_output() so that you can see that it is being used.
> I'll .ompi_ignore it (or something) when it is brought into the
> trunk so that the example component isn't active in production runs.
> - Created wiki pages describing autogen, how to create a framework,
> and how to create a component (hopefully helpful to newbie OMPI
> component authors).
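
The "stacking" idea in the list above can be sketched in plain C. This is only an illustration under stated assumptions: `op_module_t`, `op_sum_fn_t`, `hw_available`, `accel_sum`, `base_sum`, and `make_accel_module` are hypothetical names invented here and are not the real op framework types or functions. The sketch shows a higher-priority module that retains the base module and decides at invocation time whether to run its own path or hand the call down the stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; NOT the real OMPI op structs. */
typedef struct op_module op_module_t;

typedef void (*op_sum_fn_t)(const double *in, double *inout, size_t count,
                            const op_module_t *self);

struct op_module {
    const char *name;
    int priority;                /* base functions sit at priority 0 */
    int hw_available;            /* stand-in for "back-end hardware is usable" */
    op_sum_fn_t sum;             /* this module's implementation of SUM */
    const op_module_t *fallback; /* module retained from lower in the stack */
};

/* Base (CPU) implementation: always works, never needs a fallback. */
static void base_sum(const double *in, double *inout, size_t count,
                     const op_module_t *self)
{
    (void)self;
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}

static int accel_calls = 0; /* counts how often the "hardware" path ran */

/* Stacked implementation: chooses at invocation time whether to use its
 * own path or to pass the call to the retained fallback module. */
static void accel_sum(const double *in, double *inout, size_t count,
                      const op_module_t *self)
{
    if (!self->hw_available) {
        assert(self->fallback != NULL && self->fallback->sum != NULL);
        self->fallback->sum(in, inout, count, self->fallback);
        return;
    }
    ++accel_calls;
    /* A real component would offload here; the sketch does the same math. */
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}

static const op_module_t base_module = { "base", 0, 0, base_sum, NULL };

/* Build a higher-priority module stacked on top of the base module. */
static op_module_t make_accel_module(int hw_available)
{
    op_module_t m = { "accel", 10, hw_available, accel_sum, &base_module };
    return m;
}
```

A caller would invoke `m.sum(in, inout, count, &m)` on the top module only; whether the base functions end up doing the work is invisible to the caller, which is the point of retaining them as an ordinary module.
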
> =======================
> I think that the second phase of this work will be the various
> hardware providers providing their components to Open MPI (e.g.,
> cuda, opencl, IBM Cell, ...etc.).
> If this all proves worthwhile, I think a third phase of this work
> could be optimizing the top-level reduction calls based on what
> nodes have hardware acceleration and which do not (e.g., if
> accelerators are not available in all nodes, that may change the
> collection/reduction communication pattern).
> On Jan 5, 2009, at 10:21 AM, Jeff Squyres wrote:
>> On Jan 5, 2009, at 10:09 AM, Brian W. Barrett wrote:
>>> I think this sounds reasonable, if (and only if) MPI_Accumulate is
>>> properly handled. The interface for calling the op functions was
>>> broken in some fairly obvious way for accumulate when I was
>>> writing the one-sided code. I think I had to call some supposedly
>>> internal bits of the interface to make accumulate work. I can't
>>> remember what they are now, but I do remember it being a problem.
>> Coolio; I'll look into it.
>>> Of course, unless it makes mpi_allreduce on one double-sized
>>> floating point number using sum go faster, I'm not entirely sure a
>>> change is helpful ;).
>> From my (admittedly limited) understanding, since there are memory
>> registration and/or copy in/out issues with GPUs, the operation has
>> to be "big enough" and/or already located in GPU memory for the GPU
>> to outperform the CPU. It is my assumption that the component-ized
>> CUDA/OpenCL/whatever code will need to make a decision whether it
>> should perform the operation at run-time or pass it back to a
>> fallback [probably CPU-based] implementation, analogous to how
>> "tuned" picks the right coll algorithm.
>> I'm told that there's some researchy middleware working on exactly
>> this kind of problem (determining if a given operation is suitable
>> to run on the GPU or the main CPU). So in a best-case scenario,
>> OMPI can just link against and use that middleware rather than
>> implementing all the logic in the component itself. We'll see how
>> it plays out.
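
The run-time decision described above (offload only when the operation is "big enough" and/or the data is already in GPU memory) could look roughly like the break-even check below. Everything here is an assumption made for illustration: the `offload_costs_t` fields, the `should_offload` name, and the simple linear cost model are hypothetical, and a real component (or the middleware mentioned above) would presumably measure or tune these values rather than hard-code them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost parameters; real values would have to be measured. */
typedef struct {
    double launch_overhead_s; /* fixed cost of starting any GPU operation */
    double copy_overhead_s;   /* fixed cost per host<->device transfer */
    double copy_bytes_per_s;  /* host<->device copy bandwidth */
    double gpu_elems_per_s;   /* GPU reduction throughput */
    double cpu_elems_per_s;   /* CPU reduction throughput */
} offload_costs_t;

/* Return true if the estimated GPU time (including any copies) is lower
 * than the estimated CPU time for reducing `count` elements. */
static bool should_offload(size_t count, size_t elem_size,
                           bool already_on_device,
                           const offload_costs_t *c)
{
    double cpu_time = (double)count / c->cpu_elems_per_s;
    double gpu_time = c->launch_overhead_s
                    + (double)count / c->gpu_elems_per_s;

    if (!already_on_device) {
        /* Two operands copied in, one result copied out: 3 transfers. */
        double bytes = 3.0 * (double)count * (double)elem_size;
        gpu_time += 3.0 * c->copy_overhead_s
                  + bytes / c->copy_bytes_per_s;
    }
    return gpu_time < cpu_time;
}
```

With any plausible parameters, a reduction of a single double never pays for the fixed overheads, so the call would go to the CPU fallback; large buffers that already live on the device are where the offload path has a chance to win.
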
>> My goal is to give these guys the infrastructure that they need in
>> OMPI to play with these kinds of concepts and see what they can
>> accomplish in terms of real performance. FWIW: a few SC08
>> attendees thought that they could avoid writing much CUDA/CL/
>> whatever code if MPI_REDUCE did the work for them (particularly if
>> paired with the proposed MPI_REDUCE_LOCAL function). [shrug]
>> We'll see!
>> --
>> Jeff Squyres
>> Cisco Systems
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
> --
> Jeff Squyres
> Cisco Systems

Jeff Squyres
Cisco Systems