Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-09 16:52:28

Random points in no particular order (Rainer please correct me if I'm
making bad assumptions):

- I believe that ORNL is proposing to do this work on a separate
branch (this is what we have discussed for some time now, and we
discussed this deeply in Louisville). The RFC text doesn't
specifically say, but I would be very surprised if this stuff is
planned to come back to the trunk in the near future -- as we have all
agreed, it's not done yet.

- I believe that the timeout field in RFCs is a limit for
non-responsiveness -- it is mainly intended to prevent people from
ignoring / not responding to RFCs. I do not believe that Rainer was
using that date as a "that's when I'm bringing it all back to the
trunk." Indeed, he specifically called out the 1.5 series as a target
for this work.

- I also believe that Rainer is using this RFC as a means to get
preliminary review of the work that has been done on the branch so
far. He has provided a script that shows what they plan to do, how
the code will be laid out, etc. There are still some important core
issues to be solved -- and, like Brian, I want to see how they'll get
solved before being happy (we have strong precedent for this
requirement) -- but I think all that Rainer was saying in his RFC was
"here's where we are so far; can people review and see if they hate it?"

- It was made abundantly clear in the Louisville meeting that ORTE has
no short-term plans for using the ONET layer (probably no long-term
plans, either, but hey -- never say "never" :-) ). The design of ONET
is such that other RTEs *could* use ONET if they want (e.g., STCI
will), but it is not a requirement for the underlying RTE to use
ONET. We agreed in Louisville that ORTE will provide sufficient stubs
and hooks (all probably effectively no-ops) so that ONET can compile
against it in the default OMPI configuration; other RTEs that want to
do more meaningful stuff will need to provide more meaningful
implementations of the stubs and hooks.

- Hopefully the teleconference time tomorrow works out for Rich (his
communications were unclear on this point). Otherwise, postponing the
admin discussion until April seems problematic.
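The stub-and-hook arrangement described above (and the onet/include/rte.h approach in the RFC quoted below) can be sketched in C roughly as follows. This is a hypothetical illustration only. The `onet_rte_*` and `stci_*` names, the `ONET_RTE_SUCCESS` constant, and the choice of a modex-send hook and a barrier hook are all assumptions. None of it is code from the ONET branch, ORTE, or STCI. The sketch shows how a compile-time switch such as `-DHAVE_STCI` could route ONET's calls either to effectively no-op stubs (the default ORTE build) or to another RTE's real implementations.

```c
#include <stddef.h>

/* Hypothetical sketch of an rte.h-style indirection layer for ONET.
 * All identifiers are illustrative assumptions, not Open MPI APIs. */

#define ONET_RTE_SUCCESS 0

#ifdef HAVE_STCI
/* Another RTE (here a hypothetical STCI binding) provides meaningful
 * implementations; ONET's hook names are mapped onto them. */
int stci_modex_send(const char *key, const void *data, size_t size);
int stci_barrier(void);
#define onet_rte_modex_send(k, d, s) stci_modex_send((k), (d), (s))
#define onet_rte_barrier()           stci_barrier()
#else
/* Default build: stubs that are effectively no-ops, so ONET compiles
 * without requiring anything new from the underlying RTE. */
static inline int onet_rte_modex_send(const char *key, const void *data,
                                      size_t size)
{
    (void)key;   /* unused in the stub */
    (void)data;
    (void)size;
    return ONET_RTE_SUCCESS;
}

static inline int onet_rte_barrier(void)
{
    return ONET_RTE_SUCCESS;
}
#endif

/* Example ONET-side caller: it only ever uses the onet_rte_* names,
 * so it is unaware of which RTE (if any) sits underneath. */
int onet_publish_endpoint(const void *addr, size_t len)
{
    int rc = onet_rte_modex_send("onet.endpoint", addr, len);
    if (rc != ONET_RTE_SUCCESS) {
        return rc;
    }
    return onet_rte_barrier();
}
```

In the default build, the ONET-side caller compiles and links without anything extra from the RTE, because every hook simply reports success. A build that defines `HAVE_STCI` would have to supply the declared functions at link time.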

On Mar 9, 2009, at 4:01 PM, Brian W. Barrett wrote:

> I, not surprisingly, have serious concerns about this RFC. It assumes
> that the ompi_proc issues and bootstrapping issues (the entire point of
> the move, as I understand it) can both be solved, but offers no proof to
> support that claim. Without those two issues solved, we would be left
> with an onet layer that is dependent on ORTE and OMPI, and which OMPI
> depends upon. This is not a good place to be. These issues should be
> resolved before an onet layer is created in the trunk.
>
> This is not an unusual requirement. The fault tolerance work took a very
> long time because of similar requirements. Not only was a full
> implementation required to prove performance would not be negatively
> impacted (when FT wasn't active), but we had discussions about its
> impact on code maintainability. We had a full implementation of all the
> pieces that impacted the code *before* any of it was allowed into the
> trunk.
>
> We should live by the rules the community has set up. They have served
> us well in the past. Further, these are not new objections on my part.
> Since the initial RFCs related to this move started, I have continually
> brought up the exact same questions and never gotten a satisfactory
> answer. This RFC even acknowledges the issues, yet presents no solution
> and still asks to do the most disruptive work. I simply can't see how
> that fits with Open MPI's long-standing development procedures.
>
> If all the issues I've asked about previously (which are essentially the
> ones you've identified in the RFC) can be solved, the impact on code base
> maintainability is reasonable, and the impact on performance is
> negligible, I'll gladly remove my objection to this RFC.
>
> Further, before any work on this branch is brought into the trunk, the
> admin-level discussion regarding this issue should be resolved. At this
> time, that discussion is blocked on ORNL, and they've given April as the
> earliest such a discussion can occur. So at the very least, the RFC
> timeout should be pushed into April or ORNL should revise their
> availability for the admin discussion.
>
> Brian
> On Mon, 9 Mar 2009, Rainer Keller wrote:
> >
> > What: Move BTLs into a separate layer
> >
> > Why: Several projects have expressed interest in using the BTLs.
> > Possible use cases include the RTE using the BTLs for the modex, or
> > tools collecting/distributing data in the fastest possible way.
> >
> > Where: This would affect several components that the BTLs depend on
> > (namely allocator, mpool, rcache and the common part of the BTLs).
> > Additionally, some changes to classes were/are necessary.
> >
> > When: Preferably 1.5 (in case we use the Feature/Stable Release cycle ;-)
> >
> > Timeout: 23.03.2009
> >
> > ------------------------------------------------------------------------
> >
> > There has been much speculation about this project.
> > This RFC should shed some light on it; if more information is required,
> > please feel free to ask or comment. Of course, suggestions are welcome!
> >
> > The BTLs offer access to a fast communication framework. Several
> > projects have expressed interest in using them separately from the
> > other layers of Open MPI. Additionally (with further changes), the BTLs
> > may be used within ORTE itself.
> >
> > The extraction is not easy (much like the extraction of ORTE and OMPI
> > in the early stages of Open MPI?).
> > In order to get as much input and be as visible as possible (e.g., in
> > TRACS), a temporary branch for this work has been set up at:
> >
> > We propose to have a separate ONET library living in onet/, based on
> > ORTE (see the attached figure).
> >
> > In order to keep the diff between the trunk and the branch to a minimum,
> > several cleanup patches have already been applied to the trunk (e.g.,
> > removal of unnecessary #includes of ompi and orte header files,
> > integration of ompi_bitmap_t into opal_bitmap_t, #include "*_config.h").
> >
> >
> > Additionally, a script (contrib/move-btl-into-onet, attached below) has
> > been kept up to date; it performs this separation on a fresh checkout of
> > the trunk:
> > svn list
> > into-onet
> >
> > This script requires several patches (see the attached tarball).
> > Please update the variable PATCH_DIR to match the location of the patches.
> >
> > ./move-btl-into-onet ompi-clean/
> > # Lots of output deleted.
> > cd ompi-clean/
> > rm -fr ompi/mca/common/ # No two mcas called common, too bad...
> > ./
> >
> >
> > A preliminary header file is provided in onet/include/rte.h to
> > accommodate the requirements of other RTEs (such as STCI); it replaces
> > selected functionality, as proposed by Jeff and Ralph in the Louisville
> > meeting. Additionally, this header file is included before the ORTE
> > header files (within onet).
> > By default, this changes nothing in the standard case (ORTE); otherwise,
> > with -DHAVE_STCI, the ORTE functionality required within onet is
> > redefined.
> >
> >
> > TESTS:
> > First tests have been done locally on Linux/x86_64.
> > The branch compiles without warnings.
> > The wrappers have been updated.
> >
> > The Intel Testsuite runs without failures:
> > ./ all_tests_no_perf
> >
> >
> > !!!Before any merge, do extensive performance tests on real machines!!!
> > Initial tests on the cluster smoky show no difference in comparison to
> > the ompi-trunk.
> > Please see the enclosed output of NetPipe-3.7.1 run on a single node
> > (--mca btl sm,self) on smoky.
> >
> >
> > TODOS:
> > There are still some to-dos before this is finalized:
> > - Dependencies of the onet layer on the ompi layer (ompi_proc_t,
> >   ompi_converter). We are working on these, and have briefly talked
> >   about the latter with George.
> > - Better abstraction from orte / cleanups, such as modex
> >
> > If these involve code changes (and not just "safe" and non-intrusive
> > renames), such as an opal_keyval change, we will continue to write RFCs.
> >

Jeff Squyres
Cisco Systems