Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2009-03-11 13:31:59

On Wed, 11 Mar 2009, Richard Graham wrote:

> Brian,
> Going back over the e-mail trail it seems like you have raised two
> concerns:
> - BTL performance after the change, which I would take to be
> - btl latency
> - btl bandwidth
> - Code maintainability
> - repeated code changes that impact a large number of files
> - A demonstration that the changes actually achieve their goal. As we
> discussed after you got off the call, there are two separate goals here
> - being able to use the btl's outside the context of mpi, but
> within the ompi code base
> - ability to use the btl's in the context of a run-time other than
> orte
> Another concern I have heard raised by others is
> - mpi startup time
> Has anything else been missed here ?  I would like to make sure that we
> address all the issues raised in the next version of the RFC.

I think the umbrella concerns for the final success of the change are btl
performance (in particular, latency and message rates for cache-unfriendly
applications/benchmarks) and code maintainability. In addition, there are
some intermediate change issues I have, in that this project is working
differently than other large changes. In particular, there is/was the
appearance of being asked to accept changes which only make sense if the
btl move is going to move forward, without any way to judge the
performance or code impact because critical technical issues still remain.

The latency/message rate issues are fairly straightforward from an end
measure point-of-view. My concerns on latency/message rate come not from
the movement of the BTL to another library (for most operating systems /
shared library systems that should be negligible), but from the code
changes which surround moving the BTLs. The BTLs are tightly intertwined
with a number of pieces of the OMPI layer, in particular the BML and MPool
frameworks and the ompi proc structure. I had a productive conversation
with Rainer this morning explaining why I'm so concerned about the bml and
ompi proc structures. The ompi proc structure currently acts not only as
the identifier for a remote endpoint, but stores endpoint specific data
for both the PML and BML. The BML structure actually contains each BTL's
per process endpoint information, in the form of the base_endpoint_t*
structures returned from add_procs(). Moving these structures around must
be done with care, as some of the proposals Jeff, Rainer, and I came up
with this morning either induced spaghetti code or greatly increased the
spread of information needed for the critical send path through the memory
space (thereby likely increasing cache misses on send for real

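To make the cache concern concrete, here is a minimal sketch in C of the two kinds of layout under discussion. Every name in it (demo_proc_inline_t, demo_side_tables_t, and so on) is hypothetical and is not taken from the Open MPI tree; the real ompi_proc_t and BML structures are more involved. The first layout reaches a BTL endpoint through a pointer stored in the proc itself; the second treats the proc as a bare identifier and keeps per-layer data in separate tables, which adds dependent loads on the send path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real Open MPI types. */
typedef struct { int peer_rank; } demo_btl_endpoint_t; /* a BTL's per-peer endpoint */
typedef struct { int next_seq; } demo_pml_peer_t;      /* PML per-peer state */

/* Layout A: the proc is both the identifier and the holder of
 * per-layer data, so the send path finds everything from one struct. */
typedef struct {
    int name;                          /* process identifier */
    demo_pml_peer_t *pml_data;         /* PML per-peer data */
    demo_btl_endpoint_t *btl_endpoint; /* endpoint handed back at wire-up */
} demo_proc_inline_t;

/* Layout B: the proc is only an identifier; each layer keeps its own
 * table indexed by the proc's slot number. */
typedef struct {
    int name;
    size_t index;
} demo_proc_split_t;

typedef struct {
    demo_pml_peer_t **pml_table;
    demo_btl_endpoint_t **btl_table;
    size_t nprocs;
} demo_side_tables_t;

/* Layout A lookup: read the proc, then follow one pointer. */
demo_btl_endpoint_t *endpoint_inline(const demo_proc_inline_t *p)
{
    assert(p != NULL);
    return p->btl_endpoint;
}

/* Layout B lookup: read the proc, read the table header, read the
 * table slot, and only then reach the endpoint. Each step can touch
 * a different cache line. */
demo_btl_endpoint_t *endpoint_split(const demo_proc_split_t *p,
                                    const demo_side_tables_t *t)
{
    assert(p != NULL && t != NULL && p->index < t->nprocs);
    return t->btl_table[p->index];
}
```

Both lookups return the same endpoint; the difference is how many distinct memory locations are read on the way there, which is the effect that would need to be measured on a cache-unfriendly benchmark.
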
The code maintainability issue comes from three separate and independent
issues. First, there is the issue of how the pieces of the OMPI layer
will interact after the move. The BML/BTL/MPool/Rcache dance is already
complicated, and care should be taken to minimize that change. Start-up
is also already quite complex, and moving the BTLs to make them
independent of starting other pieces of Open MPI can be done well or can
be done poorly. We need to ensure it's done well, obviously. Second,
there is the issue of wire-up. My impression from conversations with
everyone at ORNL was that this move of BTLs would include changes to allow
BTLs to wire-up without the RML. I understand that Rich said this was not
the case during the part of the admin meeting I missed yesterday, so
that may no longer be a concern. Finally, there has been some discussion,
mainly second hand in my case, about the mechanisms in which the trunk
would be modified to allow for using OMPI without ORTE. I have concerns
that we'd add complexity to the BTLs to achieve that, and again that can
be done poorly if we're not careful. Talking with Jeff and Rainer this
morning helped reduce my concern in this area, but I think it also added
to the technical issues which must be solved to consider this project ready
for movement to the trunk.

There are a couple of technical issues which I believe prevent a
reasonable discussion of the performance and maintainability issues based
on the current branch. I talked about some of them in the previous two
paragraphs, but so that we have a short bullet list, they are:

   - How will the ompi_proc_t be handled? In particular,
     where will PML/BML data be stored, and how will we
     avoid adding new cache misses.
   - How will the BML and MPool be handled? The BML holds
     the BTL endpoint data, so changes have to be made if
     it continues to live in OMPI.
   - How will the modex and the intricate dance with adding
     new procs from dynamic processes be handled?
   - How will we handle the progress mechanisms in cases where
     the MTLs are used and the BTLs aren't needed by the RTE?
   - If there are users outside of OMPI, but who want to also use
     OMPI, how will the library versioning / conflict problem be

> As was mentioned before, our time frame for this is measured in weeks,
> and not in months.  I believe the date of May 1st was mentioned to
> coincide with the next feature release.

While I understand your deadline, we have in the past been very
conservative with such large changes. The C/R work was delayed for over a
year because people were concerned with the impact to performance and
maintainability. ORTE work is consistently delayed in the name of code
stability. I believe that changing our desire for high quality code in
the trunk because of an organization's deadline (particularly when other
organizations are successfully using branches to meet their deadlines)
sets a poor precedent and goes against previous precedents.

Similarly, my concern with the intermediate changes which have been
proposed or have occurred comes from the slippery-slope argument. Changes which
are really only necessary for the btl move (even general code cleanups)
should only occur once we're all sure the btl move will work. Otherwise,
we're impacting other developers (many of whom are working on temp branches
attempting to get a feature to completion, as our normal process
dictates) in order to reach an end point which may not be achievable. In
talking to Rainer this morning with Jeff, I think we came up with a number
of ideas on how to mitigate this impact and find a better balance which
allows ORNL to answer the critical technical questions (which are not just
mine, but are shared by others and are critical to the "make it work" part
of the process) and allows the rest of the community some belief that we
can avoid any permanent harm if the move doesn't work out.

> One thing that should help when the naming changes are applied is that
> this is scripted, and the script can be made available for others that
> are working on temp branches, which includes us, also.

That unfortunately doesn't help other developers, if they're trying to
strictly follow the version control changes to the trunk. The problem is
that we're going to get all those moves (hopefully the script now svn moves
instead of svn rm / svn add) through the version control system. The
script would then cause all the changes to occur a second time, and that
could be very problematic. The problem with the version control changes
filtering down is that it is not all-encompassing. For example, svn will
have problems if the btl directory moves but I have my own private special
BTL. Yes, I might be able to use your scripts to handle that, but if they
aren't written with that scenario in mind, they won't help. It also won't
help if I've added a particular file to an existing BTL and the BTL then

I think these cases are worth the pain to non-ORNL developers *IF* all the
other issues are addressed. Otherwise, we're unfairly asking them to deal
with a radically changing code base for an incomplete project, a situation
we've worked to avoid in the past.

Hopefully this explains my thoughts on the btl move. I'm not opposed to
the move itself (although I reserve the right to become opposed, based on
performance and maintainability issues). I have a problem with the change
in process from previous large, invasive changes.

Hope this helps,