Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Preparations for moving the btl's
From: Richard Graham (rlgraham_at_[hidden])
Date: 2008-12-04 15:01:04


On 12/4/08 2:28 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:

> I guess you lost me on this one. How are the btl's going to push an error "up"
> to a higher layer? The errors could contain an arbitrary amount of information
> in them. Since the btl API's currently only return ints, are you proposing
> that we change all the btl APIs to include a new error structure so we can
> pass detailed error information back to the caller?
>
Yes, this is what I am proposing.
>
> Then the MPI layer would have to call the orte_notifier with the appropriate
> info, since the MPI layer doesn't have the necessary communications
> infrastructure itself to perform the required functions. This would mean that
> every place that calls the BTL's would have to deal with the new API and
> returned error structure, and call orte_notifier if an error was reported.
>
No more than adding it to the btl layer. I think the btl should remain as
simple as possible. There is actually precedent for this in other contexts.
Since the notifier is componentized, I am assuming it is not exposing the
communication details to the calling layer. Also, "every place we call the
btl" is not a large number, and is confined to a small number of components.
>
> Seems like this would proliferate quickly, while having the error reporting
> mechanism right where the error occurs represents the minimal impact and
> maximum flexibility.
>
More flexibility is obtained if the data is passed up the call stack, and
handled by whichever layer wants to handle it.

Rich
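
For readers following the argument, the "pass the error up the call stack"
option discussed above could be sketched in C roughly as follows. This is only
an illustration under stated assumptions: every name in it (sketch_btl_error_t,
sketch_btl_send, sketch_upper_send, sketch_notify_fn, sketch_record) is
hypothetical and is not an existing Open MPI interface, and the 4096-byte limit
is just a stand-in for a real failure such as running out of a resource.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error record; not part of any real Open MPI header. */
typedef struct {
    int  code;          /* 0 means success */
    char detail[128];   /* human-readable description written by the transport */
} sketch_btl_error_t;

/* Hypothetical notifier hook, supplied by whichever upper layer wants one. */
typedef void (*sketch_notify_fn)(const sketch_btl_error_t *err, void *ctx);

/* Hypothetical transport call. It reports failure through *err and never
 * calls a notifier itself, so it carries no dependency on one. */
static int sketch_btl_send(const void *buf, size_t len, sketch_btl_error_t *err)
{
    (void)buf;                      /* payload is not inspected in this sketch */
    err->code = 0;
    err->detail[0] = '\0';
    if (len > 4096) {               /* stand-in failure condition (assumption) */
        err->code = -1;
        snprintf(err->detail, sizeof err->detail,
                 "send of %zu bytes exceeds the 4096-byte sketch limit", len);
    }
    return err->code;
}

/* Upper layer: receives the detailed error and decides whether to notify. */
static int sketch_upper_send(const void *buf, size_t len,
                             sketch_notify_fn notify, void *ctx)
{
    sketch_btl_error_t err;
    int rc = sketch_btl_send(buf, len, &err);
    if (rc != 0 && notify != NULL) {
        notify(&err, ctx);          /* e.g. forward to a scheduler or a log */
    }
    return rc;
}

/* Example hook: copies the description into a caller-provided 128-byte buffer. */
static void sketch_record(const sketch_btl_error_t *err, void *ctx)
{
    char *out = ctx;
    strncpy(out, err->detail, 127);
    out[127] = '\0';
}
```

The trade-off raised in the thread is visible here: the transport stays free of
notifier dependencies, but every caller of sketch_btl_send has to handle the
returned structure.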
>
>
> On Dec 4, 2008, at 12:07 PM, Richard Graham wrote:
>
>> Not exactly, it depends on what you push up the stack. If you push just an
>> error code, then you are right, there is very little value. However, if you
>> push up the error strings (or something like that), and have an upper layer
>> interact with SLURM or Moab's error reporting system, the btl's don't need to
>> learn about and depend on a new interface.
>>
>> Rich
>>
>>
>> On 12/4/08 12:47 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>
>>
>>> That was my thought exactly. And since the point of the notifier
>>> component is to return a *useful* description of what failure the BTL had
>>> (like IB ran out of resource X again), that will be lost if we just push
>>> that up to the next layer.
>>>
>>> Just my $0.02, of course.
>>>
>>> Brian
>>>
>>> On Thu, 4 Dec 2008, Ralph Castain wrote:
>>>
>>>> > Hmmm...only problem with that idea is that the entities being communicated
>>>> > with (e.g., SLURM, Moab) have no concept of MPI nor any way to communicate
>>>> > via that system. They do, however, have APIs that notifier can call, and
>>>> > know how to speak TCP via their own agreed-upon protocols. And many large
>>>> > systems turn off the TCP btl (all of ours, for example) because it isn't
>>>> > needed and opens additional unnecessary ports.
>>>> > So calling APIs and/or sending messages across the OOB are pretty
>>>> > straightforward. Teaching Moab to understand btl/datatype engine
>>>> > messages (flowing across who knows what transport) is an unlikely thing
>>>> > to happen.
>>>> >
>>>> > Besides, one of the primary reasons for needing to call notifier is a
>>>> > failure in the btl - so relying on the btl to send the message is
>>>> > self-defeating.
>>>> >
>>>> >
>>>> > On Dec 4, 2008, at 10:37 AM, Richard Graham wrote:
>>>> >
>>>> > Here is where I think we should reconsider accessing the
>>>> > notifier component in the btl. It creates dependencies in
>>>> > the btl that are not needed. The idea of a notifier
>>>> > component is a good one, but I would defer using it to upper
>>>> > layers, rather than embedding it in the guts of the
>>>> > communication system. I would be in favor of an approach
>>>> > that sends the information up the call stack. The btl's should
>>>> > not depend on other communication primitives, as they are the
>>>> > communication primitive.
>>>> >
>>>> > Rich
>>>> >
>>>> >
>>>> > On 12/4/08 9:04 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>> >
>>>> > Yes, FTB utilizes the notifier framework. In addition, we have three
>>>> > other components getting ready to be added to that framework that will
>>>> > provide interfaces to Moab, SLURM, and a DOE monitoring program. The
>>>> > first two will require messaging capabilities to tell the schedulers
>>>> > about problem nodes/routes. The latter will also use a messaging
>>>> > protocol, but is mostly aimed at alerting operators to a problem and
>>>> > creating a historical archive.
>>>> >
>>>> > That said, we can expect the use of orte_notifier to spread across
>>>> > the BTL's pretty aggressively in the next few months, and for the
>>>> > notifier API to change/expand as we address these needs.
>>>> >
>>>> > On Dec 4, 2008, at 6:13 AM, Jeff Squyres wrote:
>>>> >
>>>>> > > I think you got it right. And I think we're pretty good in terms of
>>>>> > > BTL usage of ORTE and OPAL (to include the new "notifier" service
>>>>> > > that Ralph put in recently -- what the FTB will likely eventually
>>>>> > > use, I think...?); those interfaces and abstraction barriers are
>>>>> > > technologically enforced. If you break the abstractions, the linker
>>>>> > > will swiftly and unmercifully punish you. (this was exactly [one
>>>>> > > of] the rationale that we used for splitting the code base into
>>>>> > > OPAL, ORTE, and OMPI several years ago)
>>>>> > >
>>>>> > > Greg has already noted on the wiki a few constants used in the BTL's
>>>>> > > that have an OMPI_ prefix that aren't really OMPI values (e.g.,
>>>>> > > OMPI_ENABLE_HETEROGENEOUS_SUPPORT). These come from configure
>>>>> > > (i.e., opal/include/opal_config.h) and were not renamed back when we
>>>>> > > split the code base into OPAL, ORTE, and OMPI. I don't think we had
>>>>> > > a strong reason for not renaming them -- most could probably be
>>>>> > > renamed to OPAL_* -- we just didn't do it then. Perhaps they can be
>>>>> > > changed during the BTL extraction process (I noted this on the wiki).
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > On Dec 3, 2008, at 9:43 PM, Richard Graham wrote:
>>>>> > >
>>>>>> > >> BTW,
>>>>>> > >> I was guessing FTB is Fault Tolerant Backbone, but if not, can
>>>>>> > >> someone tell me what it is? If it is not the latter, what I just
>>>>>> > >> wrote about it makes no sense.
>>>>>> > >>
>>>>>> > >> Rich
>>>>>> > >>
>>>>>> > >>
>>>>>> > >> On 12/3/08 9:34 PM, "Richard Graham" <rlgraham_at_[hidden]> wrote:
>>>>>> > >>
>>>>>>> > >>> The goal is to use the btl's outside of the context of MPI, which
>>>>>>> > >>> was what was in mind from the day the ompi work started over five
>>>>>>> > >>> years ago, but with no other use at the time, things grew up
>>>>>>> > >>> intermingled - no surprise at all. What we are attempting to do
>>>>>>> > >>> is to untangle the existing dependencies, and make a much cleaner
>>>>>>> > >>> distinction between how/what data is passed between layers.
>>>>>>> > >>>
>>>>>>> > >>> I expect this will involve some sort of well defined interface
>>>>>>> > >>> between the btl's and orte, and I don't know if this will also
>>>>>>> > >>> require something like this between the btl's and the pml - I
>>>>>>> > >>> think that interface is rigidly enforced, but am not sure.
>>>>>>> > >>>
>>>>>>> > >>> I expect that explicit calls to FTB in the btl layer would have to
>>>>>>> > >>> be componentized, especially in the context of what is developing
>>>>>>> > >>> in the FT working group of the MPI Forum. Not that FTB is bad in
>>>>>>> > >>> any way, just that it is one of many monitors.
>>>>>>> > >>>
>>>>>>> > >>> We will need to talk about this on a case by case basis, and
>>>>>>> > >>> decide how to proceed. If anyone wants to help, please do.
>>>>>>> > >>>
>>>>>>> > >>> Rich
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> On 12/3/08 3:02 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>>> > >>>
>>>>>>>> > >>>> I managed to execute the modex-less changes pretty much without
>>>>>>>> > >>>> introducing additional ORTE dependencies into the BTL's, though
>>>>>>>> > >>>> there may be some additions as we look at the other BTLs that I
>>>>>>>> > >>>> didn't address. So hopefully that won't contribute too much to
>>>>>>>> > >>>> the issue here.
>>>>>>>> > >>>>
>>>>>>>> > >>>> At the moment, I don't think it matters where notifier sits - it
>>>>>>>> > >>>> might be able to move to OPAL. Only catch will be if some notifier
>>>>>>>> > >>>> component requires communications. I'm thinking of FTB, for
>>>>>>>> > >>>> example, and our own local monitoring program that may require
>>>>>>>> > >>>> TCP messaging. We don't currently have anything in OPAL that would
>>>>>>>> > >>>> support an OPAL level messaging system, though perhaps that could
>>>>>>>> > >>>> be resolved.
>>>>>>>> > >>>>
>>>>>>>> > >>>> We also have dependencies where the BTL's will call orte_ess to
>>>>>>>> > >>>> find out what node another proc is on, the node local rank of that
>>>>>>>> > >>>> proc, etc. Those dependencies are likely to grow after the Dec
>>>>>>>> > >>>> meeting (see wiki for that agenda item), and definitely cannot be
>>>>>>>> > >>>> moved to OPAL.
>>>>>>>> > >>>>
>>>>>>>> > >>>> However, note that Rich stated the BTL's were -not- moving to OPAL.
>>>>>>>> > >>>> This begs the question: where -are- they going? Into their own
>>>>>>>> > >>>> layer? Will that layer be somewhere in-between OMPI and ORTE (in
>>>>>>>> > >>>> which case, the ORTE dependencies are moot)?
>>>>>>>> > >>>>
>>>>>>>> > >>>> I note that the wiki page doesn't address any of these questions,
>>>>>>>> > >>>> which is understandable if things are just getting underway. But it
>>>>>>>> > >>>> does sound like this is going to take some thought to ensure we
>>>>>>>> > >>>> don't paint ourselves into a corner.
>>>>>>>> > >>>>
>>>>>>>> > >>>> Ralph
>>>>>>>> > >>>>
>>>>>>>> > >>>>
>>>>>>>> > >>>> On Dec 3, 2008, at 12:10 PM, Jeff Squyres wrote:
>>>>>>>> > >>>>
>>>>>>>>> > >>>> > FWIW, I see lots of notifier calls being added to the BTLs (and
>>>>>>>>> > >>>> > elsewhere throughout the OMPI code base) over time...
>>>>>>>>> > >>>> >
>>>>>>>>> > >>>> > On Dec 3, 2008, at 2:07 PM, Tim Mattox wrote:
>>>>>>>>> > >>>> >
>>>>>>>>>> > >>>> >> The BTLs might have added calls to the notifier framework in
>>>>>>>>>> > >>>> >> their error paths.
>>>>>>>>>> > >>>> >> The notifier framework is currently in the ORTE layer... not
>>>>>>>>>> > >>>> >> sure if we could move it down to OPAL. Ralph, any thoughts on
>>>>>>>>>> > >>>> >> that?
>>>>>>>>>> > >>>> >>
>>>>>>>>>> > >>>> >> On Wed, Dec 3, 2008 at 11:56 AM, Richard Graham <rlgraham_at_[hidden]>
>>>>>>>>>> > >>>> >> wrote:
>>>>>>>>>>> > >>>> >>> George told me about what he is doing, so no changes would be
>>>>>>>>>>> > >>>> >>> committed until George has his changes in.
>>>>>>>>>>> > >>>> >>>
>>>>>>>>>>> > >>>> >>> Are there other changes to the btl's that we should be aware
>>>>>>>>>>> > >>>> >>> of?
>>>>>>>>>>> > >>>> >>>
>>>>>>>>>>> > >>>> >>> Rich
>>>>>>>>>>> > >>>> >>>
>>>>>>>>>>> > >>>> >>>
>>>>>>>>>>> > >>>> >>> On 12/3/08 11:47 AM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>>>>>>>>>>> > >>>> >>>
>>>>>>>>>>>> > >>>> >>>> Terry,
>>>>>>>>>>>> > >>>> >>>>
>>>>>>>>>>>> > >>>> >>>> I'm involved [to some degree] in both efforts and I can
>>>>>>>>>>>> > >>>> >>>> confirm these two efforts will not affect each other in
>>>>>>>>>>>> > >>>> >>>> any bad way.
>>>>>>>>>>>> > >>>> >>>>
>>>>>>>>>>>> > >>>> >>>> george.
>>>>>>>>>>>> > >>>> >>>>
>>>>>>>>>>>> > >>>> >>>> On Dec 3, 2008, at 11:42, Terry Dontje wrote:
>>>>>>>>>>>> > >>>> >>>>
>>>>>>>>>>>>> > >>>> >>>>> I don't have any *strong* objections. However, I know that
>>>>>>>>>>>>> > >>>> >>>>> Eugene and George B have been working on some Fastpath code
>>>>>>>>>>>>> > >>>> >>>>> changes, so we should make sure neither project obliterates
>>>>>>>>>>>>> > >>>> >>>>> the other.
>>>>>>>>>>>>> > >>>> >>>>>
>>>>>>>>>>>>> > >>>> >>>>> --td
>>>>>>>>>>>>> > >>>> >>>>>
>>>>>>>>>>>>> > >>>> >>>>> Richard Graham wrote:
>>>>>>>>>>>>>> > >>>> >>>>>> Now that 1.3 will be released, we would like to go ahead
>>>>>>>>>>>>>> > >>>> >>>>>> with the plan to move the btl's out of the MPI layer. Greg
>>>>>>>>>>>>>> > >>>> >>>>>> Koenig who is doing most of the work has started a wiki page
>>>>>>>>>>>>>> > >>>> >>>>>> with details on the plans. Right now details are sketchy, as
>>>>>>>>>>>>>> > >>>> >>>>>> Greg is digging through the code, and has only hand written
>>>>>>>>>>>>>> > >>>> >>>>>> notes on data structures that need to be moved, include
>>>>>>>>>>>>>> > >>>> >>>>>> files that are not needed, etc. The page is at:
>>>>>>>>>>>>>> > >>>> >>>>>>
>>>>>>>>>>>>>> > >>>> >>>>>> https://svn.open-mpi.org/trac/ompi/wiki/BTLExtraction
>>>>>>>>>>>>>> > >>>> >>>>>>
>>>>>>>>>>>>>> > >>>> >>>>>> The first three steps basically only involve code motion,
>>>>>>>>>>>>>> > >>>> >>>>>> moving items such as ompi_list, and renaming them, moving
>>>>>>>>>>>>>> > >>>> >>>>>> where the code is actually located in the repository, and
>>>>>>>>>>>>>> > >>>> >>>>>> the like. For these we do not plan to put out a formal RFC,
>>>>>>>>>>>>>> > >>>> >>>>>> but comments are very welcome, and any hands that are
>>>>>>>>>>>>>> > >>>> >>>>>> willing to help with this are even more welcome.
>>>>>>>>>>>>>> > >>>> >>>>>>
>>>>>>>>>>>>>> > >>>> >>>>>> The last phase where the btl's are made dependent on OPAL,
>>>>>>>>>>>>>> > >>>> >>>>>> and supporting libraries such as mpools I expect will be
>>>>>>>>>>>>>> > >>>> >>>>>> disruptive, and will definitely require an RFC, and will
>>>>>>>>>>>>>> > >>>> >>>>>> also be a longer process.
>>>>>>>>>>>>>> > >>>> >>>>>>
>>>>>>>>>>>>>> > >>>> >>>>>> Please send comments,
>>>>>>>>>>>>>> > >>>> >>>>>> Rich
>>>>>>>>>>>>>> > >>>> >>>>>>
>>>>>>>>>>>>>> > >>>> >>>>>> _______________________________________________
>>>>>>>>>>>>>> > >>>> >>>>>> devel mailing list
>>>>>>>>>>>>>> > >>>> >>>>>> devel_at_[hidden]
>>>>>>>>>>>>>> > >>>> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> > >>>> >> --
>>>>>>>>>> > >>>> >> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>>>>>>>>> > >>>> >> tmattox_at_[hidden] || timattox_at_[hidden]
>>>>>>>>>> > >>>> >> I'm a bright... http://www.the-brights.net/
>>>>>>>>> > >>>> > --
>>>>>>>>> > >>>> > Jeff Squyres
>>>>>>>>> > >>>> > Cisco Systems
>>>>> > > --
>>>>> > > Jeff Squyres
>>>>> > > Cisco Systems