
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Fwd: OpenMPI changes
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-03-04 19:26:11

I'll try to get the code into the trunk before I go on vacation for a week on Fri. If not, I'll let you know and get it in the week I get back (3/17).
Basically, all I do is define an event in our event library that "fires" to
send a message to you when the defined trigger occurs.

If that is all you need, though, we could explore providing callback
capability inside of ORTE for it. The data you are looking for is sitting in
objects inside a set of global arrays - you could easily pick that off. We
could try having the event call your callback function and let you do
something with the data (I imagine display the status or something?).

We would just need to be careful about not getting locked in there and
blocking other events from "firing" as they are needed to do things like
respond to errors.

I'm willing to create it on a tmp branch to try when I get back, if that's
something you want to do. Either way, we'll have the tool library interface
to fall back upon as it is needed by other tools.


On 3/4/08 4:41 PM, "Greg Watson" <g.watson_at_[hidden]> wrote:

> Ralph,
>
> Looking at PTP, the only thing we need is to query the process information (PID, rank, node) when the job is created. Perhaps if only queries are allowed from callbacks then recursion would be eliminated? If you can get this functionality into your new interface and back in the trunk, I'll take a look at porting PTP to use it.
>
> Thanks,
> Greg
> On Mar 4, 2008, at 6:14 PM, Ralph Castain wrote:
>> Yeah, the problem we had in the past was:
>>
>> 1. Something would trigger in the system - e.g., a particular job state was reached. This would cause us to execute a callback function via the GPR.
>> 2. The callback function would take some action. Typically, this involved sending out a message or calling another function. Either way, the eventual result of that action would be to cause another GPR trigger to fire - either the job or a process changing state.
>>
>> This loop would continue ad infinitum. Sometimes, I would see stack traces hundreds of calls deep. Debugging and maintaining something that intertwined was impossible.
>> People tried to impose order by establishing rules about what could and could not be called from various situations, but that also proved intractable. The problem was that we could get it to work for a "normal" code path, but all the variety of failure modes, combined with all the flexibility built into the code base, created so many code paths that you inevitably wound up deadlocked under some corner-case conditions. Which we generally agreed was unacceptable.
>>
>> It -is- possible to have callback functions that avoid this situation. However, it is very easy to make a mistake and "hang" the whole system. It just seemed easier to avoid the entire problem. (I don't get that option!)
>> The ability to get an allocation without launching is easy to add.
>>
>> I/O forwarding is currently an issue. Our IOF doesn't seem to like it when I try to create an "alternate" tap (the default always goes back through the persistent orted, so the tool looks like a second "tap" on the flow). This is noted as a "bug" on our tracker, and I expect it will be addressed prior to releasing 1.3. I will ask that it be raised in priority.
>>
>> I'll review what I had done and see about bringing it into the trunk by the end of the week.
>>
>> Ralph
>> On 3/4/08 4:00 PM, "Greg Watson" <g.watson_at_[hidden]> wrote:
>>> I don't have a problem using a different interface, assuming it's adequately supported and provides the functionality we need. I presume the recursive behavior you're referring to is calling OMPI interfaces from the callback functions. Any event-based system has this issue, and it is usually solved by clearly specifying the allowable interfaces that can be called (possibly none). Since PTP doesn't call OMPI functions from callbacks, it's not a problem for us if no interfaces can be called.
>>>
>>> The major missing features appear to be:
>>>
>>> - Ability to request a process allocation without launching the job
>>> - I/O forwarding callbacks
>>>
>>> Without these, PTP support will be so limited that I'd be reluctant to say we support OMPI.
>>>
>>> Greg
>>> On Mar 4, 2008, at 4:50 PM, Ralph H Castain wrote:
>>>> It is buried deep down in the thread, but I'll just reiterate it here. I have "restored" the ability to "subscribe" to changes in job, proc, and node state via OMPI's tool interface library. I have -not- checked this into the trunk yet, though, until the community has a chance to consider whether or not it wants it.
>>>>
>>>> Restoring the ability to have such changes "callback" to user functions raises the concern again about recursive behavior. We worked hard to remove recursion from the code base, and it would be a concern to see it potentially re-enter.
>>>>
>>>> I realize there is some difference between ORTE calling back into itself vs calling back into a user-specified function. However, unless that user truly understands ORTE/OMPI and takes considerable precautions, it is very easy to recreate the recursive behavior without intending to do so.
>>>>
>>>> The tool interface library was built to accomplish two things:
>>>>
>>>> 1. help reduce the impact on external tools of changes to ORTE/OMPI interfaces, and
>>>> 2. provide a degree of separation to prevent the tool from inadvertently causing OMPI to "behave badly"
>>>>
>>>> I think we accomplished that - I would encourage you to at least consider using the library. If there is something missing, we can always add it.
>>>>
>>>> Ralph
>>>> On 3/4/08 2:37 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>> Greg --
>>>>>
>>>>> I admit to being a bit puzzled here. Ralph sent around RFCs about these changes many months ago. Everyone said they didn't want this functionality -- it was seen as excess functionality that Open MPI didn't want or need -- so it was all removed.
>>>>>
>>>>> As such, I have to agree with Ralph that it is an "enhancement" to re-add the functionality. That being said, patches are always welcome! IBM has signed the OMPI 3rd party contribution agreement, so it could be contributed directly.
>>>>>
>>>>> Sidenote: I was also under the impression that PTP was being re-geared towards STCI and moving away from ORTE anyway. Is this incorrect?
>>>>> On Mar 4, 2008, at 3:24 PM, Greg Watson wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Ralph informs me that significant functionality has been removed from ORTE in 1.3. Unfortunately this functionality was being used by PTP to provide support for OMPI, and without it, it seems unlikely that PTP will be able to work with 1.3. Apparently restoring this lost functionality is an "enhancement" of 1.3, and so is something that will not necessarily be done. Having worked with OMPI from a very early stage to ensure that we were able to provide robust support, I must say it is a bit disappointing that this approach is being taken. I hope that the community will view this "enhancement" as worthwhile.
>>>>>>
>>>>>> Regards,
>>>>>> Greg
>>>>>> Begin forwarded message:
>>>>>>> On 2/29/08 7:13 AM, "Gregory R Watson" <grw_at_[hidden]> wrote:
>>>>>>>> Ralph Castain <rhc_at_[hidden]> wrote on 02/29/2008 12:18:39 AM:
>>>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>>>>>>> Date: 02/29/08 12:18 AM
>>>>>>>>> To: Gregory R Watson/Watson/IBM_at_IBMUS
>>>>>>>>> Subject: Re: OpenMPI changes
>>>>>>>>> Hi Greg
>>>>>>>>>
>>>>>>>>> All of the prior options (and some new ones) for spawning a job are fully supported in the new interface. Instead of setting them with "attributes", you create an orte_job_t object and just fill them in. This is precisely how mpirun does it - you can look at that code if you want an example, though it is somewhat complex. Alternatively, you can look at the way it is done for comm_spawn, which may be more analogous to your situation - that code is in ompi/mca/dpm/orte.
>>>>>>>>> All the tools library does is communicate the job object to the target persistent daemon so it can do the work. This way, you don't have to open all the frameworks, deal directly with the plm interface, etc. Alternatively, you are welcome to do a full orte_init and use the frameworks yourself - there is no requirement to use the library. I only offer it as an alternative.
>>>>>>>> As far as I can tell, neither API provides the same functionality as that available in 1.2. While this might be beneficial for OMPI-specific activities, the changes appear to severely limit the interaction of tools with the runtime. At this point, I can't see either interface supporting PTP.
>>>>>>> I went ahead and added a notification capability to the system - took about 30 minutes. I can provide notice of job and process state changes since I see those. Node state changes, however, are different - I can notify on them, but we have no way of seeing them. None of the environments we support tell us when a node fails.
>>>>>>>>> I know that the tool library works because it uses the same APIs as comm_spawn and mpirun. I have also tested them by building my own tools.
>>>>>>>> There's a big difference between being on a code path that *must* work because it is used by core components, and one that is provided as an add-on for external tools. I may be worrying needlessly if this new interface becomes an "officially supported" API. Is that planned? At a minimum, it seems like it's going to complicate your testing process, since you're going to need to provide a separate set of tests that exercise this interface independent of the rest of OMPI.
>>>>>>> It is an officially supported API. Testing is not as big a problem as you might expect since the library exercises the same code paths as mpirun and comm_spawn. Like I said, I have written my own tools that exercise the library - no problem using them as tests.
>>>>>>>>> We do not launch an orted for any tool-library query. All we do is communicate the query to the target persistent daemon or mpirun. Those entities have recv's posted to catch any incoming messages and execute the request.
>>>>>>>>>
>>>>>>>>> You are correct that we no longer have event-driven notification in the system. I repeatedly asked the community (on both devel and core lists) for input on that question, and received no indications that anyone wanted it supported. It can be added back into the system, but would require the approval of the OMPI community. I don't know how problematic that would be - there is a lot of concern over the amount of memory, overhead, and potential reliability issues that surround event notification. If you want that capability, I suggest we discuss it, come up with a plan that deals with those issues, and then take a proposal to the devel list for discussion.
>>>>>>>>> As for reliability, the objectives of the last year's effort were precisely scalability and reliability. We did a lot of work to eliminate recursive deadlocks and improve the reliability of the code. Our current testing indicates we had considerable success in that regard, particularly with the recursion elimination commit earlier today.
>>>>>>>>>
>>>>>>>>> I would be happy to work with you to meet the PTP's needs - we'll just need to work with the OMPI community to ensure everyone buys into the plan. If it would help, I could come and review the new arch with the team (I already gave a presentation on it to IBM Rochester MN) and discuss required enhancements.
>>>>>>>> PTP's needs have not changed since 1.0. From our perspective, the 1.3 branch simply removes functionality that is required for PTP to support OMPI. It seems strange that we need "approval of the OMPI community" to continue to use functionality that has been available since 1.0. In any case, there are unfortunately no resources to work on the kind of re-engineering that appears to be required to support 1.3, even if it did provide the functionality we need.
>>>>>>> Afraid I have to be driven by the OMPI community's requirements since they pay my salary :-) What they need is a "lean, mean, OMPI machine" as they say, and (for some reason) they view the debugger community as consisting of folks like totalview, vampirtrace, etc. - all of whom get involved (either directly or via one of the OMPI members) in the requirements discussions.
>>>>>>>
>>>>>>> Can't argue with business decisions, though. I gather there was some mention of PTP at the recent LANL/IBM RR meeting, so I'll let people know that PTP won't be an option on RR.
>>>>>>>
>>>>>>> And I'll see if there is any interest here in adding 1.3 support to PTP ourselves - from looking at your code, I think it would take about a day, assuming someone more familiar with PTP will work with me.
>>>>>>>
>>>>>>> Take care
>>>>>>> Ralph
>>>>>>>> Greg
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]