Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] if btl->add_procs() fails...?
From: Brian Barrett (brbarret_at_[hidden])
Date: 2008-08-02 14:34:14

On Aug 2, 2008, at 11:46, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:

> Jeff Squyres wrote:
>> On Aug 1, 2008, at 11:39 PM, Brian Barrett wrote:
>>> My thought is that if add_procs fails, then that BTL should be
>>> removed (as if init failed) and things should continue on. If
>>> that BTL was the only way to reach another process, we'll catch
>>> that later and abort.
>>> There are always going to be errors that can't be detected until
>>> the device is actually used, so I think that add_procs errors
>>> should be treated the same as init errors. The error_cb is a red
>>> herring, as that's supposed to be used in situations where an
>>> error can't directly be returned to the upper layers (like the
>>> progress function). In this case, we can directly return an
>>> error, so we should do so (and I believe we do, it's the BML/PML
>>> that's the problem).
>> So if add_procs() fails, do you think that the BML/PML should
>> finalize the module? That looks like an easy change to make.
>> Second, if there are no other successfully-add_proc()'ed modules
>> from that component, should the BTL's progress function be removed
>> from the list of progress functions? The real question is: if a
>> module add_procs() fails, do we mandate that it still must be safe
>> to call the component's progress function? I think you're saying
>> "yes", but just wanted to be sure. I don't know offhand how a
>> component's progress function is added to the list (can't check
>> ATM), so I'd have to dig into that a bit.
> I am curious how all of the above affects client/server or spawned
> jobs. If you finalize a BTL then do a connect to a process that
> would use that BTL would it reinitialize itself?

To deal with all the dynamics issues, I wouldn't finalize the BTL.
The BML should handle the progress stuff, just as if the add_procs
succeeded but returned no active peers. But I'd guess that's part of
the bit that doesn't work today. I would further suspect that a BTL
will need to have a working progress function in the face of
add_procs failures to cope with all the dynamics options. I'm
travelling this weekend, so I can't verify any of this at the moment.
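
As a rough illustration of the policy described above, here is a minimal sketch in C. It is not Open MPI code: every name in it (sketch_btl_t, sketch_bml_add_procs, the SKETCH_* codes, and the two stub add_procs functions) is hypothetical, and the real BML/BTL interfaces differ. The sketch assumes that a failed add_procs is treated like a success that returned no active peers, that the module is neither finalized nor dropped from progress, and that an unreachable peer is only reported once every BTL has been tried.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes and limits; not Open MPI's real constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1, SKETCH_ERR_UNREACH = -2 };
#define SKETCH_MAX_PROCS 64

/* Hypothetical stand-in for a BTL module. */
typedef struct sketch_btl {
    const char *name;
    int (*add_procs)(struct sketch_btl *btl, size_t nprocs, bool *reachable);
    bool progress_registered;   /* still polled by the progress loop? */
    size_t active_peers;        /* peers this BTL can currently reach */
} sketch_btl_t;

/* Stub: reaches every peer. */
int sketch_add_procs_ok(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}

/* Stub: marks one peer, then fails, to model a partial result. */
int sketch_add_procs_fail(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    if (nprocs > 0) {
        reachable[0] = true;
    }
    return SKETCH_ERROR;
}

/* Try every BTL. A failure is handled as "no active peers" rather than
 * finalizing the module, so a later connect/spawn could still use it. */
int sketch_bml_add_procs(sketch_btl_t *btls, size_t nbtls,
                         size_t nprocs, bool *covered)
{
    bool reachable[SKETCH_MAX_PROCS];

    assert(btls != NULL && covered != NULL && nprocs <= SKETCH_MAX_PROCS);

    for (size_t i = 0; i < nprocs; i++) {
        covered[i] = false;
    }

    for (size_t b = 0; b < nbtls; b++) {
        sketch_btl_t *btl = &btls[b];

        for (size_t i = 0; i < nprocs; i++) {
            reachable[i] = false;
        }

        int rc = btl->add_procs(btl, nprocs, reachable);

        /* Keep the progress function registered in both cases. */
        btl->progress_registered = true;
        btl->active_peers = 0;

        if (rc != SKETCH_SUCCESS) {
            continue;   /* discard any partial reachability info */
        }

        for (size_t i = 0; i < nprocs; i++) {
            if (reachable[i]) {
                covered[i] = true;
                btl->active_peers++;
            }
        }
    }

    /* Only now report peers that no BTL could reach. */
    for (size_t i = 0; i < nprocs; i++) {
        if (!covered[i]) {
            return SKETCH_ERR_UNREACH;
        }
    }
    return SKETCH_SUCCESS;
}
```

With one failing and one working BTL, sketch_bml_add_procs returns SKETCH_SUCCESS and the failing module stays registered with zero active peers. With only the failing BTL, it returns SKETCH_ERR_UNREACH, which is the point where the caller would abort.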