Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] if btl->add_procs() fails...?
From: Brian Barrett (brbarret_at_[hidden])
Date: 2008-08-04 09:52:12

On Aug 4, 2008, at 9:40 AM, Jeff Squyres wrote:

> On Aug 2, 2008, at 2:34 PM, Brian Barrett wrote:
>>> I am curious how all of the above affects client/server or spawned
>>> jobs. If you finalize a BTL then do a connect to a process that
>>> would use that BTL would it reinitialize itself?
>> To deal with all the dynamics issues, I wouldn't finalize the BTL.
>> The BML should handle the progress stuff, just as if the add_procs
>> succeeded but returned no active peers. But I'd guess that's part
>> of the bit that doesn't work today. I would further suspect that a
>> BTL will need to have a working progress function in the face of
>> add_procs failures to cope with all the dynamics options. I'm
>> travelling this weekend, so I can't verify any of this at the moment.
> This seems a little different than the rest of the code base --
> we're talking about having the BTL return an error but have the
> upper level not treat it as a fatal error.
> I think we actually have a few different situations ("fail" means
> "not returning OMPI_SUCCESS"):
> 1. btl component init fails (only during MPI_INIT): the API supports
> no notion of failure -- it either returns modules or not (i.e.,
> valid pointers or NULL). If NULL is returned, the component is
> ignored and unloaded.
> 2. btl add_procs during MPI_INIT fails: this is under debate
> 3. btl add_procs during MPI-2 dynamics fails: this is under debate
> For #2 and #3, I suspect that only the BTL knows if it can continue
> or not. For example, a failure in #3 may cause the entire BTL to be
> hosed such that it can no longer communicate with procs that it
> previously successfully added (e.g., in MPI_INIT). So we really
> need add_procs to be able to return multiple things:
> A. OMPI_SUCCESS / all was good
> B. a non-fatal error occurred such that this BTL cannot communicate
> with the desired peers, but the upper level PML can continue
> C. a fatal error has occurred such that the upper level should abort
> (or, more specifically, do whatever the error manager says)
> I think that for B in both #2 and #3, we can just have the BTL set
> all the reachability bits to 0 and return OMPI_SUCCESS. But for C,
> the BTL should return != OMPI_SUCCESS. The PML should treat it as a
> fatal error and therefore call the error manager.
> I think that this is in line with Brian's original comments, right?

I suppose, but that's a pain when you just want to say "I don't
support calling add_procs a second time" :). But I'm not going to fix
all the BTLs to make that work right, so I suppose in the end I really
don't have a strong opinion.