Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] if btl->add_procs() fails...?
From: Brian Barrett (brbarret_at_[hidden])
Date: 2008-08-04 09:52:12


On Aug 4, 2008, at 9:40 AM, Jeff Squyres wrote:

> On Aug 2, 2008, at 2:34 PM, Brian Barrett wrote:
>
>>> I am curious how all of the above affects client/server or spawned
>>> jobs. If you finalize a BTL then do a connect to a process that
>>> would use that BTL, would it reinitialize itself?
>>
>> To deal with all the dynamics issues, I wouldn't finalize the BTL.
>> The BML should handle the progress stuff, just as if the add_procs
>> succeeded but returned no active peers. But I'd guess that's part
>> of the bit that doesn't work today. I would further suspect that a
>> BTL will need to have a working progress function in the face of
>> add_procs failures to cope with all the dynamics options. I'm
>> travelling this weekend, so I can't verify any of this at the moment.
>
>
> This seems a little different than the rest of the code base --
> we're talking about having the BTL return an error but have the
> upper level not treat it as a fatal error.
>
> I think we actually have a few different situations ("fail" means
> "not returning OMPI_SUCCESS"):
>
> 1. btl component init fails (only during MPI_INIT): the API supports
> no notion of failure -- it either returns modules or not (i.e.,
> valid pointers or NULL). If NULL is returned, the component is
> ignored and unloaded.
> 2. btl add_procs during MPI_INIT fails: this is under debate
> 3. btl add_procs during MPI-2 dynamics fails: this is under debate
>
> For #2 and #3, I suspect that only the BTL knows if it can continue
> or not. For example, a failure in #3 may cause the entire BTL to be
> hosed such that it can no longer communicate with procs that it
> previously successfully added (e.g., in MPI_INIT). So we really
> need add_procs to be able to return multiple things:
>
> A. OMPI_SUCCESS / all was good
> B. a non-fatal error occurred such that this BTL cannot communicate
> with the desired peers, but the upper level PML can continue
> C. a fatal error has occurred such that the upper level should abort
> (or, more specifically, do whatever the error manager says)
>
> I think that for B in both #2 and #3, we can just have the BTL set
> all the reachability bits to 0 and return OMPI_SUCCESS. But for C,
> the BTL should return != OMPI_SUCCESS. The PML should treat it as a
> fatal error and therefore call the error manager.
>
> I think that this is in-line with Brian's original comments, right?

I suppose, but that's a pain when you just want to say "I don't
support calling add_procs a second time" :). But I'm not going to fix
all the BTLs to make that work right, so I suppose in the end I really
don't have a strong opinion.

Brian
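
To make the return-code contract discussed in this thread a bit more concrete, here is a minimal sketch (in C) of how a BTL's add_procs() might express cases A, B, and C. The type and helper names below (proc_t, reach_bitmap_t, btl_try_connect(), fabric_usable) are simplified stand-ins invented for illustration, not the actual Open MPI BTL interface, which passes the BTL module, the proc array, an endpoint array, and a reachability bitmap.

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the real Open MPI types. */
#define OMPI_SUCCESS  0
#define OMPI_ERROR   (-1)

typedef struct { int id; }              proc_t;
typedef struct { bool *bits; }          reach_bitmap_t;
typedef struct { bool fabric_usable; }  btl_module_t;

/* Hypothetical helper: try to set up an endpoint to one peer.
 * Returns true on success, false if that peer is simply unreachable
 * through this BTL. */
static bool btl_try_connect(btl_module_t *btl, proc_t *proc)
{
    (void) btl;
    (void) proc;
    return false;   /* placeholder */
}

/* Sketch of add_procs() under the contract discussed above:
 *
 *   A. everything worked          -> set the reachable bits, OMPI_SUCCESS
 *   B. these peers unreachable,   -> leave their bits at 0,  OMPI_SUCCESS
 *      but the BTL is still usable
 *   C. the BTL is hosed           -> return != OMPI_SUCCESS; the PML
 *      treats it as fatal and calls the error manager
 */
static int sketch_add_procs(btl_module_t *btl, size_t nprocs,
                            proc_t **procs, reach_bitmap_t *reachable)
{
    if (!btl->fabric_usable) {
        return OMPI_ERROR;                  /* case C */
    }

    for (size_t i = 0; i < nprocs; ++i) {
        if (btl_try_connect(btl, procs[i])) {
            reachable->bits[i] = true;      /* case A for this peer */
        }
        /* case B: bit stays 0; the peer is left to other BTLs (or none) */
    }

    return OMPI_SUCCESS;
}

Under this scheme, a BTL that simply does not support being called a second time (the case Brian mentions) could take the case-B path on any call after the first: leave every reachability bit at 0 and return OMPI_SUCCESS, so the upper-level PML can continue.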