My thought is that if add_procs fails, then that BTL should be removed
(as if init failed) and things should continue on. If that BTL was
the only way to reach another process, we'll catch that later and abort.
There are always going to be errors that can't be detected until the
device is actually used, so I think that add_procs errors should be
treated the same as init errors. The error_cb is a red herring, as
that's supposed to be used in situations where an error can't directly
be returned to the upper layers (like the progress function). In this
case, we can directly return an error, so we should do so (and I
believe we do; it's the BML/PML that's the problem).
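
To make the proposal concrete, here is a rough sketch in C of the
behavior described above. It is only an illustration under stated
assumptions: the types and functions (sketch_btl_t,
sketch_add_procs_all, the two example callbacks) are hypothetical and
are not the real BML/BTL interfaces, and a plain 0 / -1 return stands
in for OMPI_SUCCESS and an error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 64

/* Hypothetical stand-in for a BTL module; NOT the real Open MPI struct. */
typedef struct sketch_btl {
    const char *name;
    /* Returns 0 on success and sets reachable[p] for each peer it can reach. */
    int (*add_procs)(struct sketch_btl *btl, size_t n_peers, bool *reachable);
} sketch_btl_t;

/* Example callback: succeeds and can reach every peer. */
int sketch_add_procs_ok(sketch_btl_t *btl, size_t n_peers, bool *reachable)
{
    (void)btl;
    for (size_t p = 0; p < n_peers; p++) {
        reachable[p] = true;
    }
    return 0;
}

/* Example callback: fails outright, e.g. because of a misconfiguration. */
int sketch_add_procs_fail(sketch_btl_t *btl, size_t n_peers, bool *reachable)
{
    (void)btl;
    (void)n_peers;
    (void)reachable;
    return -1;
}

/*
 * Call add_procs on every module. A module whose add_procs fails is dropped
 * from the list (compacted in place), as if its init had failed. Returns 0 if
 * every peer is still reachable through a surviving module, -1 otherwise.
 */
int sketch_add_procs_all(sketch_btl_t **btls, size_t *n_btls, size_t n_peers)
{
    bool covered[MAX_PEERS] = { false };
    size_t kept = 0;

    assert(n_peers <= MAX_PEERS);

    for (size_t i = 0; i < *n_btls; i++) {
        bool reachable[MAX_PEERS] = { false };

        if (btls[i]->add_procs(btls[i], n_peers, reachable) != 0) {
            continue;  /* failed: drop it so it never enters a progress loop */
        }
        for (size_t p = 0; p < n_peers; p++) {
            covered[p] = covered[p] || reachable[p];
        }
        btls[kept++] = btls[i];
    }
    *n_btls = kept;

    for (size_t p = 0; p < n_peers; p++) {
        if (!covered[p]) {
            return -1;  /* no remaining path to this peer: caller should abort */
        }
    }
    return 0;
}
```

The point of the sketch is that the failing module is removed before
anything else can call into it, and the "can we still reach everyone?"
check happens once, after all modules have had their chance.
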
On Aug 1, 2008, at 8:03 PM, Jeff Squyres wrote:
> I wasted a bunch of time today debugging a scenario where
> openib->add_procs() was (legitimately) failing during MPI_INIT.
> Specifically: an openib BTL module had successfully been
> initialized, but then was failing during add_procs(). I say
> "legitimately" failing because something external was causing
> add_procs to fail (i.e., a misconfiguration on my cluster). By
> "fail", I mean add_procs() returned != OMPI_SUCCESS.
> The problem is that OMPI does not handle this situation gracefully;
> every MPI process dumped core.
> My question is: what exactly should happen when BTL add_procs()
> fails? Is the BTL expected to recover? What if the BTL has no
> procs as a result of this failure; should the PML (or BML) remove it
> from progress loops? Or should the BTL be able to handle if
> progress is called on its component? (which seems kinda wasteful)
> Or should the failure of add_procs() be a fatal error? If so, what
> should the BTL do? The PML's error_cb has not yet been registered,
> and returning != OMPI_SUCCESS does not [currently] cause the PML to
> abort. This fact seems to indicate to me that the PML/BTL designers
> envisioned that the MPI process should be able to continue. But I'm
> not sure that I agree with that assessment: we have a successfully
> initialized BTL module, but an error occurred during add_procs().
> Shouldn't we gracefully abort?
> My $0.02:
> - if the BTL returns != OMPI_SUCCESS from add_procs(), the PML
> should gracefully abort.
> - if a BTL fails add_procs() in a non-fatal way, it can set all
> reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore
> effectively ignore it.
> Comments? I'd like to fix the openib btl's add_procs() one way or
> another for v1.3.
> Jeff Squyres
> Cisco Systems