On Fri, 28 May 2010, Jeff Squyres wrote:
> Herein lies the quandary: we don't/can't know the user or sysadmin
> intent. They may not care if the IB is borked -- they might just want
> the job to fall over to TCP and continue. But they may care a lot if IB
> is borked -- they might want the job to abort (because it would be too
> slow over TCP).
There is neither intent nor choice: Open MPI today always crashes on such an
error. The thing is, we crash at the wrong place, which is why I'd like it
to stop on the real error rather than trying to continue and hide the real
problem within a ton of error traces.
Frankly, I don't know how to be clearer. The discussion started with a bug
and you turned it into a nice-feature-we-would-like-to-have.
So please, fix the bug first; then, if you want that "automatic failover to
TCP" feature, develop it. Add a parameter for an optional notification, or
an abort, or whatever you want. But it doesn't exist today. It just doesn't
work, with any BTL. Errors reported by BTLs are all fatal.