On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote:
> Having errors on add_procs stop the application seems a good thing in all
> cases, so why not do it? That would solve my original problem, which led
> to this discussion.
>
> Yes, the openib BTL may be suboptimal (stopping the application instead of
> nicely returning), but I'm fine with that, so I'm not very inclined to
> spend time on this.
Herein lies the quandary: we can't know what the user or sysadmin intends. They may not care if IB is borked -- they might just want the job to fall over to TCP and continue. Or they may care a lot -- they might want the job to abort (because it would be too slow over TCP).
So I don't think it's a good idea to always abort if a single BTL is busted. The typical Open MPI Way is to introduce an MCA parameter that lets the user / sysadmin choose which behavior they want.
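To make the choice concrete, here is a rough sketch in C of the decision such a parameter would drive. It is not Open MPI code. The policy values ("fallback" / "abort"), the names parse_policy() and decide(), and the choice of "fallback" as the default are all hypothetical, made up purely for illustration. A real implementation would register the parameter through the MCA machinery and sit in the add_procs path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values; NOT an existing Open MPI parameter. */
typedef enum {
    BTL_FAIL_FALLBACK, /* drop the failed BTL, continue on the others */
    BTL_FAIL_ABORT     /* abort the job if any BTL fails in add_procs */
} btl_fail_policy_t;

typedef enum { ACTION_CONTINUE, ACTION_ABORT } action_t;

/* Map the parameter string to a policy. Unknown or missing values
 * fall back to BTL_FAIL_FALLBACK (an arbitrary default for this sketch). */
btl_fail_policy_t parse_policy(const char *value)
{
    if (value != NULL && strcmp(value, "abort") == 0) {
        return BTL_FAIL_ABORT;
    }
    return BTL_FAIL_FALLBACK;
}

/* Decide what to do once every BTL has reported its add_procs status.
 * btl_ok[i] is true if BTL i succeeded. */
action_t decide(btl_fail_policy_t policy, const bool *btl_ok, size_t nbtls)
{
    size_t usable = 0;
    for (size_t i = 0; i < nbtls; i++) {
        if (btl_ok[i]) {
            usable++;
        }
    }
    if (usable == 0) {
        return ACTION_ABORT; /* nothing left to fall back to */
    }
    if (usable < nbtls && policy == BTL_FAIL_ABORT) {
        return ACTION_ABORT; /* the user asked to abort on any failure */
    }
    return ACTION_CONTINUE;  /* all fine, or the user accepts the fallback */
}
```

With btl_ok = { false, true } (say, openib failed and TCP is fine), decide() returns ACTION_ABORT under "abort" and ACTION_CONTINUE under "fallback". That is exactly the split between the two kinds of users described above.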
--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/