Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BTL add procs errors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-28 06:16:07


On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote:

> Having errors on add_procs stop the application seems like a good thing in all
> cases, so why not do it? That would solve my original problem, which led
> to this discussion.
>
> Yes, the openib BTL may be suboptimal (stopping the application instead of
> nicely returning), but I'm fine with that, so I'm not very inclined to
> spend time on this.

Herein lies the quandary: we don't/can't know the user's or sysadmin's intent. They may not care if the IB is borked -- they might just want the job to fail over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to abort (because it would be too slow over TCP).

So I don't think it's a good idea to always abort if a single BTL is busted. The typical Open MPI Way is to introduce an MCA parameter that lets the user / sysadmin choose which behavior they want.
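
To make the choice concrete, here is a rough sketch of the two behaviors, written in Python for brevity rather than in Open MPI's C. Everything in it is hypothetical: the `FailurePolicy` values, the `select_btls` helper, and the idea of passing per-BTL add_procs results as a dictionary are illustrations only, not an existing MCA parameter or Open MPI API.

```python
from enum import Enum


class FailurePolicy(Enum):
    """Hypothetical values for an illustrative MCA-style parameter."""
    CONTINUE = "continue"  # drop the broken BTL and fail over to the rest
    ABORT = "abort"        # treat any BTL add_procs failure as fatal


def select_btls(add_procs_ok, policy):
    """Return the names of the BTLs to keep using.

    add_procs_ok maps a BTL name to True if its add_procs succeeded,
    False otherwise. Raises RuntimeError when the job should abort.
    """
    failed = [name for name, ok in add_procs_ok.items() if not ok]
    usable = [name for name, ok in add_procs_ok.items() if ok]

    if failed and policy is FailurePolicy.ABORT:
        raise RuntimeError("aborting: add_procs failed for " + ", ".join(failed))
    if not usable:
        # Nothing left to fail over to, so abort regardless of policy.
        raise RuntimeError("aborting: no usable BTL remains")
    return usable


if __name__ == "__main__":
    results = {"openib": False, "tcp": True}
    # With the "continue" policy the job keeps going over TCP.
    print(select_btls(results, FailurePolicy.CONTINUE))
```

With the same inputs and `FailurePolicy.ABORT`, the helper raises instead of returning, which corresponds to the "IB is busted, so stop the job" case.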

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/