
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] BTL add procs errors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-27 15:19:25


On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:

> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
>
> Now, if you want to fix the openib BTL so that an error in IB results in
> an elegant fallback on TCP (elegant = notified ;-)), then hooray.

You're specifically referring to the point where the openib btl sets the reachable bit, but then later decides "oops, an error occurred, so return !=OMPI_SUCCESS" -- and then assumes that it will not be called again.

Right?

If so, then yes, that's a bug. The openib btl should be fixed to unset the reachable bit(s) that it just set before returning the error.

Or the BML could assume that a !=OMPI_SUCCESS code means that the reachable bits it got back are invalid and should be ignored.

I'd lean towards the former.
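
To make the former concrete, here is a rough sketch of the rollback idea. It is not the openib code: every name in it (sketch_bitmap_t, sketch_setup_peer, sketch_add_procs, SKETCH_SUCCESS, SKETCH_ERROR) is a made-up stand-in, and the bitmap is a plain array rather than the real Open MPI bitmap type. It only shows the shape of "remember which bits this call set, and clear exactly those before returning an error."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI definitions. */
#define SKETCH_SUCCESS   0
#define SKETCH_ERROR     (-1)
#define SKETCH_MAX_PROCS 64

/* Toy reachability bitmap: one flag per peer process. */
typedef struct {
    bool bit[SKETCH_MAX_PROCS];
} sketch_bitmap_t;

/* Hypothetical per-peer setup. It fails for the peer whose index equals
 * fail_at; pass -1 to make every peer succeed. */
static int sketch_setup_peer(size_t peer, int fail_at)
{
    return ((int)peer == fail_at) ? SKETCH_ERROR : SKETCH_SUCCESS;
}

/* add_procs-style routine (assumed shape). It sets a reachable bit for each
 * peer it sets up. If a later peer fails, it clears only the bits that this
 * call set, then returns the error, so the caller never sees stale bits. */
static int sketch_add_procs(size_t nprocs, sketch_bitmap_t *reachable,
                            int fail_at)
{
    bool set_here[SKETCH_MAX_PROCS] = { false };

    assert(nprocs <= SKETCH_MAX_PROCS);

    for (size_t i = 0; i < nprocs; ++i) {
        if (sketch_setup_peer(i, fail_at) != SKETCH_SUCCESS) {
            /* Roll back: unset the bits we just set, leave others alone. */
            for (size_t j = 0; j < i; ++j) {
                if (set_here[j]) {
                    reachable->bit[j] = false;
                }
            }
            return SKETCH_ERROR;
        }
        if (!reachable->bit[i]) {
            reachable->bit[i] = true;
            set_here[i] = true;
        }
    }
    return SKETCH_SUCCESS;
}
```

With this shape, a caller that gets back an error sees the bitmap exactly as it was before the call, so it does not matter whether the caller inspects the bits or not. The BML-side alternative would instead discard the bitmap whenever the return code is not success, which also works but spreads the cleanup responsibility across two components.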

Can you file a bug and submit a patch?

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/