On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
>
> Now, if you want to fix the openib BTL so that an error in IB results in
> an elegant fallback on TCP (elegant = notified ;-)), then hooray.
You're specifically referring to the point where the openib btl sets the reachable bit, but then later decides "oops, an error occurred, so return !=OMPI_SUCCESS" -- and you assume that the openib btl is not called again after that.
Right?
If so, then yes, that's a bug. The openib btl should be fixed to unset the reachable bit(s) that it just set before returning the error.
Or the BML could assume that a !=OMPI_SUCCESS return code means that the reachable bits it got back are invalid and should be ignored.
I'd lean towards the former.
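To make the first option concrete, here is a rough sketch of the rollback idea in plain C. It is not the actual openib code: sketch_reachable_t, sketch_add_procs, the SKETCH_* constants and the fail_at parameter are all made up for illustration, and the real BTL uses its own bitmap type and error codes. The only point is that any bit set during the call gets cleared again before a non-success code is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they are not Open MPI identifiers. */
#define SKETCH_SUCCESS    0
#define SKETCH_ERROR     -1
#define SKETCH_MAX_PEERS 64

/* Stand-in for the reachable bitmap that the BML hands to a BTL. */
typedef struct {
    bool bit[SKETCH_MAX_PEERS];
} sketch_reachable_t;

/*
 * Marks peers 0..npeers-1 as reachable, one at a time.
 * fail_at simulates a late error while setting up peer fail_at;
 * pass a value >= npeers to simulate a run with no error.
 */
int sketch_add_procs(size_t npeers, size_t fail_at,
                     sketch_reachable_t *reachable)
{
    /* Remember which bits this call set, so only those are undone. */
    bool set_here[SKETCH_MAX_PEERS] = { false };

    assert(npeers <= SKETCH_MAX_PEERS);

    for (size_t i = 0; i < npeers; ++i) {
        if (i == fail_at) {
            /* Error path: unset the bits set earlier in this call. */
            for (size_t j = 0; j < i; ++j) {
                if (set_here[j]) {
                    reachable->bit[j] = false;
                }
            }
            return SKETCH_ERROR;
        }
        if (!reachable->bit[i]) {
            reachable->bit[i] = true;
            set_here[i] = true;
        }
    }
    return SKETCH_SUCCESS;
}
```

With this shape, the caller never sees a bitmap that claims reachability alongside an error code, so it does not need any special handling for the failure case.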
Can you file a bug and submit a patch?
--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/