On Thu, 27 May 2010, Jeff Squyres wrote:
> On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
>> That's pretty much my first proposal: abort when an error arises,
>> because if we don't, we'll crash soon afterwards. That's my original
>> concern and this should really be fixed.
>> Now, if you want to fix the openib BTL so that an error in IB results in
>> an elegant fallback on TCP (elegant = notified ;-)), then hooray.
> You're specifically referring to the point where the openib btl sets the
> reachable bit, but then later decides "oops, an error occurred, so
> return !=OMPI_SUCCESS" -- and assume that the openib btl is not called
> If so, then yes, that's a bug. The openib btl should be fixed to unset
> the reachable bit(s) that it just set before returning the error.
> Or the BML could assume that !=OMPI_SUCCESS codes mean that the
> reachable bits it got back were invalid and should be ignored.
> I'd lean towards the former.
> Can you file a bug and submit a patch?
I'd like to (though I don't have an svn account), but some things
Having errors on add_procs stop the application seems like a good thing in
all cases, so why not do it? That would solve my original problem, which
led to this discussion.
Yes, the openib BTL may be suboptimal (stopping the application instead of
nicely returning), but I'm fine with that, so I'm not very inclined to
spend time on this.
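
For reference, here is a rough sketch of the two fixes Jeff describes above, written as a toy model rather than against the real code. Everything in it is hypothetical: the toy_* names, the TOY_SUCCESS/TOY_ERROR codes, and the 32-bit mask standing in for the reachable bitmap are assumptions for illustration and are not the actual openib BTL or BML interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real Open MPI return codes. */
#define TOY_SUCCESS 0
#define TOY_ERROR  (-1)

/*
 * Toy add_procs: marks procs 0..nprocs-1 reachable one at a time and
 * fails when it reaches index fail_at (pass fail_at >= nprocs for no
 * failure). With rollback != 0 it clears the bits set by this call
 * before returning the error (first option in the thread).
 */
int toy_btl_add_procs(size_t nprocs, size_t fail_at, int rollback,
                      uint32_t *reachable)
{
    uint32_t set_here = 0;   /* bits this call turned on */

    assert(nprocs <= 32);    /* the toy bitmap only holds 32 procs */

    for (size_t i = 0; i < nprocs; i++) {
        if (i == fail_at) {
            if (rollback) {
                *reachable &= ~set_here;   /* undo our own bits only */
            }
            return TOY_ERROR;
        }
        uint32_t bit = (uint32_t)1 << i;
        if (!(*reachable & bit)) {
            *reachable |= bit;
            set_here |= bit;
        }
    }
    return TOY_SUCCESS;
}

/*
 * Toy caller (second option in the thread): whatever the BTL left in
 * the bitmap, treat it as invalid when the return code is an error.
 */
uint32_t toy_bml_query(size_t nprocs, size_t fail_at, int rollback)
{
    uint32_t reachable = 0;
    int rc = toy_btl_add_procs(nprocs, fail_at, rollback, &reachable);

    if (rc != TOY_SUCCESS) {
        return 0;            /* ignore possibly stale bits */
    }
    return reachable;
}
```

Without the rollback, a failure at the third proc leaves the first two bits set even though the call returned an error, which is the stale state that causes the later crash. Either fix removes it; aborting on any add_procs error, as I suggest above, sidesteps the question entirely.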