Yes, this is a problem, but not quite in the way you describe (I think
a hodgepodge of BTLs for final connectivity is fine).
I ran into similar issues a while ago, where the openib BTL opens
properly but then fails in add_procs() for some reason. Check out these
tickets -- 1434 points to some discussion on the list about what to do:
I'm not sure I'm convinced by the argument in 1436 anymore, though...
On Dec 11, 2008, at 9:35 PM, Eugene Loh wrote:
> I'm not exactly sure where the fix to this should be, but I think
> I've found a problem.
> Consider, for illustration, launching a multi-process job on a
> single node. The function
> Each process could conceivably return a different value --
> OMPI_SUCCESS or otherwise. E.g., if there isn't enough room for all
> processes to allocate the shared memory they need, early processes
> might succeed in their allocations while laggards won't.
> The fact that some processes fail doesn't bother the BML. It just
> loops over other BTLs and, quite possibly, finds another BTL to make
> needed connections.
> Is this a problem? It seems to me to be, but I haven't yet figured
> out what the BML does next. I'm guessing it ends up with a
> hodgepodge of BTLs. E.g., A talks to B via sm, but B talks to A via
> tcp. And, I'm still guessing, this produces badness (like hangs).
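
To make the quoted scenario concrete, here is a toy sketch in Python of
how independent per-process outcomes can leave the two ends of a pair on
different BTLs. It is not Open MPI code: select_btl, asymmetric_pairs,
BTL_PRIORITY and the outcomes table are all made up for illustration,
and it simplifies by assuming each process picks a single BTL for all of
its peers.

```python
# Toy model only: none of these names are Open MPI APIs.
BTL_PRIORITY = ["sm", "tcp"]  # assumed order in which BTLs are tried


def select_btl(add_procs_ok):
    """Return the first BTL whose (simulated) add_procs succeeded.

    add_procs_ok maps a BTL name to True/False for one process.
    """
    for btl in BTL_PRIORITY:
        if add_procs_ok.get(btl, False):
            return btl
    return None  # no usable BTL at all


def asymmetric_pairs(choices):
    """List (p, q, btl_p, btl_q) for each pair whose ends chose differently."""
    procs = sorted(choices)
    return [
        (p, q, choices[p], choices[q])
        for i, p in enumerate(procs)
        for q in procs[i + 1:]
        if choices[p] != choices[q]
    ]


# Hypothetical outcome: A's shared-memory setup succeeds, while B (the
# laggard) fails in sm and falls through to tcp.
outcomes = {
    "A": {"sm": True, "tcp": True},
    "B": {"sm": False, "tcp": True},
}
choices = {proc: select_btl(ok) for proc, ok in outcomes.items()}
mismatches = asymmetric_pairs(choices)
print(choices)
print(mismatches)
```

Because each process decides using only its own local result, nothing in
this model notices the mismatch; spotting it takes a comparison across
processes, which is what asymmetric_pairs does here.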