As you noticed, add_procs only adds processes to the list of available
processes without trying to set up any connections to them. As a result,
when we return from add_procs it is very unlikely that we will be able
to accurately detect any connection problems.
The connections are established later on, when the first message is
sent to a peer. If this message cannot use a BTL, it will try the next
one that reported a possible connection to the peer, until no more BTLs
are available for the peer. Then, finally, we give up and return a
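
To make the fallback order concrete, here is a rough sketch in C of the
lazy behavior described above. It is only an illustration, not the actual
BML code: toy_btl_t, toy_first_send, and the reachable/connect_ok fields
are all made-up names, and the real structures carry far more state.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BTL; NOT Open MPI's real data structure. */
typedef struct {
    const char *name;
    int reachable;   /* reported at add_procs time: a connection may be possible */
    int connect_ok;  /* only discovered when the first message is actually sent */
} toy_btl_t;

/*
 * Walk the candidate BTLs in order on the first send to a peer.
 * Returns the index of the first BTL that is both reported reachable
 * and actually connects, or -1 once every candidate has been tried.
 */
int toy_first_send(const toy_btl_t *btls, size_t nbtls)
{
    for (size_t i = 0; i < nbtls; i++) {
        if (!btls[i].reachable) {
            continue;           /* never claimed it could reach the peer */
        }
        if (btls[i].connect_ok) {
            return (int)i;      /* connection established, message goes out */
        }
        /* connection failed: fall through to the next candidate */
    }
    return -1;                  /* no BTL left for this peer: give up */
}
```

The point of the sketch is that the failure is only visible inside the
send path, well after add_procs has returned.
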
On Dec 11, 2008, at 21:35 , Eugene Loh wrote:
> I'm not exactly sure where the fix to this should be, but I think
> I've found a problem.
> Consider, for illustration, launching a multi-process job on a
> single node. The function
> Each process could conceivably return a different value --
> OMPI_SUCCESS or otherwise. E.g., if there isn't enough room for all
> to allocate all the shared memory they need, early processes might
> succeed in their allocations while laggards won't.
> The fact that some processes fail doesn't bother the BML. It just
> loops over other BTLs and, quite possibly, finds another BTL to make
> needed connections.
> Is this a problem? It seems to me to be, but I haven't yet figured
> out what the BML does next. I'm guessing it ends up with a
> hodgepodge of BTLs. E.g., A talks to B via sm, but B talks to A via
> tcp. And, I'm still guessing, this produces badness (like hangs).