
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] BML problem?
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-12-12 17:23:53


As you noticed, add_procs only adds processes to the list of available
processes without trying to set up any connections to them. As a result,
when we return from add_procs it is very unlikely that we will be able
to accurately detect any connection problems.

The connections are established later on, when the first message is
sent to a peer. If this message cannot use a BTL, it will try the next
one that reported a possible connection to the peer, until no more BTLs
are available for the peer. Only then do we give up and return a
fatal error.
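
To make the two phases above concrete, here is a minimal sketch in C. It
is a toy model only, not the real BML/BTL code: toy_btl_t, toy_peer_t,
toy_add_proc and toy_first_send are hypothetical names invented for
illustration, and the connect_works flag simply simulates whether a lazy
connection attempt would succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BTLS 4

/* Hypothetical stand-in for a BTL module (not the real Open MPI type). */
typedef struct {
    const char *name;          /* e.g. "sm" or "tcp" */
    bool        connect_works; /* simulated result of the lazy connect */
} toy_btl_t;

/* Per-peer list of BTLs that reported a possible connection. */
typedef struct {
    toy_btl_t btls[TOY_MAX_BTLS];
    size_t    num_btls;
} toy_peer_t;

/* Phase 1, modeled on the add_procs behavior described above: only
 * record the BTL as a candidate for this peer. No connection is
 * attempted, so a connection problem cannot be detected here.
 * Returns 0 on success, -1 if the candidate list is full. */
int toy_add_proc(toy_peer_t *peer, const char *name, bool connect_works)
{
    assert(peer != NULL);
    if (peer->num_btls >= TOY_MAX_BTLS) {
        return -1;
    }
    peer->btls[peer->num_btls].name = name;
    peer->btls[peer->num_btls].connect_works = connect_works;
    peer->num_btls++;
    return 0;
}

/* Phase 2: the first send tries each candidate in order and falls
 * through to the next one when a connection cannot be made.
 * Returns the index of the BTL that carried the message, or -1 when
 * every candidate failed (the caller would treat that as fatal). */
int toy_first_send(const toy_peer_t *peer)
{
    assert(peer != NULL);
    for (size_t i = 0; i < peer->num_btls; i++) {
        if (peer->btls[i].connect_works) {
            return (int)i;
        }
    }
    return -1;
}
```

In this model, registering "sm" with connect_works set to false and then
"tcp" with it set to true makes both toy_add_proc calls return 0, and
the failure only shows up at toy_first_send, which skips index 0 and
returns 1.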


On Dec 11, 2008, at 21:35 , Eugene Loh wrote:

> I'm not exactly sure where the fix to this should be, but I think
> I've found a problem.
> Consider, for illustration, launching a multi-process job on a
> single node. The function
> mca_bml_r2_add_procs()
> calls
> mca_btl_sm_add_procs()
> Each process could conceivably return a different value --
> OMPI_SUCCESS or otherwise. E.g., if there isn't enough room for all
> of them to allocate the shared memory they need, early processes
> might succeed in their allocations while laggards won't.
> The fact that some processes fail doesn't bother the BML. It just
> loops over other BTLs and, quite possibly, finds another BTL to make
> needed connections.
> Is this a problem? It seems to me to be, but I haven't yet figured
> out what the BML does next. I'm guessing it ends up with a
> hodgepodge of BTLs. E.g., A talks to B via sm, but B talks to A via
> tcp. And, I'm still guessing, this produces badness (like hangs).
> Comments?