Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BML problem?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-12-12 08:59:22


Yes, this is a problem, but not quite in the way you describe (I think
a hodgepodge of BTLs for final connectivity is fine).

I found similar issues a while ago when the openib BTL opens properly
but then fails in add_procs() for some reason. Check out these
tickets -- 1434 points to some discussion on the list about what to do:

     https://svn.open-mpi.org/trac/ompi/ticket/1434
and
     https://svn.open-mpi.org/trac/ompi/ticket/1436

I'm not sure I'm convinced by the argument in 1436 anymore, though...

On Dec 11, 2008, at 9:35 PM, Eugene Loh wrote:

> I'm not exactly sure where the fix to this should be, but I think
> I've found a problem.
>
> Consider, for illustration, launching a multi-process job on a
> single node. The function
>
> mca_bml_r2_add_procs()
>
> calls
>
> mca_btl_sm_add_procs()
>
> Each process could conceivably return a different value --
> OMPI_SUCCESS or otherwise. E.g., if there isn't enough room for
> every process to allocate all the shared memory it needs, early
> processes might succeed in their allocations while laggards won't.
>
> The fact that some processes fail doesn't bother the BML. It just
> loops over other BTLs and, quite possibly, finds another BTL to make
> needed connections.
>
> Is this a problem? It seems to me to be, but I haven't yet figured
> out what the BML does next. I'm guessing it ends up with a
> hodgepodge of BTLs. E.g., A talks to B via sm, but B talks to A via
> tcp. And, I'm still guessing, this produces badness (like hangs).
>
> Comments?
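
To make the scenario in the quoted message concrete, here is a toy model of it in C. It is only a sketch of the guess described above, not the actual BML or BTL code: every name in it (toy_sm_add_procs, toy_select_transport, toy_count_mismatched_pairs, the sm_slots parameter) is hypothetical, and it assumes the simplest possible failure mode, namely that the first sm_slots ranks get their shared memory and the rest do not. Each process then picks its transport purely from its own local result, which is how two peers could end up disagreeing.

```c
#include <stdbool.h>

/* Hypothetical transport labels; these are not Open MPI identifiers. */
typedef enum { TRANSPORT_SM, TRANSPORT_TCP } transport_t;

/*
 * Toy stand-in for a shared-memory add_procs() call.
 * Assumption: ranks arrive in rank order and only the first
 * `sm_slots` of them manage to allocate what they need.
 */
static bool toy_sm_add_procs(int rank, int sm_slots)
{
    return rank < sm_slots;
}

/*
 * Each process decides on its own, with no agreement step:
 * use sm if its local setup succeeded, otherwise fall back to tcp.
 */
static transport_t toy_select_transport(int rank, int sm_slots)
{
    return toy_sm_add_procs(rank, sm_slots) ? TRANSPORT_SM : TRANSPORT_TCP;
}

/*
 * Count unordered pairs (a, b) whose two endpoints picked different
 * transports, i.e. a would reach b via sm while b would reach a via tcp.
 */
static int toy_count_mismatched_pairs(int nprocs, int sm_slots)
{
    int mismatched = 0;
    for (int a = 0; a < nprocs; ++a) {
        for (int b = a + 1; b < nprocs; ++b) {
            if (toy_select_transport(a, sm_slots) !=
                toy_select_transport(b, sm_slots)) {
                ++mismatched;
            }
        }
    }
    return mismatched;
}
```

With 4 processes and room for only 2, ranks 0 and 1 pick sm and ranks 2 and 3 pick tcp, so toy_count_mismatched_pairs(4, 2) reports 4 mismatched pairs. When everyone succeeds or everyone fails, the count is 0, which is the case where purely local decisions happen to agree.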

-- 
Jeff Squyres
Cisco Systems