Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Non-uniform BTL problems in: openib, tcp, sctp, portals4, vader, scif
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-05-14 09:51:05


Good catch. I fixed the TCP BTL (r31753). It is the only BTL I can
test so that's the most I can do here.

However, I never get OPAL_ERR_DATA_VALUE_NOT_FOUND out of the modex
call when the key doesn't exist. I looked in dstore, and the value one
should check for is OPAL_ERR_NOT_FOUND. You might want to revise the
check in the usnic BTL.

  George.

PS: There is an easy way to test this particular case using the MPMD
capabilities of mpiexec. For example, for a quick NetPIPE run between
two processes, one supporting SM and TCP and the other supporting only
SM (I ignored self here), you can do:

mpirun -np 1 --mca btl tcp,sm,self ./NPmpi -l 5 -u 5 : \
       -np 1 --mca btl sm,self ./NPmpi -l 5 -u 5

On Tue, May 13, 2014 at 2:09 PM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]> wrote:
> I notice that BTLs are not checking the return value from ompi_modex_recv() for OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that modex key). In the BTL context, NOT_FOUND means that that peer process doesn't have this BTL, so this local peer process should probably mark it as unreachable in add_procs().
>
> This is on both trunk and the v1.8 branch.
>
> The BTLs listed above are not checking/handling ompi_modex_recv() returning OPAL_ERR_DATA_VALUE_NOT_FOUND properly. Most of these BTLs do something like this:
>
> -----
> module_add_procs() {
>     loop over the peers {
>         proc = proc_create(...)
>         if (NULL == proc)
>             error!
>         ....
>     }
> }
>
> proc_create(...) {
>     if (ompi_modex_recv() != OMPI_SUCCESS)
>         return NULL;
>     ...
> }
> -----
>
> The fix is to make proc_create() return something a bit more expressive so that add_procs() can tell the difference between "error!" and "you can't reach this peer".
>
> I fixed this in the usnic BTL back in late March, but forgot to bring this to everyone's attention -- oops. See https://svn.open-mpi.org/trac/ompi/ticket/4442
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14783.php
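
The fix described in the quoted message (having proc_create() report more than NULL/non-NULL so that add_procs() can tell "error!" apart from "you can't reach this peer") could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code from any BTL: every demo_* name and status value is hypothetical, and demo_modex_recv() is a stub standing in for the real modex lookup, which the thread says returns OPAL_ERR_NOT_FOUND when the peer did not publish the key.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, standing in for OMPI_SUCCESS,
 * OPAL_ERR_NOT_FOUND and a generic failure. Not the real values. */
enum demo_status {
    DEMO_SUCCESS       =  0,
    DEMO_ERR_NOT_FOUND = -1,
    DEMO_ERROR         = -2
};

/* Hypothetical per-peer bookkeeping structure. */
struct demo_proc {
    int peer;
};

/* Stub for the modex lookup (an assumption for this sketch):
 * negative ranks simulate a real failure, odd ranks simulate a
 * peer that never published a key for this BTL. */
int demo_modex_recv(int peer)
{
    if (peer < 0) {
        return DEMO_ERROR;
    }
    return (peer % 2 == 0) ? DEMO_SUCCESS : DEMO_ERR_NOT_FOUND;
}

/* Return a status code instead of a bare pointer, so the caller can
 * distinguish "peer lacks this BTL" from "something went wrong". */
int demo_proc_create(int peer, struct demo_proc **out)
{
    struct demo_proc *p;
    int rc;

    *out = NULL;
    rc = demo_modex_recv(peer);
    if (rc != DEMO_SUCCESS) {
        return rc;              /* pass NOT_FOUND or ERROR through */
    }

    p = malloc(sizeof(*p));
    if (p == NULL) {
        return DEMO_ERROR;
    }
    p->peer = peer;
    *out = p;
    return DEMO_SUCCESS;
}

/* Mark peers without the key as unreachable and keep going; abort
 * only on a real error. Freeing procs[] is left to the caller. */
int demo_add_procs(const int *peers, size_t n,
                   struct demo_proc **procs, int *reachable)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int rc = demo_proc_create(peers[i], &procs[i]);
        if (rc == DEMO_ERR_NOT_FOUND) {
            reachable[i] = 0;   /* peer does not have this BTL */
            continue;
        }
        if (rc != DEMO_SUCCESS) {
            return rc;          /* genuine failure */
        }
        reachable[i] = 1;
    }
    return DEMO_SUCCESS;
}
```

With peers {0, 1, 2}, this stub would leave peer 1 marked unreachable while demo_add_procs() still succeeds; a negative rank makes it return the error instead. Returning the status and passing the proc back through an out-parameter is one possible design; returning a distinguished sentinel pointer would be another.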