
Open MPI Development Mailing List Archives


Subject: [OMPI devel] Non-uniform BTL problems in: openib, tcp, sctp, portals4, vader, scif
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-05-13 14:09:45


I notice that BTLs are not checking the return value from ompi_modex_recv() for OPAL_ERR_DATA_VALUE_NOT_FOUND (which indicates that the peer process didn't publish that modex key). In the BTL context, NOT_FOUND means that the peer process doesn't have this BTL, so the local process should probably mark that peer as unreachable in add_procs().

This is on both trunk and the v1.8 branch.

The BTLs listed in the subject line do not handle ompi_modex_recv() returning OPAL_ERR_DATA_VALUE_NOT_FOUND properly. Most of them do something like this:

-----
module_add_procs() {
  loop over the peers {
    proc = proc_create(...)
    if (NULL == proc)
      error!
    ....
  }
}

proc_create(...) {
  /* any non-success, including OPAL_ERR_DATA_VALUE_NOT_FOUND, collapses to NULL */
  if (ompi_modex_recv() != OMPI_SUCCESS)
     return NULL;
  ...
}
-----

The fix is to make proc_create() return something a bit more expressive so that add_procs() can tell the difference between "error!" and "you can't reach this peer".
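
As a rough sketch of that idea (this is not the actual usnic change, and it does not use the real BTL types): every sketch_* name, the status values, and the plain bool array standing in for the reachability information are made up for illustration. The only point is that proc_create() hands back a status code, so add_procs() can skip a peer that never published the key while still aborting on a genuine error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in status codes for this sketch only; the real OMPI_/OPAL_ values differ. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR = -1,
    SKETCH_ERR_NOT_FOUND = -2  /* plays the role of OPAL_ERR_DATA_VALUE_NOT_FOUND */
};

/* Hypothetical modex lookup: published[peer] is false when that peer
 * never put this BTL's key into the modex. */
static int sketch_modex_recv(const bool *published, size_t peer)
{
    return published[peer] ? SKETCH_SUCCESS : SKETCH_ERR_NOT_FOUND;
}

/* Returns a status instead of a bare pointer, so "not found" stays
 * distinguishable from a real failure. */
static int sketch_proc_create(const bool *published, size_t peer, int *proc_out)
{
    int rc = sketch_modex_recv(published, peer);
    if (SKETCH_SUCCESS != rc) {
        return rc;               /* pass NOT_FOUND (or an error) through unchanged */
    }
    *proc_out = (int) peer;      /* placeholder for real per-peer state */
    return SKETCH_SUCCESS;
}

/* Marks each peer reachable or not; only a genuine error aborts the loop. */
static int sketch_add_procs(const bool *published, size_t nprocs, bool *reachable)
{
    assert(NULL != published && NULL != reachable);
    for (size_t i = 0; i < nprocs; ++i) {
        int proc = -1;
        int rc = sketch_proc_create(published, i, &proc);
        if (SKETCH_ERR_NOT_FOUND == rc) {
            reachable[i] = false;  /* peer lacks this BTL: unreachable, not an error */
            continue;
        }
        if (SKETCH_SUCCESS != rc) {
            return rc;             /* real failure: propagate to the caller */
        }
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}
```

With three peers where only the middle one did not publish the key, sketch_add_procs() returns success and leaves just that peer marked unreachable.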

I fixed this in the usnic BTL back in late March, but forgot to bring this to everyone's attention -- oops. See https://svn.open-mpi.org/trac/ompi/ticket/4442

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/