Open MPI Development Mailing List Archives

Subject: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
From: Gilles Gouaillardet (gilles.gouaillardet_at_[hidden])
Date: 2014-05-13 22:18:05


Folks,

i would like to comment on r31738 :

> There is no reason to cancel the listening thread. It should die
> automatically when the file descriptor is closed.
i could not agree more
> It is sufficient to just wait for the thread to exit with pthread join.
unfortunately, at least in my test environment (an outdated MPSS 2.1), it
is *not* :-(

this is what i described in #4615
https://svn.open-mpi.org/trac/ompi/ticket/4615
to which i attached scif_hang.c. it shows that (at least in my
environment)
scif_poll(...) does *not* return after scif_close(...) is called, and
hence the scif pthread never ends.

this is likely a bug in MPSS and it might have been fixed in a later
release.

Nathan, could you try scif_hang in your environment and report the MPSS
version you are running ?

bottom line, and once again, in my test environment, pthread_join (...)
without pthread_cancel(...)
might cause a hang when the btl/scif module is released.
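
for illustration only, here is a minimal sketch of a join that falls back
to pthread_cancel(...) when the thread does not exit by itself. it is *not*
the btl/scif code: the listener is a stand-in that blocks in plain POSIX
poll(...) forever (mimicking scif_poll(...) not returning), the names
listener, join_or_cancel and demo_stuck_listener are hypothetical, and
pthread_timedjoin_np(...) is a glibc extension, so this assumes Linux.

```c
#define _GNU_SOURCE            /* for pthread_timedjoin_np (glibc extension) */
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <time.h>

/* Stand-in for the listening thread: poll() with no descriptors and an
 * infinite timeout never returns on its own, like the hang described above. */
static void *listener(void *arg)
{
    (void)arg;
    poll(NULL, 0, -1);         /* poll() is a POSIX cancellation point */
    return NULL;
}

/* Wait up to wait_ms for the thread to exit; if it does not, cancel it and
 * join again. Returns 0 if the thread exited by itself, 1 if it had to be
 * cancelled. */
static int join_or_cancel(pthread_t tid, long wait_ms)
{
    struct timespec deadline;
    void *ret = NULL;

    /* pthread_timedjoin_np takes an absolute CLOCK_REALTIME deadline */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += wait_ms / 1000;
    deadline.tv_nsec += (wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    if (pthread_timedjoin_np(tid, &ret, &deadline) == 0)
        return 0;              /* thread ended without help */

    pthread_cancel(tid);       /* acted upon inside the blocked poll() */
    pthread_join(tid, &ret);
    return ret == PTHREAD_CANCELED;
}

/* Start the stuck listener and release it with a 200 ms grace period. */
static int demo_stuck_listener(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, listener, NULL);

    assert(rc == 0);
    return join_or_cancel(tid, 200);
}
```

calling demo_stuck_listener() exercises the fallback path: the plain join
times out, and only the cancel lets the second join complete. whether
scif_poll(...) itself behaves as a cancellation point is not something this
sketch can show.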

assuming the bug is in old MPSS and has been fixed in recent releases,
what is the OpenMPI policy ?
a) test the MPSS version and call pthread_cancel() or do *not* call
pthread_join if buggy MPSS is detected ?
b) display an error/warning if a buggy MPSS is detected ?
c) do not call pthread_join at all ? /* SIGSEGV might occur with older
MPSS, it is in MPI_Finalize() so impact is limited */
d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
problem after all ?
e) something else ?
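
as one possible reading of e), and purely as an assumption (neither r31738
nor #4615 proposes it), the listening thread could poll with a finite
timeout and check a stop flag, so that the join does not depend on the
close waking up the poll. the sketch below uses POSIX poll(...) on a pipe as
a stand-in and assumes scif_poll(...) can be given a finite timeout in the
same way; listener_ctx, listener_loop and demo_stop_flag are hypothetical
names.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical per-listener state shared with the thread that stops it. */
struct listener_ctx {
    int        fd;    /* descriptor being listened on (a pipe in this demo) */
    atomic_int stop;  /* set to 1 to ask the listener to exit */
};

static void *listener_loop(void *arg)
{
    struct listener_ctx *ctx = arg;
    struct pollfd pfd;

    assert(ctx != NULL);
    pfd.fd = ctx->fd;
    pfd.events = POLLIN;

    while (!atomic_load(&ctx->stop)) {
        int n = poll(&pfd, 1, 50);      /* wake up at least every 50 ms */
        if (n < 0)
            break;                      /* poll failed: give up */
        if (n > 0) {
            char buf[64];
            if (!(pfd.revents & POLLIN))
                break;                  /* POLLHUP/POLLERR: peer is gone */
            if (read(ctx->fd, buf, sizeof buf) <= 0)
                break;                  /* EOF or read error */
        }
        /* n == 0: timeout, loop around and re-check the stop flag */
    }
    return NULL;
}

/* Returns 0 if the listener could be stopped and joined, -1 on setup error. */
static int demo_stop_flag(void)
{
    int pipefd[2];
    struct listener_ctx ctx;
    pthread_t tid;

    if (pipe(pipefd) != 0)
        return -1;
    ctx.fd = pipefd[0];
    atomic_init(&ctx.stop, 0);

    if (pthread_create(&tid, NULL, listener_loop, &ctx) != 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    usleep(120 * 1000);                 /* let the listener time out a few times */
    atomic_store(&ctx.stop, 1);         /* request shutdown */
    pthread_join(tid, NULL);            /* returns within one poll timeout */

    close(pipefd[0]);                   /* close only after the thread is gone */
    close(pipefd[1]);
    return 0;
}
```

the cost is a periodic wakeup and up to one timeout of extra latency at
shutdown; in exchange the join no longer relies on how MPSS handles a
close during a blocked poll.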

Gilles