Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-05-14 10:45:47


Looks like this is a scif bug. From the documentation:

scif_poll() waits for one of a set of endpoints to become ready to perform an I/O operation;
it is syntactically and semantically very similar to poll() . The SCIF functions on which
scif_poll() waits are scif_accept(), scif_send(), and scif_recv(). Consult the SCIF
API reference manuals for details on scif_poll() usage.

So, if it is indeed similar to poll() it should wake up when the file
descriptor is closed.

Since that is not the case I will look through the documentation and see
if there is a way other than pthread_cancel.
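One common way to stop a polling thread without pthread_cancel is a wake-up descriptor. The sketch below shows the idea with plain POSIX poll() and pipe() only. It does not call any SCIF function, and it does not assume that scif_poll() accepts ordinary descriptors. The struct and function names are hypothetical and are not taken from btl/scif.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical listener state; the names are illustrative only. */
struct listener {
    pthread_t thread;
    int listen_fd;     /* descriptor the thread waits on for real work */
    int wake_pipe[2];  /* [0] is polled by the thread, [1] is written at shutdown */
};

static void *listener_main(void *arg)
{
    struct listener *l = arg;
    struct pollfd fds[2] = {
        { .fd = l->listen_fd,    .events = POLLIN },
        { .fd = l->wake_pipe[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR) {
                continue;          /* interrupted by a signal: retry */
            }
            break;                 /* unexpected error: give up */
        }
        if (fds[1].revents & (POLLIN | POLLHUP)) {
            break;                 /* shutdown was requested */
        }
        if (fds[0].revents & POLLIN) {
            char c;
            (void)read(l->listen_fd, &c, 1);   /* placeholder for real work */
        } else if (fds[0].revents & (POLLERR | POLLHUP | POLLNVAL)) {
            break;                 /* the listening descriptor went away */
        }
    }
    return NULL;
}

/* Create the wake-up pipe and start the thread. Returns 0 on success. */
int listener_start(struct listener *l, int listen_fd)
{
    l->listen_fd = listen_fd;
    if (pipe(l->wake_pipe) != 0) {
        return -1;
    }
    if (pthread_create(&l->thread, NULL, listener_main, l) != 0) {
        close(l->wake_pipe[0]);
        close(l->wake_pipe[1]);
        return -1;
    }
    return 0;
}

/* Wake the thread through the pipe, then join it. No pthread_cancel needed. */
int listener_stop(struct listener *l)
{
    assert(l != NULL);
    const char byte = 0;

    if (write(l->wake_pipe[1], &byte, 1) != 1) {
        return -1;
    }
    if (pthread_join(l->thread, NULL) != 0) {
        return -1;
    }
    close(l->wake_pipe[0]);
    close(l->wake_pipe[1]);
    return 0;
}
```

With this pattern the join does not depend on how the poll call reacts to a close from another thread, because the thread is woken by readable data instead.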

-Nathan

On Wed, May 14, 2014 at 11:18:05AM +0900, Gilles Gouaillardet wrote:
> Folks,
>
> i would like to comment on r31738 :
>
> > There is no reason to cancel the listening thread. It should die
> > automatically when the file descriptor is closed.
> i could not agree more
> > It is sufficient to just wait for the thread to exit with pthread join.
> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
> is *not* :-(
>
> this is what i described in #4615
> https://svn.open-mpi.org/trac/ompi/ticket/4615
> in which i attached scif_hang.c, which shows that (at least in my
> environment)
> scif_poll(...) does *not* return after scif_close(...) is called, and
> hence the scif pthread never ends.
>
> this is likely a bug in MPSS and it might have been fixed in a later
> release.
>
> Nathan, could you try scif_hang in your environment and report the MPSS
> version you are running ?
>
>
> bottom line, and once again, in my test environment, pthread_join (...)
> without pthread_cancel(...)
> might cause a hang when the btl/scif module is released.
>
>
> assuming the bug is in old MPSS and has been fixed in recent releases,
> what is the OpenMPI policy ?
> a) test the MPSS version and call pthread_cancel() or do *not* call
> pthread_join if buggy MPSS is detected ?
> b) display an error/warning if a buggy MPSS is detected ?
> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
> MPSS, it is in MPI_Finalize() so impact is limited */
> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
> problem after all ?
> e) something else ?
>
> Gilles
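For reference, the cancel-then-join sequence discussed above can be sketched as follows. This is a minimal illustration, not the btl/scif code: the thread blocks in plain POSIX poll(), which is a cancellation point, as a stand-in for scif_poll(). Whether scif_poll() is itself a cancellation point is not established here. The function names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Stand-in for the listening thread: blocks in poll(), a POSIX
 * cancellation point, on a descriptor that never becomes readable. */
static void *blocked_listener(void *arg)
{
    assert(arg != NULL);
    struct pollfd pfd = { .fd = *(int *)arg, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);
    }
    return NULL;
}

/* Request cancellation, then wait for the thread to finish.
 * Returns 1 if the thread was cancelled, 0 if it exited on its own,
 * and -1 on error. */
int cancel_and_join(pthread_t thread)
{
    void *result = NULL;

    if (pthread_cancel(thread) != 0) {
        return -1;
    }
    if (pthread_join(thread, &result) != 0) {
        return -1;
    }
    return result == PTHREAD_CANCELED ? 1 : 0;
}

/* Start a thread that would otherwise block forever, then stop it. */
int demo_cancel_blocked_listener(void)
{
    int fds[2];
    pthread_t thread;
    int outcome;

    if (pipe(fds) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, blocked_listener, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    outcome = cancel_and_join(thread);   /* fds stays alive until the join */
    close(fds[0]);
    close(fds[1]);
    return outcome;
}
```

Without the pthread_cancel call, the pthread_join in this sketch would never return, which is the hang described for the older MPSS.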
