As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere - surely we can resolve it here as well in an acceptable manner.
On May 13, 2014, at 7:33 PM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:
> scif_poll(...) is called with an infinite timeout.
> a quick fix would be to use a finite timeout (1s ? 10s ? more ?)
> the obvious drawback is that the thread would have to wake up every
> xxx seconds, and 99.9% of the time that wake-up would be for nothing.
> my analysis (see #4615) is that the crash occurs when the btl/scif
> component is unloaded from memory (e.g. via dlclose()) while the
> scif_thread is still running.
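[For illustration, a minimal sketch of the finite-timeout workaround described above, using the standard poll() call as a stand-in for scif_poll() (the flag name and loop structure are hypothetical, not the actual btl/scif code):]

```c
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical shutdown flag, set by the main thread before pthread_join(). */
static atomic_bool shutdown_requested = false;

/* Progress-thread loop: poll with a finite timeout so the flag is
 * re-checked periodically instead of blocking forever. */
static void *listen_loop(void *arg)
{
    struct pollfd pfd = { .fd = *(int *)arg, .events = POLLIN };

    while (!atomic_load(&shutdown_requested)) {
        int rc = poll(&pfd, 1, 1000 /* ms */);  /* wake at least once a second */
        if (rc < 0) {
            break;                              /* error or interrupted */
        }
        if (rc > 0) {
            if (pfd.revents & (POLLHUP | POLLERR)) {
                break;                          /* serviced fd went away */
            }
            /* service the incoming connection request here */
        }
        /* rc == 0: timeout; loop back and re-check the flag */
    }
    return NULL;
}
```

[This trades the wasted periodic wake-ups for a bounded shutdown latency: pthread_join() returns within one timeout interval even if scif_poll() never notices the close.]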
> On 2014/05/14 11:25, Ralph Castain wrote:
>> It could be a bug in the software stack, though I wouldn't count on it. Unfortunately, pthread_cancel is known to have bad side effects, and so we avoid its use.
>> The key here is that the thread must detect that the file descriptor has closed and exit, or use some other method for detecting that it should terminate. We do this in multiple other places in the code, without using pthread_cancel and without hanging. So it is certainly doable.
>> I don't know the specifics of why Nathan's code is having trouble exiting, but I suspect that a simple solution - not involving pthread_cancel - can be readily developed.
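[One fd-based wakeup pattern that matches the "some other method for detecting that it should terminate" suggestion above is the classic self-pipe trick: the thread polls the real descriptor plus the read end of a pipe, and the main thread closes the write end to wake it for shutdown. A generic sketch, again with standard poll() standing in for scif_poll() and all names hypothetical:]

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical listener state: the real fd plus one end of a pipe
 * used purely to wake the thread up for shutdown. */
struct listener {
    int fd;       /* the descriptor being serviced */
    int wake_fd;  /* read end of the shutdown pipe */
};

static void *listen_loop_pipe(void *arg)
{
    struct listener *l = arg;
    struct pollfd pfds[2] = {
        { .fd = l->fd,      .events = POLLIN },
        { .fd = l->wake_fd, .events = POLLIN },
    };

    for (;;) {
        if (poll(pfds, 2, -1) < 0) {
            break;                               /* error or interrupted */
        }
        if (pfds[1].revents & (POLLIN | POLLHUP)) {
            break;  /* main thread wrote to / closed the pipe: clean exit */
        }
        if (pfds[0].revents & (POLLHUP | POLLERR)) {
            break;  /* serviced fd went away */
        }
        /* handle pfds[0].revents & POLLIN here */
    }
    return NULL;
}

/* Shutdown side: close(wake_pipe_write_end); poll() then reports POLLHUP
 * on wake_fd, the thread exits, and pthread_join() returns without
 * pthread_cancel(). */
```

[The appeal of this pattern is that it needs no timeout at all and does not depend on scif_poll() noticing that the serviced descriptor was closed.]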
>> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:
>>> i would like to comment on r31738 :
>>>> There is no reason to cancel the listening thread. It should die
>>>> automatically when the file descriptor is closed.
>>> i could not agree more
>>>> It is sufficient to just wait for the thread to exit with pthread join.
>>> unfortunately, at least in my test environment (an outdated MPSS 2.1), it
>>> is *not* :-(
>>> this is what i described in #4615,
>>> in which i attached scif_hang.c, which evidences that (at least in my
>>> environment) scif_poll(...) does *not* return after scif_close(...) is
>>> called, and hence the scif pthread never ends.
>>> this is likely a bug in MPSS, and it might have been fixed in more
>>> recent releases.
>>> Nathan, could you try scif_hang in your environment and report the MPSS
>>> version you are running ?
>>> bottom line, and once again: in my test environment, pthread_join(...)
>>> without pthread_cancel(...) might cause a hang when the btl/scif module
>>> is released.
>>> assuming the bug is in old MPSS and has been fixed in recent releases,
>>> what is the Open MPI policy ?
>>> a) test the MPSS version and call pthread_cancel() or do *not* call
>>> pthread_join if buggy MPSS is detected ?
>>> b) display an error/warning if a buggy MPSS is detected ?
>>> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
>>> MPSS, it is in MPI_Finalize() so impact is limited */
>>> d) do nothing, let the btl/scif module hang, this is *not* an Open MPI
>>> problem after all ?
>>> e) something else ?
>>> devel mailing list
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14786.php