Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-05-14 11:10:28


There seems to be a consensus that closing an fd should trigger the return from poll. Unfortunately, this assumption is wrong, and it is not supported by any documentation I could find online.

To be clearer, all the documentation I know of tends to point in the opposite direction: it is unwise to close a socket that some other thread is polling on. For example, the Linux close(2) man page carries a warning about this usage:
> It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be reused, there are some obscure race conditions that may cause unintended side effects.

Extra info available at http://stackoverflow.com/questions/10561602/closing-a-file-descriptor-that-is-being-polled
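
A minimal sketch of one commonly used alternative, the self-pipe trick: instead of closing the descriptor under the polling thread, the thread also polls a dedicated wake-up pipe, and the main thread writes one byte to it before calling pthread_join. This assumes plain POSIX poll() on ordinary pipes; whether scif_poll() can be woken the same way is not something this example establishes. The names wake_pipe, listener and run_listener_shutdown_demo are hypothetical and are not taken from btl/scif.

```c
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical names, for illustration only (not from btl/scif). */
static int wake_pipe[2];        /* [0] is polled by the listener, [1] is written by main */
static int listener_exited = 0; /* read by main only after pthread_join() */

static void *listener(void *arg)
{
    int data_fd = *(int *)arg;
    struct pollfd fds[2] = {
        { .fd = data_fd,      .events = POLLIN },
        { .fd = wake_pipe[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            break;                  /* unexpected poll failure: give up */
        }
        if (fds[1].revents != 0)
            break;                  /* wake-up byte (or pipe error): leave the loop */
        if (fds[0].revents & POLLIN) {
            char buf[64];
            if (read(data_fd, buf, sizeof buf) <= 0)
                break;              /* EOF or error on the data descriptor */
        } else if (fds[0].revents != 0) {
            break;                  /* POLLHUP/POLLERR/POLLNVAL on the data descriptor */
        }
    }
    listener_exited = 1;
    return NULL;
}

/* Returns 1 if the listener exited after the wake-up, -1 on setup failure. */
int run_listener_shutdown_demo(void)
{
    int data_pipe[2];
    pthread_t tid;
    char byte = 'x';

    if (pipe(data_pipe) != 0)
        return -1;
    if (pipe(wake_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, listener, &data_pipe[0]) != 0)
        return -1;

    /* Ask the listener to stop; no descriptor is closed while it is polled. */
    if (write(wake_pipe[1], &byte, 1) != 1)
        return -1;
    pthread_join(tid, NULL);

    /* Only now, with no other thread using them, close the descriptors. */
    close(data_pipe[0]);
    close(data_pipe[1]);
    close(wake_pipe[0]);
    close(wake_pipe[1]);
    return listener_exited;
}
```

With this pattern the join cannot hang on a poll that never notices the close, and no descriptor number can be reused while another thread is still blocked on it.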

  George.

On May 13, 2014, at 22:18 , Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:

> Folks,
>
> i would like to comment on r31738 :
>
>> There is no reason to cancel the listening thread. It should die
>> automatically when the file descriptor is closed.
> i could not agree more
>> It is sufficient to just wait for the thread to exit with pthread join.
> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
> is *not* :-(
>
> this is what i described in #4615
> https://svn.open-mpi.org/trac/ompi/ticket/4615
> in which i attached scif_hang.c, which shows that (at least in my
> environment)
> scif_poll(...) does *not* return after scif_close(...) is called, and
> hence the scif pthread never ends.
>
> this is likely a bug in MPSS and it might have been fixed in a later
> release.
>
> Nathan, could you try scif_hang in your environment and report the MPSS
> version you are running ?
>
>
> bottom line, and once again, in my test environment, pthread_join (...)
> without pthread_cancel(...)
> might cause a hang when the btl/scif module is released.
>
>
> assuming the bug is in old MPSS and has been fixed in recent releases,
> what is the OpenMPI policy ?
> a) test the MPSS version and call pthread_cancel() or do *not* call
> pthread_join if buggy MPSS is detected ?
> b) display an error/warning if a buggy MPSS is detected ?
> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
> MPSS, it is in MPI_Finalize() so impact is limited */
> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
> problem after all ?
> e) something else ?
>
> Gilles
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14786.php