Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-05-13 22:56:55


George,

Just my USD0.02:

With pthreads, many system calls (mostly those that might block) become
"cancellation points" where the implementation checks if the calling thread
has been cancelled.
This means that a thread making any of those calls may simply never return
(calling pthread_exit() internally), unless extra work has been done to
prevent this default behavior.
This makes it very hard to write code that properly cleans up its
resources, including (but not limited to) file descriptors and malloc()ed
memory.
Even if Open MPI is written very carefully, one cannot assume that all the
libraries it calls (and their dependencies, etc.) are written to properly
deal with cancellation.
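
A minimal sketch of the problem described above, assuming a POSIX system with
the default deferred cancellation type. The names demo_state, release_buf,
listener, and run_demo are hypothetical and do not come from Open MPI or the
btl/scif code. A thread blocks in read() on a pipe that never receives data
and is then cancelled. Without a cleanup handler the code after read() never
runs, so the buffer is never released by the thread. With
pthread_cleanup_push() the handler runs during cancellation.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical demo state; none of these names come from Open MPI. */
struct demo_state {
    int   use_cleanup;  /* nonzero: register a cleanup handler */
    int   fd;           /* read end of a pipe that never gets data */
    char *buf;          /* resource owned by the thread */
    int   buf_freed;    /* set once the thread side has released buf */
    int   returned;     /* set only if read() returns to its caller */
};

/* Releases the thread's buffer; usable as a cancellation cleanup handler. */
void release_buf(void *arg)
{
    struct demo_state *st = arg;
    free(st->buf);
    st->buf = NULL;
    st->buf_freed = 1;
}

void *listener(void *arg)
{
    struct demo_state *st = arg;
    char c;
    ssize_t n;

    st->buf = malloc(4096);
    if (st->use_cleanup) {
        pthread_cleanup_push(release_buf, st);
        n = read(st->fd, &c, 1);   /* cancellation point */
        (void)n;
        st->returned = 1;          /* skipped when cancelled */
        pthread_cleanup_pop(1);    /* also runs the handler on normal exit */
    } else {
        n = read(st->fd, &c, 1);   /* cancellation point */
        (void)n;
        st->returned = 1;          /* skipped when cancelled */
        release_buf(st);           /* skipped too: the thread leaks buf */
    }
    return NULL;
}

/* Returns 1 if the thread released its buffer, 0 if not, -1 on setup error. */
int run_demo(int use_cleanup, int *read_returned)
{
    struct demo_state st = {0};
    int fds[2];
    pthread_t tid;
    int freed;

    if (pipe(fds) != 0)
        return -1;
    st.use_cleanup = use_cleanup;
    st.fd = fds[0];
    if (pthread_create(&tid, NULL, listener, &st) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    pthread_cancel(tid);       /* acted on at the thread's next cancellation point */
    pthread_join(tid, NULL);

    freed = st.buf_freed;
    *read_returned = st.returned;
    free(st.buf);              /* the demo can still reach the pointer; a no-op if already freed */
    close(fds[0]);
    close(fds[1]);
    return freed;
}
```

Calling run_demo(0, &r) and run_demo(1, &r) shows the difference: in both
cases read() never returns to the thread, and only the second variant releases
the buffer from within the thread. The handler only covers code that registers
it, which is why third-party libraries that do not do so remain a concern.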

-Paul

On Tue, May 13, 2014 at 7:32 PM, George Bosilca <bosilca_at_[hidden]> wrote:

> I heard multiple references to pthread_cancel being known to have bad
> side effects. Can somebody educate me on this topic please?
>
> Thanks,
> George.
>
>
>
> On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > It could be a bug in the software stack, though I wouldn't count on it.
> > Unfortunately, pthread_cancel is known to have bad side effects, and so we
> > avoid its use.
> >
> > The key here is that the thread must detect that the file descriptor has
> > closed and exit, or use some other method for detecting that it should
> > terminate. We do this in multiple other places in the code, without using
> > pthread_cancel and without hanging. So it is certainly doable.
> >
> > I don't know the specifics of why Nathan's code is having trouble
> > exiting, but I suspect that a simple solution - not involving
> > pthread_cancel - can be readily developed.
> >
> >
> > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
> > <gilles.gouaillardet_at_[hidden]> wrote:
> >
> >> Folks,
> >>
> >> i would like to comment on r31738 :
> >>
> >>> There is no reason to cancel the listening thread. It should die
> >>> automatically when the file descriptor is closed.
> >> i could not agree more
> >>> It is sufficient to just wait for the thread to exit with pthread join.
> >> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
> >> is *not* :-(
> >>
> >> this is what i described in #4615
> >> https://svn.open-mpi.org/trac/ompi/ticket/4615
> >> in which i attached scif_hang.c that evidences that (at least in my
> >> environment)
> >> scif_poll(...) does *not* return after scif_close(...) is called, and
> >> hence the scif pthread never ends.
> >>
> >> this is likely a bug in MPSS and it might have been fixed in a later
> >> release.
> >>
> >> Nathan, could you try scif_hang in your environment and report the MPSS
> >> version you are running ?
> >>
> >>
> >> bottom line, and once again, in my test environment, pthread_join (...)
> >> without pthread_cancel(...)
> >> might cause a hang when the btl/scif module is released.
> >>
> >>
> >> assuming the bug is in old MPSS and has been fixed in recent releases,
> >> what is the OpenMPI policy ?
> >> a) test the MPSS version and call pthread_cancel() or do *not* call
> >> pthread_join if buggy MPSS is detected ?
> >> b) display an error/warning if a buggy MPSS is detected ?
> >> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
> >> MPSS, it is in MPI_Finalize() so impact is limited */
> >> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
> >> problem after all ?
> >> e) something else ?
> >>
> >> Gilles
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
> >
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900