
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-05-14 11:04:04


It sounds more like a suboptimal usage of the pthread cancellation
helpers than a real issue with pthread_cancel itself. I do agree the
usage is not necessarily straightforward, even for a veteran coder, but
the related issues belong to the realm of implementation, not the
conceptual level; a quick sketch of what I mean is below.

  George.
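
To be concrete, here is the kind of pattern I have in mind: an untested
sketch (plain POSIX poll() on a pipe, not the SCIF endpoint) in which the
listener keeps cancellation disabled except around the blocking call and
registers a cleanup handler, so that a pthread_cancel() delivered at the
poll() cancellation point still releases the thread's resources before
pthread_join() returns.

    #include <poll.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void listener_cleanup(void *arg)
    {
        free(arg);                  /* runs if we are cancelled below */
    }

    static void *listener(void *arg)
    {
        int fd = *(int *)arg;
        char *buf = malloc(4096);
        int old;

        /* non-cancellable by default */
        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
        pthread_cleanup_push(listener_cleanup, buf);

        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            /* only the blocking call is an effective cancellation point */
            pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
            int rc = poll(&pfd, 1, -1);
            pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);

            if (rc < 0 || (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)))
                break;              /* descriptor closed or broken */
            if (pfd.revents & POLLIN)
                (void)read(fd, buf, 4096);
        }

        pthread_cleanup_pop(1);     /* normal exit: run the handler too */
        return NULL;
    }

    int main(void)
    {
        int fds[2];
        pthread_t tid;

        if (pipe(fds) != 0)
            return 1;
        pthread_create(&tid, NULL, listener, &fds[0]);
        sleep(1);
        pthread_cancel(tid);        /* lands at the poll() above */
        pthread_join(tid, NULL);    /* cleanup handler has freed buf */
        close(fds[0]);
        close(fds[1]);
        return 0;
    }

Whether the thread is cancelled or woken by the descriptor going away, the
buffer is released exactly once, which is all the "extra work" the default
behavior requires.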

On Tue, May 13, 2014 at 10:56 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
> George,
>
> Just my USD0.02:
>
> With pthreads many system calls (mostly those that might block) become
> "cancellation points" where the implementation checks if the callinf thread
> has been cancelled.
> This means that a thread making any of those calls may simply never return
> (calling pthread_exit() internally), unless extra work has been done to
> prevent this default behavior.
> This makes it very hard to write code that properly cleans up its resources,
> including (but not limited to) file descriptors and malloc()ed memory.
> Even if Open MPI is written very carefully, one cannot assume that all the
> libraries it calls (and their dependencies, etc.) are written to properly
> deal with cancellation.
>
> -Paul
>
>
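
Right, in its most condensed form the hazard Paul describes looks like the
fragment below (hypothetical worker, deferred cancellation assumed): read()
is a cancellation point, so a pthread_cancel() that lands while the thread
is blocked there makes it exit inside read(), and the free() is never
reached.

    #include <stdlib.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        int fd = *(int *)arg;
        char *buf = malloc(4096);   /* resource owned by this thread */

        /* read() is a cancellation point: if the thread is cancelled
         * while blocked here it never returns, and the free() below
         * is silently skipped, leaking the buffer.                   */
        (void)read(fd, buf, 4096);

        free(buf);
        return NULL;
    }

Multiply that by every descriptor, lock and allocation held somewhere down
the call stack of a third-party library and the scale of the problem is
clear.
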
> On Tue, May 13, 2014 at 7:32 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>>
>> I heard multiple references to pthread_cancel being known to have bad
>> side effects. Can somebody educate me on this topic, please?
>>
>> Thanks,
>> George.
>>
>>
>>
>> On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> > It could be a bug in the software stack, though I wouldn't count on it.
>> > Unfortunately, pthread_cancel is known to have bad side effects, and so we
>> > avoid its use.
>> >
>> > The key here is that the thread must detect that the file descriptor has
>> > been closed and exit, or use some other method for detecting that it should
>> > terminate. We do this in multiple other places in the code, without using
>> > pthread_cancel and without hanging. So it is certainly doable.
>> >
>> > I don't know the specifics of why Nathan's code is having trouble
>> > exiting, but I suspect that a simple solution - not involving pthread_cancel
>> > - can be readily developed.
>> >
>> >
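
Ralph's point just above is the pattern I would aim for as well. A hedged
sketch (illustrative names, not the actual btl/scif sources) of a listener
that is told to exit through a second descriptor instead of being
cancelled:

    #include <poll.h>
    #include <pthread.h>
    #include <unistd.h>

    struct listener_args {
        int data_fd;        /* descriptor we actually service        */
        int shutdown_fd;    /* read end of a "please exit" pipe      */
    };

    static void *listener(void *p)
    {
        struct listener_args *a = p;

        for (;;) {
            struct pollfd pfd[2] = {
                { .fd = a->data_fd,     .events = POLLIN },
                { .fd = a->shutdown_fd, .events = POLLIN },
            };

            if (poll(pfd, 2, -1) < 0)
                break;
            if (pfd[1].revents & (POLLIN | POLLHUP | POLLERR))
                break;          /* main thread asked us to exit      */
            if (pfd[0].revents & (POLLERR | POLLHUP | POLLNVAL))
                break;          /* the data descriptor went away     */
            if (pfd[0].revents & POLLIN) {
                char buf[512];
                (void)read(a->data_fd, buf, sizeof(buf));
            }
        }
        return NULL;
    }

Shutdown from the main thread is then just a close() of the write end of
the pipe followed by pthread_join(); no pthread_cancel involved, and no
dependence on the transport ever noticing that the data descriptor was
closed.
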
>> > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
>> > <gilles.gouaillardet_at_[hidden]> wrote:
>> >
>> >> Folks,
>> >>
>> >> i would like to comment on r31738 :
>> >>
>> >>> There is no reason to cancel the listening thread. It should die
>> >>> automatically when the file descriptor is closed.
>> >> i could not agree more
>> >>> It is sufficient to just wait for the thread to exit with pthread
>> >>> join.
>> >> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
>> >> is *not* :-(
>> >>
>> >> this is what i described in #4615
>> >> https://svn.open-mpi.org/trac/ompi/ticket/4615
>> >> in which i attached scif_hang.c, which demonstrates that (at least in my
>> >> environment)
>> >> scif_poll(...) does *not* return after scif_close(...) is called, and
>> >> hence the scif pthread never ends.
>> >>
>> >> this is likely a bug in MPSS, and it might have been fixed in a more
>> >> recent release.
>> >>
>> >> Nathan, could you try scif_hang in your environment and report the MPSS
>> >> version you are running?
>> >>
>> >>
>> >> bottom line, and once again: in my test environment, pthread_join(...)
>> >> without a prior pthread_cancel(...)
>> >> can cause a hang when the btl/scif module is released.
>> >>
>> >>
>> >> assuming the bug is in old MPSS and has been fixed in recent releases,
>> >> what is the Open MPI policy?
>> >> a) test the MPSS version and call pthread_cancel(), or do *not* call
>> >> pthread_join(), if a buggy MPSS is detected?
>> >> b) display an error/warning if a buggy MPSS is detected?
>> >> c) do not call pthread_join() at all? /* a SIGSEGV might occur with older
>> >> MPSS, but it is in MPI_Finalize() so the impact is limited */
>> >> d) do nothing and let the btl/scif module hang, since this is *not* an
>> >> Open MPI problem after all?
>> >> e) something else?
>> >>
>> >> Gilles
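
If scif_poll() on that MPSS release really never notices the closed
endpoint, one possible stop-gap is to poll with a finite timeout and
re-check a shutdown flag on every wakeup, so the pthread_join() in
MPI_Finalize() cannot hang forever. This is sketched from memory of the
generic SCIF interface (treat the struct and flag names as approximate),
and is not taken from the actual btl/scif code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <scif.h>

    /* set by the main thread just before scif_close()/pthread_join();
     * a real implementation would prefer an atomic flag here          */
    static volatile bool scif_listener_shutdown = false;

    static void *scif_listener(void *arg)
    {
        scif_epd_t epd = *(scif_epd_t *)arg;

        while (!scif_listener_shutdown) {
            struct scif_pollepd ep = { .epd = epd, .events = SCIF_POLLIN };

            /* finite timeout: even if this MPSS never reports the closed
             * endpoint, we wake up at least every 100 ms and re-check the
             * flag, so the join cannot block indefinitely                */
            int rc = scif_poll(&ep, 1, 100);
            if (rc < 0 || (ep.revents & (SCIF_POLLERR | SCIF_POLLHUP)))
                break;
            if (ep.revents & SCIF_POLLIN) {
                /* ... handle the incoming connection/message ... */
            }
        }
        return NULL;
    }

The 100 ms timeout is arbitrary; it only bounds how long the join can lag
behind the close.
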
>> >
>
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>