Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2008-12-11 14:55:48


Jeff Squyres wrote:
> On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
>
>> It sounds reasonable to me. I agree with Ralf W about having mpirun
>> send a STOP to itself - that would seem to solve the problem about
>> stopping everything.
>>
>> It would seem, however, that you cannot similarly STOP the daemons or
>> else you won't be able to CONT the job. I'm not sure how big a deal
>> that is - I can't think of any issue it creates offhand.
>>
>> Is there any issue in the MPI comm layers if you abruptly STOP a
>> process while it's communicating? Especially since the STOP is going
>> to be asynchronous. Do you need to quiet networks like IB first?
>
> It might be better to allow the MPI procs to do "something" before
> actually stopping. This might prevent timeout-sensitive stuff from
> failing (although I don't know if Josh's CR code even handles these
> kinds of things...?). The obvious case that I can think of is if the
> MPI process is stopped in the middle of an openib CM action. None of
> the openib CPC's can currently handle a timeout nicely.
>
Well, SGE lets you have it send mpirun a SIGUSR1 some number of
minutes before it sends the suspend signal.

--td