
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-12-11 15:19:10

I would expect that you will hit problems with timeouts throughout the
codebase as Jeff mentioned, particularly with network connections.
Having a 'prepare to suspend' signal followed by a 'suspend now'
signal might work since it should provide enough of a window to ready
the application for the suspension.
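
A minimal sketch of that two-signal idea, written in Python purely for
illustration (it is not Open MPI code). The assumptions are mine:
SIGUSR1 plays the 'prepare to suspend' role, SIGTSTP is 'suspend now',
and the class name, hook lists and `stop` parameter are all
hypothetical.

```python
import os
import signal


class SuspendCoordinator:
    """Two-phase suspend: SIGUSR1 = prepare, SIGTSTP = suspend now.

    Hypothetical helper for illustration only; none of these names
    come from Open MPI.
    """

    def __init__(self, stop=None):
        self.prepare_hooks = []  # e.g. quiesce network connections
        self.resume_hooks = []   # e.g. re-arm timers, reconnect
        self.prepared = False
        # SIGSTOP cannot be caught, so sending it to ourselves really stops us.
        self._stop = stop or (lambda: os.kill(os.getpid(), signal.SIGSTOP))

    def install(self):
        """Register the handlers (must be called from the main thread)."""
        signal.signal(signal.SIGUSR1, self.on_prepare)
        signal.signal(signal.SIGTSTP, self.on_suspend)
        signal.signal(signal.SIGCONT, self.on_resume)

    def on_prepare(self, signum, frame):
        """Warning signal: run the 'get ready' hooks once."""
        if not self.prepared:
            for hook in self.prepare_hooks:
                hook()
            self.prepared = True

    def on_suspend(self, signum, frame):
        """Suspend signal: prepare if no warning arrived, then stop."""
        self.on_prepare(signum, frame)
        self._stop()

    def on_resume(self, signum, frame):
        """SIGCONT: undo whatever the prepare hooks did."""
        self.prepared = False
        for hook in self.resume_hooks:
            hook()
```

A process would create one coordinator, append its own hooks, and call
`install()` at startup. The window between the two signals is whatever
the sender chooses to leave.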

I think the first step is to try it, being sure to let the process
stay suspended for a considerable amount of time (15 min to an hour at
least) and see what effects this has. I would expect a series of
errors, but I haven't tried it so I can't say for sure.
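
Such a test could be driven by something like the sketch below. It
assumes you already know the pid of the process to suspend and that
SIGTSTP actually reaches the processes you care about, which is the
open question in this thread. The function name is hypothetical.

```python
import os
import signal
import time


def suspend_for(pid, seconds):
    """Send SIGTSTP to `pid`, wait `seconds`, then send SIGCONT.

    Returns the time that elapsed between the two signals.
    """
    start = time.monotonic()
    os.kill(pid, signal.SIGTSTP)
    try:
        time.sleep(seconds)
    finally:
        # Always try to resume, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)
    return time.monotonic() - start
```

For example, `suspend_for(mpirun_pid, 15 * 60)` would hold a job for
15 minutes, where `mpirun_pid` is a placeholder for the real pid.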

If there are errors, then the internal notification mechanism that the
C/R code uses would be a good place to start looking, since it handles
these kinds of issues for a checkpoint operation.

-- Josh

On Dec 11, 2008, at 3:08 PM, Jeff Squyres wrote:

> On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
>> Well, SGE lets you have it send mpirun SIGUSR1 some number of
>> minutes before it sends the suspend signal.
> My point is that the right approach might be to work in the context
> of Josh's CR stuff -- he's already got hooks for "do this right
> before pausing for checkpoint" / "do this right after resuming", etc.
> Sure, we're not checkpointing, but several of the characteristics of
> this action are pretty similar to what is required for checkpointing/
> restarting. So it might be good to use that framework for it...?
> --
> Jeff Squyres
> Cisco Systems