I would expect you to hit problems with timeouts throughout the
codebase, as Jeff mentioned, particularly with network connections.
Having a 'prepare to suspend' signal followed by a 'suspend now'
signal might work since it should provide enough of a window to ready
the application for the suspension.
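
To make the two-signal idea concrete, here is a minimal sketch in
Python rather than in the Open MPI code itself. It assumes SIGUSR1 is
the 'prepare to suspend' signal (as in Terry's SGE example) and that
the 'suspend now' signal is SIGSTOP, which a process cannot catch, so
the only other hook available is a SIGCONT handler on resume. The
class and method names are hypothetical, and the quiesce/recover steps
are left as comments.

```python
import os
import signal


class SuspendAwareApp:
    """Hypothetical wrapper for illustration; not part of Open MPI."""

    def __init__(self):
        self.state = "running"
        self.events = []

    def on_prepare(self, signum, frame):
        # 'Prepare to suspend': a real application would quiesce here,
        # e.g. stop issuing new network operations.
        self.state = "prepared"
        self.events.append("prepare")

    def on_resume(self, signum, frame):
        # Runs when the process is continued; a real application would
        # re-check connections that may have timed out while stopped.
        self.state = "running"
        self.events.append("resume")

    def install(self):
        # SIGSTOP itself cannot be caught, so only the warning signal
        # and the continue signal get handlers.
        signal.signal(signal.SIGUSR1, self.on_prepare)
        signal.signal(signal.SIGCONT, self.on_resume)
```

After calling install(), sending the process `kill -USR1 <pid>` should
move it to the "prepared" state, and a later `kill -CONT <pid>` should
move it back to "running". Whether the window between the two signals
is long enough to quiesce a real MPI job is exactly the open question.
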
I think the first step is to try it, being sure to let the process
stay suspended for a considerable amount of time (15 min to an hour at
least) and see what effects this has. I would expect a series of
errors, but I haven't tried it so I can't say for sure.
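
The experiment itself could be driven by something like the sketch
below. It is only an illustration: suspend_for is a hypothetical
helper, it stops a single process by pid, and it makes no attempt to
stop the other processes of an MPI job, which a real test would have
to do as well.

```python
import os
import signal
import time


def suspend_for(pid, seconds):
    """Stop process `pid`, wait `seconds`, then continue it.

    Returns the wall-clock time spent between stop and continue.
    """
    start = time.monotonic()
    os.kill(pid, signal.SIGSTOP)      # 'suspend now'; cannot be caught
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even on Ctrl-C
    return time.monotonic() - start
```

For the test described above, one would call something like
suspend_for(pid, 15 * 60) and then watch the job's output for timeout
or connection errors after it resumes.
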
If there are errors, the internal notification mechanism that the C/R
(checkpoint/restart) framework uses would be a good place to start,
since it already handles these types of issues for a checkpoint
operation.

On Dec 11, 2008, at 3:08 PM, Jeff Squyres wrote:
> On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
>> Well under SGE it allows you to have SGE send mpirun SIGUSR1 so
>> many minutes before sending the Suspend signal.
> My point is that the right approach might be to work in the context
> of Josh's CR stuff -- he's already got hooks for "do this right
> before pausing for checkpoint" / "do this right after resuming", etc.
> Sure, we're not checkpointing, but several of the characteristics of
> this action are pretty similar to what is required for checkpointing/
> restarting. So it might be good to use that framework for it...?
> Jeff Squyres
> Cisco Systems