I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully
work anyway -- and RDMA CM) will time out and Bad Things will happen if
you interrupt it in the middle of some network transactions. The
(kernel-imposed) timeout for RDMA CM is pretty short -- on the order of
a minute or two.

On Dec 11, 2008, at 3:19 PM, Josh Hursey wrote:
> I would expect that you will hit problems with timeouts throughout
> the codebase as Jeff mentioned, particularly with network
> connections. Having a 'prepare to suspend' signal followed by a
> 'suspend now' signal might work since it should provide enough of a
> window to ready the application for the suspension.
> I think the first step is to try it, being sure to let the process
> stay suspended for a considerable amount of time (15 min to an hour
> at least) and see what effects this has. I would expect a series of
> errors, but I haven't tried it so I can't say for sure.
> If there are errors then looking at the internal notification stuff
> that the C/R mechanism uses would be a good place to start since it
> handles these types of issues for a checkpoint operation.
> -- Josh
> On Dec 11, 2008, at 3:08 PM, Jeff Squyres wrote:
>> On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
>>> Well, SGE lets you have it send mpirun a SIGUSR1 some number of
>>> minutes before it sends the suspend signal.
>> My point is that the right approach might be to work in the context
>> of Josh's CR stuff -- he's already got hooks for "do this right
>> before pausing for checkpoint" / "do this right after resuming", etc.
>> Sure, we're not checkpointing, but several of the characteristics
>> of this action are pretty similar to what is required for
>> checkpointing/restarting. So it might be good to use that
>> framework for it...?
>> Jeff Squyres
>> Cisco Systems
>> devel mailing list
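
For illustration only, here is a minimal Python sketch of the "prepare to
suspend" followed by "suspend now" idea discussed in this thread. It is not
Open MPI code and does not use the C/R framework's actual hooks. It assumes
the resource manager sends SIGUSR1 some time before the suspend (as described
for SGE above) and that the process sees SIGCONT when it is resumed. The
names SuspendCoordinator, quiesce, and reconnect are hypothetical; quiesce
stands in for "do this right before pausing" and reconnect for "do this
right after resuming".

```python
import signal


class SuspendCoordinator:
    """Track 'prepare to suspend' and 'resumed' notifications for one process.

    quiesce and reconnect are hypothetical callbacks supplied by the caller,
    e.g. to drain and close network connections and to re-establish them.
    """

    def __init__(self, quiesce, reconnect):
        self.quiesce = quiesce
        self.reconnect = reconnect
        self.prepare_requested = False
        self.resumed = False
        self.quiesced = False

    def on_prepare(self, signum, frame):
        # Keep the handler trivial: only record that a suspend is coming.
        self.prepare_requested = True

    def on_resume(self, signum, frame):
        # Likewise, only record that the process has been continued.
        self.resumed = True

    def install(self, prepare_sig=signal.SIGUSR1, resume_sig=signal.SIGCONT):
        # Assumption: SIGUSR1 is the advance warning, SIGCONT marks the resume.
        signal.signal(prepare_sig, self.on_prepare)
        signal.signal(resume_sig, self.on_resume)

    def poll(self):
        """Call from the main loop at a point with no transaction in flight."""
        if self.prepare_requested and not self.quiesced:
            self.quiesce()
            self.quiesced = True
            self.prepare_requested = False
        if self.resumed:
            self.resumed = False
            if self.quiesced:
                self.reconnect()
                self.quiesced = False
```

A process would call install() once at startup and poll() from its progress
loop. The real work happens in poll() rather than in the signal handlers, so
connections are only torn down at a point where no network transaction is
half-finished, which is the window the advance warning is meant to provide.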