Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-12-11 15:30:40


I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully
work anyway -- and RDMA CM) will time out, and Bad Things will happen,
if you suspend a process in the middle of a network transaction. The
(kernel-imposed) timeout for RDMA CM is pretty short -- on the order of
a minute or two.
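
For illustration only, the 'prepare to suspend' / 'suspend now' idea
quoted below could be sketched roughly like this. It is a Python toy, not
Open MPI code: SuspendCoordinator, quiesce, resume and stop_self are
hypothetical names, and using SIGUSR1 as the warning signal is an
assumption taken from the SGE behavior Terry describes.

```python
import os
import signal


class SuspendCoordinator:
    """Two-phase suspend: a warning signal first, then the actual stop."""

    def __init__(self, quiesce, resume, stop_self=None):
        # quiesce/resume are hypothetical application callbacks, e.g. drain
        # in-flight network traffic before the stop and reconnect after it.
        self.quiesce = quiesce
        self.resume = resume
        # Default: really stop this process. SIGSTOP cannot be caught, so the
        # process halts here until something sends it SIGCONT.
        self.stop_self = stop_self or (
            lambda: os.kill(os.getpid(), signal.SIGSTOP)
        )
        self.prepared = False

    def on_prepare(self, signum, frame):
        # 'Prepare to suspend': quiesce once, however many warnings arrive.
        if not self.prepared:
            self.quiesce()
            self.prepared = True

    def on_suspend(self, signum, frame):
        # 'Suspend now': quiesce if no warning was seen, then stop.
        self.on_prepare(signum, frame)
        self.stop_self()

    def on_continue(self, signum, frame):
        # After SIGCONT: undo the quiesce step if one was performed.
        if self.prepared:
            self.resume()
            self.prepared = False

    def install(self):
        # Call from the main thread; Python only allows handlers there.
        signal.signal(signal.SIGUSR1, self.on_prepare)
        signal.signal(signal.SIGTSTP, self.on_suspend)
        signal.signal(signal.SIGCONT, self.on_continue)
```

Python runs handlers in the main interpreter loop, so calling quiesce
directly from the handler is tolerable here. In C the handler would
presumably only set a flag for the progress loop to act on. None of this
addresses the real question of whether the connections survive a long
suspension.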

On Dec 11, 2008, at 3:19 PM, Josh Hursey wrote:

> I would expect that you will hit problems with timeouts throughout
> the codebase as Jeff mentioned, particularly with network
> connections. Having a 'prepare to suspend' signal followed by a
> 'suspend now' signal might work since it should provide enough of a
> window to ready the application for the suspension.
>
> I think the first step is to try it, being sure to let the process
> stay suspended for a considerable amount of time (15 min to an hour
> at least) and see what effects this has. I would expect a series of
> errors, but I haven't tried it so I can't say for sure.
>
> If there are errors then looking at the internal notification stuff
> that the C/R mechanism uses would be a good place to start since it
> handles these types of issues for a checkpoint operation.
>
> -- Josh
>
> On Dec 11, 2008, at 3:08 PM, Jeff Squyres wrote:
>
>> On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
>>
>>> Well, SGE allows you to have it send mpirun SIGUSR1 some number of
>>> minutes before it sends the suspend signal.
>>
>>
>> My point is that the right approach might be to work in the context
>> of Josh's CR stuff -- he's already got hooks for "do this right
>> before pausing for checkpoint" / "do this right after resuming", etc.
>>
>> Sure, we're not checkpointing, but several of the characteristics
>> of this action are pretty similar to what is required for
>> checkpointing/restarting. So it might be good to use that
>> framework for it...?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Jeff Squyres
Cisco Systems