Jeff Squyres wrote:
> On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
>> It sounds reasonable to me. I agree with Ralf W about having mpirun
>> send a STOP to itself - that would seem to solve the problem about
>> stopping everything.
>> It would seem, however, that you cannot similarly STOP the daemons or
>> else you won't be able to CONT the job. I'm not sure how big a deal
>> that is - I can't think of any issue it creates offhand.
>> Is there any issue in the MPI comm layers if you abruptly STOP a
>> process while it's communicating? Especially since the STOP is going
>> to be asynchronous. Do you need to quiet networks like IB first?
> It might be better to allow the MPI procs to do "something" before
> actually stopping. This might prevent timeout-sensitive stuff from
> failing (although I don't know if Josh's CR code even handles these
> kinds of things...?). The obvious case that I can think of is if the
> MPI process is stopped in the middle of an openib CM action. None of
> the openib CPCs can currently handle a timeout nicely.
Well, SGE can be set up to send mpirun a SIGUSR1 some number of minutes
before it sends the suspend signal, so there is a window in which to do
that "something".