Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-12-11 14:48:02


On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:

> It sounds reasonable to me. I agree with Ralf W about having mpirun
> send a STOP to itself - that would seem to solve the problem about
> stopping everything.
>
> It would seem, however, that you cannot similarly STOP the daemons
> or else you won't be able to CONT the job. I'm not sure how big a
> deal that is - I can't think of any issue it creates offhand.
>
> Is there any issue in the MPI comm layers if you abruptly STOP a
> process while it's communicating? Especially since the STOP is going
> to be asynchronous. Do you need to quiet networks like IB first?

It might be better to allow the MPI procs to do "something" before
actually stopping. This might prevent timeout-sensitive stuff from
failing (although I don't know if Josh's CR code even handles these
kinds of things...?). The obvious case that I can think of is if the
MPI process is stopped in the middle of an openib CM action. None of
the openib CPCs can currently handle a timeout nicely.
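
To make the "do something before actually stopping" idea concrete, here is
a minimal sketch in plain POSIX C. It is not Open MPI code: every function
name below (install_deferred_stop, stop_is_pending, stop_at_safe_point) and
the quiesce/resume callbacks are hypothetical, and it assumes the forwarded
signal arrives at the MPI process as SIGTSTP. The handler only records the
request; the process stops itself later, from a point where no
timeout-sensitive operation is in flight.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler and cleared at the safe point. volatile sig_atomic_t
 * is the type that is safe to write from a signal handler. */
static volatile sig_atomic_t stop_requested = 0;

/* Handler for the forwarded SIGTSTP: record the request, do nothing else. */
static void on_sigtstp(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Hypothetical helper: catch SIGTSTP instead of stopping immediately.
 * Returns 0 on success, -1 on error (as sigaction does). */
int install_deferred_stop(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigtstp;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGTSTP, &sa, NULL);
}

/* Hypothetical helper: has a stop been requested but not yet honored? */
int stop_is_pending(void)
{
    return stop_requested != 0;
}

/* Hypothetical helper: call from a point the application considers safe
 * (for example, outside any connection setup). The quiesce and resume
 * callbacks are placeholders for whatever "something" turns out to be.
 * Returns 1 if the process stopped and was continued, 0 if no stop was
 * pending. */
int stop_at_safe_point(void (*quiesce)(void), void (*resume)(void))
{
    if (!stop_requested) {
        return 0;
    }
    stop_requested = 0;
    if (quiesce) {
        quiesce();
    }
    /* SIGSTOP cannot be caught, so this really stops the process.
     * Execution continues on the next line once SIGCONT arrives. */
    raise(SIGSTOP);
    if (resume) {
        resume();
    }
    return 1;
}
```

A process would call install_deferred_stop() once at startup and then call
stop_at_safe_point() from its progress loop. The trade-off is that the stop
is no longer immediate: a process that never reaches a safe point never
stops, so a real implementation would likely need some bound on the delay.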

-- 
Jeff Squyres
Cisco Systems