On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
> It sounds reasonable to me. I agree with Ralf W about having mpirun
> send a STOP to itself - that would seem to solve the problem about
> stopping everything.
>
> It would seem, however, that you cannot similarly STOP the daemons
> or else you won't be able to CONT the job. I'm not sure how big a
> deal that is - I can't think of any issue it creates offhand.
>
> Is there any issue in the MPI comm layers if you abruptly STOP a
> process while it's communicating? Especially since the STOP is going
> to be asynchronous. Do you need to quiet networks like IB first?

It might be better to allow the MPI procs to do "something" before
actually stopping. This might prevent timeout-sensitive operations
from failing (although I don't know whether Josh's CR code even
handles these kinds of things...?). The obvious case that I can think
of is an MPI process being stopped in the middle of an openib CM
action. None of the openib CPCs can currently handle a timeout nicely.
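
To make the "do something before stopping" idea concrete, here is a rough
sketch. It is not Open MPI code. It assumes the stop request reaches each
MPI proc as a catchable signal such as SIGTSTP (SIGSTOP itself cannot be
caught), and the hooks quiesce_communication() and resume_communication()
are hypothetical placeholders for whatever the comm layer would need to do.

```python
import os
import signal

# Records the order of events so the flow is easy to inspect.
events = []


def quiesce_communication():
    """Hypothetical hook: finish or park in-flight, timeout-sensitive work."""
    events.append("quiesced")


def resume_communication():
    """Hypothetical hook: restart whatever was parked once the job continues."""
    events.append("resumed")


def stop_self():
    """Really suspend this process; execution resumes here after SIGCONT."""
    os.kill(os.getpid(), signal.SIGSTOP)


def handle_tstp(signum, frame, stop=stop_self):
    # Give the process a chance to reach a safe point first...
    quiesce_communication()
    # ...then actually stop. This call does not return until SIGCONT arrives.
    stop()
    # Back from the stop: undo the quiesce.
    resume_communication()


def install_handler():
    # Assumption: the launcher delivers SIGTSTP (catchable), not SIGSTOP.
    signal.signal(signal.SIGTSTP, handle_tstp)
```

The stop parameter exists only so the flow can be exercised without
suspending the process; a real handler would also have to keep the
quiesce step short and safe to run asynchronously.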
--
Jeff Squyres
Cisco Systems