Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Suspend/resume enhancements
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-01-05 00:49:48


I don't have any issue with this so long as (a) it is -only- active when someone sets a specific MCA param requesting it, and (b) that flag is -not- set by default.

On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:

> WHAT: Enhance the orte_forward_job_control MCA flag by:
>
> 1. Forwarding signals to descendants of launched processes; and
> 2. Forwarding signals received before process launch time.
>
> (The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT to
> be forwarded. This allows a resource manager like Sun Grid Engine to
> suspend a job by sending a SIGTSTP signal to mpirun.)
>
> WHY: Some programs do "mpirun prog.sh", and prog.sh starts multiple
> processes. Among these programs is weather prediction code from
> the UK Met Office. This code is used at multiple sites around
> the world. Since other MPI implementations* forward job control
> signals this way, we risk having OMPI excluded unless we
> implement this feature.
>
> [*I have personally verified that Intel MPI does it. I have
> heard that Scali does it. I don't know about the others.]
>
> HOW: To allow signals to be sent to descendants of launched processes,
> use the setpgrp() system call to create a new process group for
> each launched process. Then send the signal to the process group
> rather than to the process.
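
To illustrate the process-group approach described above, here is a minimal Python sketch. It is not taken from the proposed patch; `launch_in_new_group` and `signal_job` are hypothetical helper names chosen for this example. The child calls `setpgrp()` between fork and exec, so its process group ID equals its own PID, and descendants that do not change groups themselves inherit that group.

```python
import os
import signal
import subprocess


def launch_in_new_group(argv):
    """Start argv as the leader of a new process group (hypothetical helper)."""
    # os.setpgrp runs in the child after fork and before exec, so the
    # launched process becomes the leader of a fresh process group.
    return subprocess.Popen(argv, preexec_fn=os.setpgrp)


def signal_job(proc, signum):
    """Send signum to every process in the launched process's group."""
    # The group ID equals the leader's PID, so killpg reaches the launched
    # process and any descendants that stayed in its group.
    os.killpg(proc.pid, signum)
```

For instance, `job = launch_in_new_group(["sh", "prog.sh"])` followed by `signal_job(job, signal.SIGTSTP)` would deliver the signal to the script and to the background processes it started, rather than to the script alone.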
>
> To allow signals received before process launch time to be
> delivered when the processes are launched, add a job state flag
> to indicate whether the job is suspended. Check this flag at
> launch time, and send a signal immediately after launching.
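
The suspended-flag idea above can be sketched in the same way. This is an assumption-laden illustration, not ORTE code: the `Job` class, its `suspended` attribute, and the method names are all hypothetical. A SIGTSTP that arrives before launch only sets the flag; `launch` then checks the flag and signals the new process group immediately.

```python
import os
import signal
import subprocess


class Job:
    """Hypothetical job record; not an ORTE data structure."""

    def __init__(self, argv):
        self.argv = argv
        self.suspended = False  # job state flag from the description above
        self.proc = None        # set once the process has been launched

    def on_job_control(self, signum):
        """Record a job control signal and forward it if already launched."""
        if signum == signal.SIGTSTP:
            self.suspended = True
        elif signum == signal.SIGCONT:
            self.suspended = False
        if self.proc is not None:
            os.killpg(self.proc.pid, signum)

    def launch(self):
        """Launch in a new process group, honouring an earlier suspend."""
        self.proc = subprocess.Popen(self.argv, preexec_fn=os.setpgrp)
        if self.suspended:
            # The suspend request predates the launch: deliver it now.
            os.killpg(self.proc.pid, signal.SIGTSTP)
        return self.proc
```

There is still a short window between exec and the signal in which the new process runs; the sketch makes no attempt to close it.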
>
> WHERE: http://bitbucket.org/igb/ompi-job-control/
>
> WHEN: We would like to integrate this into the 1.5 branch.
>
> TIMEOUT: COB Tuesday, January 19, 2010.
>
> Q&A:
>
> 1. Will this work for Windows?
>
> I don't know what would be required to make this work for
> Windows. The current implementation is for Unix only.
>
> 2. Will this work for interactive ssh/rsh PLM?
>
> It will not work any better or worse than the current
> implementation. One can suspend a job by typing Ctrl-Z at a
> terminal, but the mpirun process itself never gets suspended.
> That means that in order to wake the job up one has to open a
> different terminal to send a SIGCONT to the mpirun process. It
> would be desirable to fix this problem, but as this feature is
> intended for use with resource managers like SGE it isn't
> essential to make it work smoothly in an interactive shell.
>
> 3. Will the creation of new process groups prohibit SGE from killing
> a job properly?
>
> No. SGE has a mechanism to ensure that all of a job's processes are
> killed, regardless of whether they create new process groups.
>
> 4. What about other resource managers?
>
> Using this flag with another resource manager might cause
> problems. However, the flag may not be necessary with other
> resource managers. (If the RM can send SIGSTOP to all the
> processes on all the nodes running a job, then mpirun doesn't
> need to forward job control signals.)
>
> According to the SLURM documentation, plugins are available
> (e.g., linuxproc) that would allow reliable termination of all of
> a job's processes, regardless of whether they create new process
> groups.
> [https://computing.llnl.gov/linux/slurm/proctrack_plugins.html]
>
> 5. Will the creation of new process groups prevent mpirun from
> shutting down the job successfully (e.g., when it receives a
> SIGTERM)?
>
> No. I have tested jobs both with and without calls to
> MPI_Comm_spawn, and all are properly terminated.
>
> 6. Can we avoid creating new process groups by just signaling the
> launched process plus any process that calls MPI_Init?
>
> No. The shell script might launch other background processes
> that the user wants to suspend. (The Met Office code does this.)
>
> 7. Can we avoid creating new process groups by having mpirun and
> orted send SIGTSTP to their own process groups, and ignore the
> signal that they send to themselves?
>
> No. First, mpirun might be in the same process group as other
> mpirun processes. Those mpiruns could get into an infinite loop
> forwarding SIGTSTPs to one another. Second, although the default
> action on receipt of SIGTSTP is to suspend the process, that only
> happens if the process is not in an orphaned process group. SGE
> starts processes in orphaned process groups.