Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Suspend/resume enhancements
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2010-01-05 06:27:01

This only happens when the orte_forward_job_control MCA flag is set to 1,
and the default is 0, which I believe meets Ralph's criteria below.


Ralph Castain wrote:
> I don't have any issue with this so long as (a) it is -only- active when someone sets a specific MCA param requesting it, and (b) that flag is -not- set by default.
> On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:
>> WHAT: Enhance the orte_forward_job_control MCA flag by:
>> 1. Forwarding signals to descendants of launched processes; and
>> 2. Forwarding signals received before process launch time.
>> (The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT to
>> be forwarded. This allows a resource manager like Sun Grid Engine to
>> suspend a job by sending a SIGTSTP signal to mpirun.)
>> WHY: Some programs use "mpirun" to launch a shell script that starts
>> multiple processes. Among these programs is weather prediction code from
>> the UK Met Office. This code is used at multiple sites around
>> the world. Since other MPI implementations* forward job control
>> signals this way, we risk having OMPI excluded unless we
>> implement this feature.
>> [*I have personally verified that Intel MPI does it. I have
>> heard that Scali does it. I don't know about the others.]
>> HOW: To allow signals to be sent to descendants of launched processes,
>> use the setpgrp() system call to create a new process group for
>> each launched process. Then send the signal to the process group
>> rather than to the process.
>> To allow signals received before process launch time to be
>> delivered when the processes are launched, add a job state flag
>> to indicate whether the job is suspended. Check this flag at
>> launch time, and send a signal immediately after launching.
>> WHEN: We would like to integrate this into the 1.5 branch.
>> TIMEOUT: COB Tuesday, January 19, 2010.
>> Q&A:
>> 1. Will this work for Windows?
>> I don't know what would be required to make this work for
>> Windows. The current implementation is for Unix only.
>> 2. Will this work for interactive ssh/rsh PLM?
>> It will not work any better or worse than the current
>> implementation. One can suspend a job by typing Ctrl-Z at a
>> terminal, but the mpirun process itself never gets suspended.
>> That means that in order to wake the job up one has to open a
>> different terminal to send a SIGCONT to the mpirun process. It
>> would be desirable to fix this problem, but as this feature is
>> intended for use with resource managers like SGE it isn't
>> essential to make it work smoothly in an interactive shell.
>> 3. Will the creation of new process groups prohibit SGE from killing
>> a job properly?
>> No. SGE has a mechanism to ensure that all a job's processes are
>> killed, regardless of whether they create new process groups.
>> 4. What about other resource managers?
>> Using this flag with another resource manager might cause
>> problems. However, the flag may not be necessary with other
>> resource managers. (If the RM can send SIGSTOP to all the
>> processes on all the nodes running a job, then mpirun doesn't
>> need to forward job control signals.)
>> According to the SLURM documentation, plugins are available
>> (e.g., linuxproc) that would allow reliable termination of all a
>> job's processes, regardless of whether they create new process
>> groups.
>> 5. Will the creation of new process groups prevent mpirun from
>> shutting down the job successfully (e.g., when it receives a
>> No. I have tested jobs both with and without calls to
>> MPI_Comm_spawn, and all are properly terminated.
>> 6. Can we avoid creating new process groups by just signaling the
>> launched process plus any process that calls MPI_Init?
>> No. The shell script might launch other background processes
>> that the user wants to suspend. (The Met Office code does this.)
>> 7. Can we avoid creating new process groups by having mpirun and
>> orted send SIGTSTP to their own process groups, and ignore the
>> signal that they send to themselves?
>> No. First, mpirun might be in the same process group as other
>> mpirun processes. Those mpiruns could get into an infinite loop
>> forwarding SIGTSTPs to one another. Second, although the default
>> action on receipt of SIGTSTP is to suspend the process, that only
>> happens if the process is not in an orphaned process group. SGE
>> starts processes in orphaned process groups.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
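
For readers who want to see the two mechanisms from the HOW section of the RFC side by side, here is a minimal sketch in Python. It is not the ORTE implementation: the names launch_in_new_group, forward_to_group, and Job are hypothetical and exist only for illustration. It assumes a Unix system and uses the standard os and subprocess modules, where os.setpgrp() and os.killpg() wrap the corresponding system calls.

```python
import os
import signal
import subprocess


def launch_in_new_group(argv):
    """Start argv as the leader of its own process group (Unix only)."""
    # preexec_fn runs in the child between fork() and exec(). Calling
    # os.setpgrp() there puts the child in a new process group, and any
    # process it later forks inherits that group unless it changes it.
    return subprocess.Popen(argv, preexec_fn=os.setpgrp)


def forward_to_group(proc, signum):
    """Send signum to every process in the group led by proc."""
    # The child is the group leader, so its pid is also the group id.
    os.killpg(proc.pid, signum)


class Job:
    """Hypothetical stand-in for the launcher's per-job state."""

    def __init__(self):
        self.suspended = False  # the "job state flag" from the RFC
        self.procs = []

    def on_signal(self, signum):
        # Record the job state first, so that a signal arriving before
        # any process has been launched is not lost.
        if signum == signal.SIGTSTP:
            self.suspended = True
        elif signum == signal.SIGCONT:
            self.suspended = False
        for proc in self.procs:
            try:
                forward_to_group(proc, signum)
            except ProcessLookupError:
                pass  # that process group has already exited

    def launch(self, argv):
        proc = launch_in_new_group(argv)
        self.procs.append(proc)
        # If a suspend request arrived before launch time, deliver it
        # immediately after launching.
        if self.suspended:
            forward_to_group(proc, signal.SIGTSTP)
        return proc
```

In this sketch, calling job.on_signal(signal.SIGTSTP) before job.launch(["sh", "-c", "sleep 30 & wait"]) leaves the new process group stopped right after launch, and a later job.on_signal(signal.SIGCONT) resumes the shell together with its background child. Note that Python documents preexec_fn as unsafe in multithreaded programs; it is used here only because it maps directly onto the setpgrp() call named in the RFC.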