Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Suspend/resume enhancements
From: Iain Bason (Iain.Bason_at_[hidden])
Date: 2010-01-27 09:31:38

Having heard no further comments, I plan to integrate this into the
trunk on Monday.


On Jan 5, 2010, at 6:27 AM, Terry Dontje wrote:

> This only happens when the orte_forward_job_control MCA flag is set
> to 1; the default is 0, which I believe meets Ralph's criteria
> below.
> --td
> Ralph Castain wrote:
>> I don't have any issue with this so long as (a) it is -only- active
>> when someone sets a specific MCA param requesting it, and (b) that
>> flag is -not- set by default.
>> On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:
>>> WHAT: Enhance the orte_forward_job_control MCA flag by:
>>> 1. Forwarding signals to descendants of launched processes; and
>>> 2. Forwarding signals received before process launch time.
>>> (The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT
>>> to be forwarded. This allows a resource manager like Sun Grid Engine
>>> to suspend a job by sending a SIGTSTP signal to mpirun.)
>>> WHY: Some programs use mpirun to launch a shell script, which in
>>> turn starts multiple processes. Among these programs is weather
>>> prediction code from the UK Met Office. This code is used at
>>> multiple sites around the world. Since other MPI implementations*
>>> forward job control signals this way, we risk having OMPI excluded
>>> unless we implement this feature.
>>> [*I have personally verified that Intel MPI does it. I have
>>> heard that Scali does it. I don't know about the others.]
>>> HOW: To allow signals to be sent to descendants of launched
>>> processes, use the setpgrp() system call to create a new process
>>> group for each launched process. Then send the signal to the
>>> process group rather than to the process.
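A minimal sketch of the process-group approach described above, written in Python rather than ORTE's C purely for illustration. os.setpgrp() and os.killpg() wrap the corresponding POSIX calls; the helper names (launch_in_own_group, forward_signal, suspend, resume) are hypothetical and are not Open MPI functions. Error handling is omitted.

```python
import os
import signal
import subprocess


def launch_in_own_group(argv):
    """Start argv as the leader of a new process group (hypothetical helper)."""
    # preexec_fn runs in the child after fork() and before exec(), so the
    # child becomes a group leader whose group id equals its own pid.
    # Descendants inherit that group unless they change it themselves.
    return subprocess.Popen(argv, preexec_fn=os.setpgrp)


def forward_signal(proc, signo):
    """Send signo to every process in proc's group, not just to proc."""
    os.killpg(os.getpgid(proc.pid), signo)


def suspend(proc):
    """Forward a stop request to the whole group."""
    forward_signal(proc, signal.SIGTSTP)


def resume(proc):
    """Forward a continue request to the whole group."""
    forward_signal(proc, signal.SIGCONT)
```

With these helpers, suspend(proc) would reach a launched shell script together with any background children that stayed in its group, and resume(proc) would wake them again.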
>>> To allow signals received before process launch time to be
>>> delivered when the processes are launched, add a job state flag
>>> to indicate whether the job is suspended. Check this flag at
>>> launch time, and send a signal immediately after launching.
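The launch-time check described above can be sketched roughly as follows, again in Python for brevity. The Job class, its suspended attribute, and the method names are assumptions made for this illustration; they do not correspond to actual ORTE data structures. The sketch also assumes each process is launched in its own process group and does no error handling (for example, for processes that have already exited).

```python
import os
import signal
import subprocess


class Job:
    """Hypothetical job record; not an ORTE data structure."""

    def __init__(self):
        self.suspended = False  # the job state flag: is the job suspended?
        self.procs = []

    def on_job_control(self, signo):
        """Record SIGTSTP/SIGCONT and forward it to processes already launched."""
        if signo == signal.SIGTSTP:
            self.suspended = True
        elif signo == signal.SIGCONT:
            self.suspended = False
        for proc in self.procs:
            # Each process leads its own group, so its pid is also the group id.
            os.killpg(proc.pid, signo)

    def launch(self, argv):
        """Launch argv in its own process group, replaying a pending suspend."""
        proc = subprocess.Popen(argv, preexec_fn=os.setpgrp)
        self.procs.append(proc)
        if self.suspended:
            # The stop request arrived before launch time; deliver it now.
            os.killpg(proc.pid, signal.SIGTSTP)
        return proc
```

If on_job_control(signal.SIGTSTP) is called before any process exists, the flag is simply recorded; a later launch() then stops the new process group right after starting it.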
>>> WHERE:
>>> WHEN: We would like to integrate this into the 1.5 branch.
>>> TIMEOUT: COB Tuesday, January 19, 2010.
>>> Q&A:
>>> 1. Will this work for Windows?
>>> I don't know what would be required to make this work for
>>> Windows. The current implementation is for Unix only.
>>> 2. Will this work for interactive ssh/rsh PLM?
>>> It will not work any better or worse than the current
>>> implementation. One can suspend a job by typing Ctrl-Z at a
>>> terminal, but the mpirun process itself never gets suspended.
>>> That means that in order to wake the job up one has to open a
>>> different terminal to send a SIGCONT to the mpirun process. It
>>> would be desirable to fix this problem, but as this feature is
>>> intended for use with resource managers like SGE it isn't
>>> essential to make it work smoothly in an interactive shell.
>>> 3. Will the creation of new process groups prohibit SGE from killing
>>> a job properly?
>>> No. SGE has a mechanism to ensure that all a job's processes are
>>> killed, regardless of whether they create new process groups.
>>> 4. What about other resource managers?
>>> Using this flag with another resource manager might cause
>>> problems. However, the flag may not be necessary with other
>>> resource managers. (If the RM can send SIGSTOP to all the
>>> processes on all the nodes running a job, then mpirun doesn't
>>> need to forward job control signals.)
>>> According to the SLURM documentation, plugins are available
>>> (e.g., linuxproc) that would allow reliable termination of all a
>>> job's processes, regardless of whether they create new process
>>> groups.
>>> 5. Will the creation of new process groups prevent mpirun from
>>> shutting down the job successfully (e.g., when it receives a
>>> No. I have tested jobs both with and without calls to
>>> MPI_Comm_spawn, and all are properly terminated.
>>> 6. Can we avoid creating new process groups by just signaling the
>>> launched process plus any process that calls MPI_Init?
>>> No. The shell script might launch other background processes
>>> that the user wants to suspend. (The Met Office code does this.)
>>> 7. Can we avoid creating new process groups by having mpirun and
>>> orted send SIGTSTP to their own process groups, and ignore the
>>> signal that they send to themselves?
>>> No. First, mpirun might be in the same process group as other
>>> mpirun processes. Those mpiruns could get into an infinite loop
>>> forwarding SIGTSTPs to one another. Second, although the default
>>> action on receipt of SIGTSTP is to suspend the process, that only
>>> happens if the process is not in an orphaned process group. SGE
>>> starts processes in orphaned process groups.
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]