Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] RFC: Suspend/resume enhancements
From: Iain Bason (Iain.Bason_at_[hidden])
Date: 2010-01-27 09:31:38


Having heard no further comments, I plan to integrate this into the
trunk on Monday.

Iain

On Jan 5, 2010, at 6:27 AM, Terry Dontje wrote:

> This only happens when the orte_forward_job_control MCA flag is set
> to 1; the default is 0, which I believe meets Ralph's criteria
> below.
>
> --td
>
> Ralph Castain wrote:
>> I don't have any issue with this so long as (a) it is -only- active
>> when someone sets a specific MCA param requesting it, and (b) that
>> flag is -not- set by default.
>>
>>
>> On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:
>>
>>
>>> WHAT: Enhance the orte_forward_job_control MCA flag by:
>>>
>>> 1. Forwarding signals to descendants of launched processes; and
>>> 2. Forwarding signals received before process launch time.
>>>
>>> (The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT
>>> to be forwarded. This allows a resource manager like Sun Grid
>>> Engine to suspend a job by sending a SIGTSTP signal to mpirun.)
>>>
>>> WHY: Some programs do "mpirun prog.sh", and prog.sh starts multiple
>>> processes. Among these programs is weather prediction code from
>>> the UK Met Office. This code is used at multiple sites around
>>> the world. Since other MPI implementations* forward job control
>>> signals this way, we risk having OMPI excluded unless we
>>> implement this feature.
>>>
>>> [*I have personally verified that Intel MPI does it. I have
>>> heard that Scali does it. I don't know about the others.]
>>>
>>> HOW: To allow signals to be sent to descendants of launched
>>> processes, use the setpgrp() system call to create a new process
>>> group for each launched process. Then send the signal to the
>>> process group rather than to the process.
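
[Illustration, not part of the original message.] The process-group approach described above could be sketched roughly as follows. This is Python rather than the C used in ORTE, and it is not the code from the RFC; `launch_in_new_group` and `forward_to_group` are hypothetical names chosen for the example. It uses setpgid(0, 0), which has the same effect as the argument-less form of setpgrp().

```python
import os
import signal


def launch_in_new_group(argv):
    """Fork, put the child in its own process group, then exec argv.

    Hypothetical helper for illustration; returns the child's PID,
    which is also the ID of the new process group.
    """
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a new process group before exec, so
        # that anything it starts later inherits that group.
        os.setpgid(0, 0)
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed.
            os._exit(127)
    # Parent: make the same call, so the group exists even if the
    # parent sends a signal before the child has run its own setpgid.
    try:
        os.setpgid(pid, pid)
    except OSError:
        pass  # the child has already exec'd or exited
    return pid


def forward_to_group(pgid, signum):
    """Send signum to every process in the group, not just its leader."""
    os.killpg(pgid, signum)
```

With something like this, a launcher that starts "prog.sh" through `launch_in_new_group` can call `forward_to_group(pid, signal.SIGTSTP)` and reach background processes the script started, provided those processes have not moved themselves into another process group.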
>>>
>>> To allow signals received before process launch time to be
>>> delivered when the processes are launched, add a job state flag
>>> to indicate whether the job is suspended. Check this flag at
>>> launch time, and send a signal immediately after launching.
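
[Illustration, not part of the original message.] The "suspended" job state flag described above might behave roughly like this. The `Job` record, `on_job_control`, and `on_launched` are hypothetical names, not ORTE data structures or functions; `send` stands in for whatever delivers a signal to a process group (for example `os.killpg`).

```python
import signal
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical per-job record used only for this sketch."""
    suspended: bool = False                      # True after SIGTSTP, False after SIGCONT
    pgids: list = field(default_factory=list)    # process groups launched so far


def on_job_control(job, signum, send):
    """Record the job's suspend state and forward signum to running groups."""
    if signum == signal.SIGTSTP:
        job.suspended = True
    elif signum == signal.SIGCONT:
        job.suspended = False
    for pgid in job.pgids:
        send(pgid, signum)


def on_launched(job, pgid, send):
    """Register a newly launched group and suspend it if the job is suspended."""
    job.pgids.append(pgid)
    if job.suspended:
        # The SIGTSTP arrived before this process existed, so deliver
        # it now, immediately after launch.
        send(pgid, signal.SIGTSTP)
```

The point of keeping the flag is that a SIGTSTP received while nothing is running yet is not lost: it is replayed to each process group as soon as that group is launched.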
>>>
>>> WHERE: http://bitbucket.org/igb/ompi-job-control/
>>>
>>> WHEN: We would like to integrate this into the 1.5 branch.
>>>
>>> TIMEOUT: COB Tuesday, January 19, 2010.
>>>
>>> Q&A:
>>>
>>> 1. Will this work for Windows?
>>>
>>> I don't know what would be required to make this work for
>>> Windows. The current implementation is for Unix only.
>>>
>>> 2. Will this work for interactive ssh/rsh PLM?
>>>
>>> It will not work any better or worse than the current
>>> implementation. One can suspend a job by typing Ctrl-Z at a
>>> terminal, but the mpirun process itself never gets suspended.
>>> That means that in order to wake the job up one has to open a
>>> different terminal to send a SIGCONT to the mpirun process. It
>>> would be desirable to fix this problem, but as this feature is
>>> intended for use with resource managers like SGE it isn't
>>> essential to make it work smoothly in an interactive shell.
>>>
>>> 3. Will the creation of new process groups prohibit SGE from killing
>>> a job properly?
>>>
>>> No. SGE has a mechanism to ensure that all a job's processes are
>>> killed, regardless of whether they create new process groups.
>>>
>>> 4. What about other resource managers?
>>>
>>> Using this flag with another resource manager might cause
>>> problems. However, the flag may not be necessary with other
>>> resource managers. (If the RM can send SIGSTOP to all the
>>> processes on all the nodes running a job, then mpirun doesn't
>>> need to forward job control signals.)
>>>
>>> According to the SLURM documentation, plugins are available
>>> (e.g., linuxproc) that would allow reliable termination of all a
>>> job's processes, regardless of whether they create new process
>>> groups.
>>> [https://computing.llnl.gov/linux/slurm/proctrack_plugins.html]
>>>
>>> 5. Will the creation of new process groups prevent mpirun from
>>> shutting down the job successfully (e.g., when it receives a
>>> SIGTERM)?
>>>
>>> No. I have tested jobs both with and without calls to
>>> MPI_Comm_spawn, and all are properly terminated.
>>>
>>> 6. Can we avoid creating new process groups by just signaling the
>>> launched process plus any process that calls MPI_Init?
>>>
>>> No. The shell script might launch other background processes
>>> that the user wants to suspend. (The Met Office code does this.)
>>>
>>> 7. Can we avoid creating new process groups by having mpirun and
>>> orted send SIGTSTP to their own process groups, and ignore the
>>> signal that they send to themselves?
>>>
>>> No. First, mpirun might be in the same process group as other
>>> mpirun processes. Those mpiruns could get into an infinite loop
>>> forwarding SIGTSTPs to one another. Second, although the default
>>> action on receipt of SIGTSTP is to suspend the process, that only
>>> happens if the process is not in an orphaned process group. SGE
>>> starts processes in orphaned process groups.
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>