What happens if I deliver SIGUSR2 to mpirun? What I observe (for both
ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does
get propagated to the MPI procs, which do invoke the signal handler I
registered, but the job is terminated right after that. However, if I
deliver the signal directly to the MPI procs, the signal handler is invoked,
and the job continues to run.
So, I think that what was intended to happen is the correct thing, but for
some reason it is not happening.
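For reference, the kind of handler described above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual application: the names debug_enabled, toggle_debug and install_debug_toggle are hypothetical, and it assumes a POSIX system with sigaction(). The handler only flips a flag and returns, so a process that receives SIGUSR2 directly keeps running.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag that the application polls to decide whether to
 * emit debug output. sig_atomic_t is safe to write from a handler. */
volatile sig_atomic_t debug_enabled = 0;

/* The handler only flips the flag and returns; it never exits,
 * so the process continues after the signal is delivered. */
void toggle_debug(int signo)
{
    (void)signo;
    debug_enabled = !debug_enabled;
}

/* Install the handler for SIGUSR2, replacing the default action
 * (which terminates the process). Returns 0 on success, -1 on error. */
int install_debug_toggle(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toggle_debug;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGUSR2, &sa, NULL);
}
```

An MPI proc would call install_debug_toggle() once at startup and check debug_enabled in its main loop. Sending the signal with kill -USR2 to the proc's pid then toggles debugging without ending the process.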
On 4/8/08 1:47 PM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
> I found what Pak said a little confusing as the wait_daemon function doesn't
> actually receive a signal itself - it only detects that a proc has exited
> and checks to see if that happened due to a signal. If so, it flags that
> situation and will order the job aborted.
> So if the proc continues alive, the fact that it was hit with SIGUSR2 will
> not be detected by ORTE nor will anything happen - however, if the OS uses
> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
> signal, we will see that proc terminate due to signal and abort the rest of
> the job.
> We could change it if that is what people want - it is trivial to insert
> code to say "kill everything except if it died due to a certain signal".
> <shrug> up to you folks. Current behavior is what you said you wanted a long
> time ago - nothing has changed in this regard for several years.
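The check described above, and the proposed exception for one signal, can be sketched in plain POSIX terms as below. This is not the ORTE code: should_abort_job and reap_child_killed_by are hypothetical names used only for illustration, and the sketch assumes nothing beyond fork(), waitpid() and the standard wait-status macros.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical policy: given a wait status, return 1 if the job should
 * be aborted. A death by signal aborts the job unless the signal is
 * tolerated_signal. Other exits are not considered in this sketch. */
int should_abort_job(int status, int tolerated_signal)
{
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status) != tolerated_signal;
    }
    return 0;
}

/* Fork a child that kills itself with signo (default disposition),
 * reap it, and store its wait status. Returns 0 on success, -1 on error. */
int reap_child_killed_by(int signo, int *status)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        signal(signo, SIG_DFL);  /* make sure the default action applies */
        raise(signo);
        _exit(0);                /* reached only if the signal did not kill us */
    }
    if (waitpid(pid, status, 0) != pid) {
        return -1;
    }
    return 0;
}
```

Note that the parent never receives the signal here. It only learns from the wait status that the child died because of one, which matches the detection path described above.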
> On 4/8/08 11:36 AM, "Pak Lui" <Pak.Lui_at_[hidden]> wrote:
>> First, can your user executable install a signal handler that catches
>> SIGUSR2 so the process does not exit? By default on Solaris it is going
>> to exit, unless you catch the signal and have the process do nothing.
>> From signal(3HEAD):
>>   Name     Value  Default  Event
>>   SIGUSR1  16     Exit     User Signal 1
>>   SIGUSR2  17     Exit     User Signal 2
>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>> might cause the processes to exit if the orted (or mpirun if it's on
>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>> processes on that node once it receives a signal.
>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon()
>> receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and
>> returns, without declaring launch_failed, which would kill off the user
>> processes. The signals also get passed to the user processes, which can
>> then decide what to do with them.
>> SGE needed this for the job kill or job suspension notification to work
>> properly, since they would send a SIGUSR1/2 to mpirun. I believe this is
>> probably what you need in the rsh plm.
>> Richard Graham wrote:
>>> I am running into a situation where I am trying to deliver a signal to the
>>> MPI procs (SIGUSR2). I deliver this to mpirun, which propagates it to the
>>> MPI procs, but then proceeds to kill the children. Is there an easy way
>>> that I can get around this? I am using this mechanism in a situation where
>>> I don't have a debugger, and am trying to use it to turn on debugging when I
>>> hit a hang, so killing the MPI procs is really not what I want to have
>>> devel mailing list