Hmmm...well, I'll take a look. I haven't seen that behavior, but I haven't
checked it in some time.
On 4/8/08 11:54 AM, "Richard Graham" <rlgraham_at_[hidden]> wrote:
> What happens if I deliver SIGUSR2 to mpirun? What I observe (for both
> ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does
> get propagated to the mpi procs, which do invoke the signal handler I
> registered, but the job is terminated right after that. However, if I
> deliver the signal directly to the mpi procs, the signal handler is invoked,
> and the job continues to run.
> So, I think that what was intended to happen is the correct thing, but for
> some reason it is not happening.
> On 4/8/08 1:47 PM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>> I found what Pak said a little confusing as the wait_daemon function doesn't
>> actually receive a signal itself - it only detects that a proc has exited
>> and checks to see if that happened due to a signal. If so, it flags that
>> situation and will order the job aborted.
>> So if the proc stays alive, the fact that it was hit with SIGUSR2 will
>> not be detected by ORTE nor will anything happen - however, if the OS uses
>> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
>> signal, we will see that proc terminate due to signal and abort the rest of
>> the job.
>> We could change it if that is what people want - it is trivial to insert
>> code to say "kill everything except if it died due to a certain signal".
>> <shrug> up to you folks. Current behavior is what you said you wanted a long
>> time ago - nothing has changed in this regard for several years.
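A rough sketch of the policy described above, in Python rather than ORTE's C, and not the actual ORTE code: the parent only learns about a signal when a child has died from it, and the "kill everything except if it died due to a certain signal" variant is just a set lookup. The names TOLERATED_SIGNALS, classify_exit and kill_and_classify are made up for illustration.

```python
import signal
import subprocess
import sys

# Hypothetical policy: deaths caused by these signals do not abort the job.
TOLERATED_SIGNALS = {signal.SIGUSR1, signal.SIGUSR2}

def classify_exit(returncode):
    """Classify a child's exit the way a launcher might decide what to do next."""
    if returncode < 0:
        # subprocess reports "terminated by signal N" as returncode == -N
        sig = -returncode
        return "tolerated" if sig in TOLERATED_SIGNALS else "abort"
    return "exited"

def kill_and_classify(sig):
    """Start a sleeping child, terminate it with `sig`, and classify the result."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    child.send_signal(sig)  # the child has no handler, so the default action ends it
    return classify_exit(child.wait())
```

A child that catches the signal and keeps running never reaches classify_exit at all, which matches the point above that a surviving proc is invisible to this check.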
>> On 4/8/08 11:36 AM, "Pak Lui" <Pak.Lui_at_[hidden]> wrote:
>>> First, can your user executable install a signal handler that catches
>>> SIGUSR2 so it does not exit? By default on Solaris it is going to exit,
>>> unless you catch the signal and have the process do nothing.
>>> From signal(3HEAD):
>>>   Name     Value  Default  Event
>>>   SIGUSR1  16     Exit     User Signal 1
>>>   SIGUSR2  17     Exit     User Signal 2
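As a minimal sketch of catching SIGUSR2 so that the default "Exit" action does not apply (shown in Python for brevity; a C application would do the equivalent with sigaction): the handler only flips a flag, and the process keeps running. The debug_enabled flag and the on_sigusr2 name are assumptions for illustration, not anything from Open MPI.

```python
import os
import signal

debug_enabled = False  # hypothetical flag the application would consult

def on_sigusr2(signum, frame):
    """Toggle debugging instead of letting the default action end the process."""
    global debug_enabled
    debug_enabled = not debug_enabled

# Replace the default disposition (terminate) with our handler.
signal.signal(signal.SIGUSR2, on_sigusr2)

# Deliver the signal to ourselves, as an external `kill -USR2 <pid>` would.
os.kill(os.getpid(), signal.SIGUSR2)
```

Keeping the handler down to a flag flip means the real debugging work happens in normal program flow rather than inside the handler.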
>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>>> might cause the processes to exit if the orted (or mpirun if it's on
>>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>>> processes on that node once it receives a signal.
>>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon()
>>> sees a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and
>>> returns, without declaring launch_failed, which would kill off the user
>>> processes. The signals also get passed to the user processes, which can
>>> then decide what to do with them.
>>> SGE needed this for the job kill or job suspension notification to work
>>> properly, since it sends a SIGUSR1/2 to mpirun. I believe this is
>>> probably what you need in the rsh plm.
>>> Richard Graham wrote:
>>>> I am running into a situation where I am trying to deliver a signal to the
>>>> mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the
>>>> mpi procs, but then proceeds to kill the children. Is there an easy way
>>>> that I can get around this? I am using this mechanism in a situation where
>>>> I don't have a debugger, and am trying to use this to turn on debugging when I
>>>> hit a hang, so killing the mpi procs is really not what I want to have
>>>> devel mailing list