Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Signals
From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2008-04-08 13:36:46

First, can your user executable install a signal handler that catches
SIGUSR2 so that it does not exit? By default on Solaris the process will
exit unless you catch the signal and have the handler do nothing.

From signal(3HEAD):
      Name      Value   Default   Event
      SIGUSR1   16      Exit      User Signal 1
      SIGUSR2   17      Exit      User Signal 2

The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh PLM
might cause the processes to exit if the orted (or mpirun, if it's on the
HNP) receives a signal like SIGUSR2; it would start killing all the user
processes on that node once it receives the signal.

I worked around this in the gridengine PLM. Once gridengine_wait_daemon()
receives a SIGUSR1/SIGUSR2 signal, it just acknowledges that a signal was
received and returns, without declaring launch_failed, which would kill
off the user processes. The signals also get passed to the user
processes, which can then decide what to do with them.
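
A rough sketch of that kind of check, assuming the wait callback is handed a waitpid()-style status for the daemon. The helper name should_declare_launch_failed is hypothetical and this is not the actual gridengine PLM code; it only illustrates treating SIGUSR1/SIGUSR2 as "not a launch failure".

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: decide from a waitpid()-style status whether the
 * launch should be declared failed (which would kill the user processes). */
static bool should_declare_launch_failed(int status)
{
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);

        /* SIGUSR1/SIGUSR2 are treated as notifications: acknowledge
         * them and return without flagging a failure. */
        if (sig == SIGUSR1 || sig == SIGUSR2) {
            return false;
        }
        return true;    /* any other terminating signal is a failure */
    }

    if (WIFEXITED(status)) {
        return WEXITSTATUS(status) != 0;    /* nonzero exit is a failure */
    }

    return false;   /* stopped/continued: not a termination */
}
```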

SGE needed this for the job kill and job suspension notifications to work
properly, since it sends SIGUSR1/SIGUSR2 to mpirun. I believe this is
probably what you need in the rsh PLM.

Richard Graham wrote:
> I am running into a situation where I am trying to deliver a signal to the
> mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the
> mpi procs, but then proceeds to kill the children. Is there an easy way
> that I can get around this ? I am using this mechanism in a situation where
> I don't have a debugger, and trying to use this to turn on debugging when I
> hit a hang, so killing the mpi procs is really not what I want to have
> happen.
> Thanks,
> Rich

- Pak Lui