curious -- what's difficult about this? SIGTSTP and SIGCONT can be caught;
is there something preventing us from sending "stop" and "continue"
messages (just like we send "die" messages)?
(If I had to guess, I think the user is asking because some other MPI
implementations implement this kind of behavior.)
Actually, there were some implementation issues that might prevent
this from working and were the reason we didn't implement it right away. We
don't actually transmit the SIGTERM - we capture it in mpirun and then
propagate our own "die" command to the remote processes and daemons.
Fortunately, "die" is very easy to implement.
Unfortunately, "stop" and
"continue" are much harder to implement from inside of a process. We'll have
to look at it, but this may not really be
Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is that we didn't do anything
to make it work. :-)
Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes. There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
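
To illustrate the capture-and-forward idea discussed above, here is a minimal sketch, not Open MPI code. It assumes a POSIX system and a launcher that holds one pipe per daemon. The `Launcher` class, the pipe transport, and the "stop"/"continue" command strings are all hypothetical. Note that SIGSTOP itself cannot be caught by a process, so the sketch handles the catchable SIGTSTP instead.

```python
import os
import signal

# Hypothetical command names. "die" is the term used in this thread;
# "stop" and "continue" are assumed counterparts, not real Open MPI messages.
COMMANDS = {
    signal.SIGTERM: "die",
    signal.SIGTSTP: "stop",      # SIGSTOP cannot be caught, SIGTSTP can
    signal.SIGCONT: "continue",
}


class Launcher:
    """Stand-in for a launcher such as mpirun (illustrative only)."""

    def __init__(self):
        # One write end of a pipe per simulated remote daemon.
        self.daemon_fds = []

    def add_daemon(self):
        """Create a pipe for a new daemon and return its read end."""
        read_fd, write_fd = os.pipe()
        self.daemon_fds.append(write_fd)
        return read_fd

    def handle_signal(self, signum, frame=None):
        """Translate a caught signal into a command and send it to every daemon."""
        command = COMMANDS.get(signum)
        if command is None:
            return
        for fd in self.daemon_fds:
            os.write(fd, (command + "\n").encode())

    def install(self):
        """Register handle_signal for every signal in COMMANDS."""
        for signum in COMMANDS:
            signal.signal(signum, self.handle_signal)
```

In this sketch the launcher only forwards the command. A real launcher would presumably also have to suspend itself after forwarding "stop", and each daemon would have to act on the message for its local processes, which is the harder part.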
[mailto:firstname.lastname@example.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
I have a question on signals. Normally when I send a SIGTERM
to mpirun, the signal seems to get handled in a way that it is propagated
to the orted and processes on the execution hosts. However,
when I send a SIGSTOP to mpirun, mpirun seems to have stopped, but the
user executable continues to run. I guess I could hook up the
debugger to mpirun and orted to see why they are handled differently,
but I guess I am anxious to hear about it here.
I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. It'll try to use these signals to
stop and continue both mpirun and orted (and its processes), but the
signals (SIGSTOP and SIGCONT) don't seem to get propagated to
I can see there are some issues for implementing this feature on N1GE
because the 'qrsh' interface does not send the signal to orted on the
remote node, but only to 'mpirun'. I am trying to see how to
- Pak Lui
devel mailing list