I guess I had it in my head that Josh was already working on most of these
issues anyway for the checkpoint / restart work (i.e., all the quiescing
stuff). Indeed, if you think about it -- pause/resume is one form of
checkpoint/restart. Hence, if the checkpoint/restart frameworks are laid
out right -- and I think they are -- pause/resume may just be a component in
the checkpoint/restart frameworks (there's a little hand-waving going on
here, of course :-), but I'm trusting that Josh will jump in if I have any
heinously incorrect assumptions).
This also brings up another [minor] point -- we don't currently propagate
signals out from mpirun to remote processes (e.g., SIGUSR1). There hasn't
really been a need for this yet, so it's been a pretty low priority.
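
For illustration only, here is a minimal sketch of what relaying a signal such as SIGUSR1 from a launcher to its children could look like. It uses local child processes as a stand-in for remote ones; the names `CHILD_CODE`, `spawn_children`, and `forward_signal` are hypothetical and are not part of Open MPI, which would need a message to the remote daemons rather than a direct `kill`.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child: exits with status 42 when it receives SIGUSR1.
CHILD_CODE = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGUSR1, lambda s, f: sys.exit(42))\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def spawn_children(count):
    """Start `count` children and wait until each has installed its handler."""
    children = []
    for _ in range(count):
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdout=subprocess.PIPE,
            text=True,
        )
        child.stdout.readline()  # blocks until the child prints "ready"
        children.append(child)
    return children


def forward_signal(children, signum):
    """Relay `signum` to every child that is still running."""
    for child in children:
        if child.poll() is None:
            os.kill(child.pid, signum)
```

A launcher would call `forward_signal(children, signal.SIGUSR1)` from its own SIGUSR1 handler; the sketch leaves that wiring out on purpose.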
Sorry for all the confusion, though -- I keyed off the phrase "there were
some implementation issues that might prevent this from working" in your
original e-mail, which I interpreted as "our implementation prohibits
this." :-)
Jeff Squyres (jsquyres) wrote:
Just curious -- what's difficult about this? SIGTSTP and
SIGCONT can be caught; is there something preventing us from
sending "stop" and "continue" messages (just like we send "die"
messages)?
Nothing preventing it at all. The
problem lies in what you do when you receive it. Take the example of a launch
that used orted daemons. We could pass the "stop" or "continue" message to the
orted, which could signal its child processes (i.e., the application processes
on that node) with the appropriate signal. That would stop/continue the child
process just fine - but what about communications that are still in progress?
Bad news.
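
The naive approach described above (the daemon simply signals its children) can be sketched as follows. This is a POSIX-only illustration, not orted code: `start_app`, `stop_child`, and `continue_child` are made-up names, and the child is just a sleeping Python process. Note that nothing here knows anything about messages still in flight, which is exactly the problem.

```python
import os
import signal
import subprocess
import sys


def start_app():
    """Launch a stand-in for an application process (hypothetical)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )


def stop_child(pid):
    """Send SIGSTOP and wait until the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def continue_child(pid):
    """Send SIGCONT and wait until the kernel reports the child as continued."""
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)
```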
So instead you could pass the application process a "stop"
message. The process could then "quiet" the MPI-based messaging system, reply
back to the orted that all is now quiet, and then the orted could send the
appropriate OS-level signal so the process would truly "stop". "Continue" is
much easier, of course - there is no "quieting" to be done, so the orted could
just issue a "continue" signal to its children.
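
The stop handshake in the paragraph above can be modelled roughly like this. It is an in-memory simulation under stated assumptions: `AppProcess` and `Daemon` are hypothetical stand-ins, "draining" is reduced to emptying a list, and the OS-level signal is only recorded rather than sent.

```python
from dataclasses import dataclass, field


@dataclass
class AppProcess:
    """Hypothetical stand-in for one MPI application process."""
    in_flight: list = field(default_factory=list)  # messages not yet completed
    os_state: str = "running"                      # what the OS would see

    def handle_stop_request(self):
        """Quiet the messaging layer, then acknowledge."""
        while self.in_flight:
            self.in_flight.pop(0)  # pretend each pending message completes
        return "quiet"


@dataclass
class Daemon:
    """Hypothetical stand-in for an orted managing its local children."""
    children: list
    log: list = field(default_factory=list)

    def stop_all(self):
        for child in self.children:
            reply = child.handle_stop_request()  # step 1: send "stop" message
            self.log.append(("ack", reply))
            if reply == "quiet":                 # step 2: wait for the reply
                child.os_state = "stopped"       # step 3: OS-level stop
                self.log.append(("signal", "SIGSTOP"))

    def continue_all(self):
        for child in self.children:
            child.os_state = "running"           # nothing to quiet first
            self.log.append(("signal", "SIGCONT"))
```

The asymmetry is visible in the two methods: `stop_all` needs a round trip per child before it may signal, while `continue_all` signals straight away.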
Great - except we still
haven't "stopped" the run-time! What happens if the registry is in the middle
of a notification process (e.g., we hit a stage gate and all the notification
messages are being sent, or someone is in the middle of a put that causes a
set of subscriptions to fire and send out messages - that may in turn cause
additional action on the remote host)? What about messages being routed
through the orteds (once we get the routing system in place)?
Well, we
now could go through a similar process to first "quiet" the run-time itself.
We would have to ensure that every subsystem completed its ongoing operation
and then "stopped". We would of course have to tell all the remote processes
to "stop" first so that new requests would quit coming in, or else this
process would never complete. Note that this means the remote processes would
have to receive and "log" any notifications that come in from the registry
after we tell the process to "stop", but could not take action on those
notices until we "continue" the process.
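
The "log now, act later" requirement for notifications could look something like the sketch below. `DeferredNotifier` is a hypothetical class, not a registry API; it assumes notifications arrive one at a time and that acting on one is just a callback.

```python
from collections import deque


class DeferredNotifier:
    """Acts on notices immediately while running, logs them while stopped."""

    def __init__(self, action):
        self.action = action    # callback invoked for each notice
        self.stopped = False
        self.pending = deque()  # notices received after "stop"

    def stop(self):
        self.stopped = True

    def notify(self, notice):
        if self.stopped:
            self.pending.append(notice)  # receive and log, but do not act
        else:
            self.action(notice)

    def resume(self):
        """Replay logged notices in arrival order, then run normally again."""
        self.stopped = False
        while self.pending:
            self.action(self.pending.popleft())
```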
So now we have the MPI and
run-time layers "quiet". We send a message to the remote orteds indicating
they should go ahead and send their local application processes an OS-level
signal to "stop" so that the OS knows not to spend cycles on them.
Unfortunately, we cannot do the same for the orteds themselves, so that means
that the orteds remain "awake" and operating, but they can just
"spin".
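
Putting the ordering together: application processes go quiet first, then the run-time, and only then do the daemons issue the OS-level stop. A toy coordinator that refuses to skip a phase might look like this; the `Phase` names and `PauseCoordinator` are assumptions made for the example and do not correspond to anything in the code base.

```python
from enum import IntEnum


class Phase(IntEnum):
    RUNNING = 0
    APPS_QUIET = 1      # application processes stop issuing new requests
    RUNTIME_QUIET = 2   # every run-time subsystem finished its current operation
    APPS_SIGNALLED = 3  # daemons sent the OS-level stop to local processes


class PauseCoordinator:
    """Walks the pause sequence strictly in order."""

    def __init__(self):
        self.phase = Phase.RUNNING

    def advance(self, target):
        if target != self.phase + 1:
            raise RuntimeError(
                f"cannot go from {self.phase.name} to {target.name}"
            )
        self.phase = target

    def pause(self):
        for target in (Phase.APPS_QUIET, Phase.RUNTIME_QUIET,
                       Phase.APPS_SIGNALLED):
            self.advance(target)
        return self.phase
```

There is deliberately no phase for stopping the daemons themselves, since they have to stay awake.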
All sounds fine. Now all we have to deal with are: all the race
conditions inherent in what I just described; how to deal with receipt of
asynchronous notifications when we've already been told to stop; the scenarios
where we don't have orted daemons on every node; how to stop/restart major MPI
collectives in mid operation; etc. etc.
Not saying it cannot be done -
just indicating that there were reasons why it wasn't initially done other
than "we just didn't get around to it". :-)
(If I had to guess, I think the user is asking because some other MPI
implementations implement this kind of behavior)
Thanks!
Actually, there
were some implementation issues that might prevent this from working and
were the reason we didn't implement it right away. We don't actually
transmit the SIGTERM - we capture it in mpirun and then propagate our own
"die" command to the remote processes and daemons. Fortunately, "die" is
very easy to implement.
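
As a rough sketch of "capture the signal, send our own command instead": the handler below does not re-raise SIGTERM anywhere, it only enqueues a "die" command for later delivery. The `outbound` queue stands in for whatever channel reaches the remote daemons, and all names are hypothetical.

```python
import queue
import signal

# Hypothetical outbound channel to the remote daemons and processes.
# SimpleQueue is used because its put() is safe to call re-entrantly.
outbound = queue.SimpleQueue()


def on_terminate(signum, frame):
    """Translate the captured signal into a "die" command."""
    outbound.put(("die", signal.Signals(signum).name))


def install_handlers():
    """Capture SIGTERM in the launcher instead of letting it kill us."""
    signal.signal(signal.SIGTERM, on_terminate)
```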
Unfortunately, "stop" and "continue" are
much harder to implement from inside of a process. We'll have to look at
it, but this may not really be feasible.
Ralph
Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is that we didn't do anything
to make it work. :-)
Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes. There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
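
One caveat worth keeping in mind here: on POSIX systems SIGSTOP itself can never be caught by a process, whereas SIGTSTP and SIGCONT can, so any interception scheme would presumably have to hook SIGTSTP. The small check below illustrates the difference; `can_install_handler` is a made-up helper and should be run from the main thread.

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to catch `signum`."""
    def handler(received, frame):
        pass

    try:
        previous = signal.signal(signum, handler)
    except OSError:
        return False  # the OS refuses handlers for this signal
    # Put back whatever was installed before so the probe has no side effects.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True
```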
-----Original Message-----
From: devel-bounces@open-mpi.org
[mailto:devel-bounces@open-mpi.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: devel@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
Hi,
I have a question on signals. Normally when I do a SIGTERM (control-C)
on mpirun, the signal seems to get handled in a way that it broadcasts
to the orted and processes on the execution hosts. However, when I send
a SIGSTOP to mpirun, mpirun seems to have stopped, but the processes of
the user executable continue to run. I guess I could hook up the
debugger to mpirun and orted to see why they are handled differently,
but I guess I am anxious to hear about it here.
I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. It'll try to use these signals to
stop and continue both mpirun and orted (and its processes), but the
signals (SIGSTOP and SIGCONT) don't seem to get propagated to the remote
orted.
I can see there are some issues for implementing this feature on N1GE
because the 'qrsh' interface does not send the signal to orted on the
remote node, but only to 'mpirun'. I am trying to see how to work around
this.
--
Thanks,
- Pak Lui
pak.lui@sun.com
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel