
Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-06-02 08:53:06

Just curious -- what's difficult about this? SIGTSTP and SIGCONT can be
caught; is there something preventing us from sending "stop" and
"continue" messages (just like we send "die" messages)?
(If I had to guess, I think the user is asking because some other MPI
implementations implement this kind of behavior)


        From: devel-bounces_at_[hidden]
[mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph Castain
        Sent: Thursday, June 01, 2006 10:50 PM
        To: Open MPI Developers
        Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
        Actually, there were some implementation issues that might
prevent this from working and were the reason we didn't implement it
right away. We don't actually transmit the SIGTERM - we capture it in
mpirun and then propagate our own "die" command to the remote processes
and daemons. Fortunately, "die" is very easy to implement.
        Unfortunately, "stop" and "continue" are much harder to
implement from inside of a process. We'll have to look at it, but this
may not really be feasible.
        Jeff Squyres (jsquyres) wrote:

                The main reason that it doesn't work is that we
didn't do anything
                to make it work. :-)
                Specifically, mpirun is not intercepting SIGSTOP and
passing it on to
                the remote nodes. There is nothing in the design or
architecture that
                would prevent this, but we just don't do it [yet].

                        -----Original Message-----
                        From: devel-bounces_at_[hidden]
                        [mailto:devel-bounces_at_[hidden]] On Behalf Of
Pak Lui
                        Sent: Thursday, June 01, 2006 5:02 PM
                        To: devel_at_[hidden]
                        Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
                        I have a question on signals. Normally when I do
                        on mpirun, the signal seems to get handled in a
way that it
                        to the orted and processes on the execution
hosts. However,
                        when I send
                        a SIGSTOP to mpirun, mpirun seems to have
stopped, but the
                        processes of
                        the user executable continue to run. I guess I
could hook up the
                        debugger to mpirun and orted to see why they are
handled differently,
                        but I guess I am anxious to hear about it here.
                        I am trying to see the behavior of SIGSTOP and
SIGCONT for the
                        suspension/resumption feature in N1GE. It'll try
to use these
                        signals to
                        stop and continue both mpirun and orted (and its
processes), but the
                        signals (SIGSTOP and SIGCONT) don't seem to get
propagated to
                        the remote
                        I can see there are some issues for implementing
this feature on N1GE
                        because the 'qrsh' interface does not send the
signal to orted on the
                        remote node, but only to 'mpirun'. I am trying
to see how to
                        work around
                        - Pak Lui
                        devel mailing list
