Open MPI User's Mailing List Archives

From: Reuti (reuti_at_[hidden])
Date: 2007-03-13 08:32:53


On 12.03.2007 at 21:29, Ralph Castain wrote:

>> But now we are going beyond Mark's initial problem.

Back to the initial problem: suspending a parallel job in SGE leads to:

19924  1786 19924 S  \_ sge_shepherd-45250 -bg
19926 19924 19926 Ts |   \_ /bin/sh /var/spool/sge/node39/job_scripts/45250
19927 19926 19926 T  |       \_ mpirun -np 4 /home/reuti/mpihello
19928 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node39 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19934 19928 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 36878 node39 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19929 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node44 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19935 19929 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 55907 node44 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19930 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node41 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19939 19930 19926 T  |           |   \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 59798 node41 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19931 19927 19926 T  |           \_ qrsh -inherit -noshell -nostdin -V node38 /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootpr
19938 19931 19926 T  |               \_ /usr/sge/utilbin/lx24-x86/rsh -n -p 35136 node38 exec '/usr/sge/utilbin/lx24-x86/qrsh_starter' '/var/spo
19932  1786 19932 S  \_ sge_shepherd-45250 -bg
19933 19932 19933 Ss     \_ /usr/sge/utilbin/lx24-x86/rshd -l
19936 19933 19936 S          \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/node39/active_jobs/45250.1/1.node39 noshell
19937 19936 19937 S              \_ /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_st
19940 19937 19937 R                  \_ /home/reuti/mpihello

The job is still running; only the master task is stopped. This is by
design in SGE, and the parallel library is expected to handle the rest on
its own. So I requested the warning signals with -notify in the qsub:

mpirun: Forwarding signal 10 to jobmpirun noticed that job rank 0 with PID 20526 on node node39 exited on signal 10 (User defined signal 1).
[node39:20513] ERROR: A daemon on node node39 failed to start as expected.
[node39:20513] ERROR: There may be more information available from
[node39:20513] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20513] ERROR: If the problem persists, please restart the
[node39:20513] ERROR: Grid Engine PE job
[node39:20513] The daemon received a signal 10.
[node39:20513] ERROR: A daemon on node node42 failed to start as expected.
[node39:20513] ERROR: There may be more information available from
[node39:20513] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20513] ERROR: If the problem persists, please restart the
[node39:20513] ERROR: Grid Engine PE job
[node39:20513] The daemon received a signal 10.

This is what Mark already found. By default, USR1/USR2 terminate the
application, so I added the following to my mpihello.c to ignore the
signal:

    signal(SIGUSR1, SIG_IGN);

(yes, the old-style signal() call should be fine when the signal is only
ignored or left at the default terminate action)
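
For reference, a minimal sketch of what such an mpihello.c could look
like. Only the signal(SIGUSR1, SIG_IGN) line is taken from this thread;
the surrounding skeleton (rank/size output and the sleep that keeps the
ranks alive long enough to suspend them) is an assumption:

/* Hypothetical mpihello.c -- only the signal() call comes from this mail;
 * the rest is an assumed hello-world skeleton. */
#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    signal(SIGUSR1, SIG_IGN);   /* ignore the SGE -notify suspend warning */

    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    sleep(300);                 /* stay alive long enough to be suspended */

    MPI_Finalize();
    return 0;
}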

mpirun: Forwarding signal 10 to job[node39:20765] ERROR: A daemon on node node39 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node38 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node40 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.
[node39:20765] ERROR: A daemon on node node44 failed to start as expected.
[node39:20765] ERROR: There may be more information available from
[node39:20765] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node39:20765] ERROR: If the problem persists, please restart the
[node39:20765] ERROR: Grid Engine PE job
[node39:20765] The daemon received a signal 10.

And now the odd thing: the job script (with the mpirun) is gone on the
head node of this parallel job, but all the spawned qrsh processes are
still there. The job script was:

#!/bin/sh
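# ignore the SGE -notify warning signal so it does not terminate this job script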
trap '' usr1
export PATH=/home/reuti/local/openmpi-1.2rc3/bin:$PATH
mpirun -np $NSLOTS ~/mpihello

20771  1786 20771 S  \_ sge_shepherd-45258 -bg
20772 20771 20772 Ss     \_ /usr/sge/utilbin/lx24-x86/rshd -l
20775 20772 20775 S          \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/node39/active_jobs/45258.1/1.node39 noshell
20776 20775 20776 S              \_ /home/reuti/local/openmpi-1.2rc3/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_st
20778 20776 20776 R                  \_ /home/reuti/mpihello

So, in the SGE case: USR1 should be caught by mpirun (and not terminate
it), which would then notify the daemons to stop their child processes.
This would simulate a real suspend, performed by Open MPI.
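
As a toy illustration of that request (this is not Open MPI code; the
launcher, the single fork()ed child, and the plain SIGSTOP forwarding are
all assumptions), a launcher could react to USR1 roughly like this:

/* Toy sketch only -- NOT Open MPI's implementation. A parent process
 * catches SIGUSR1 and forwards SIGSTOP to its child instead of dying.
 * A real launcher would instead tell its remote daemons to stop their
 * local children. */
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile pid_t child_pid = 0;   /* set before signals are expected */

static void on_usr1(int sig)
{
    (void)sig;
    if (child_pid > 0)
        kill(child_pid, SIGSTOP);      /* kill() is async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_usr1;
    sa.sa_flags = SA_RESTART;          /* keep waitpid() blocking afterwards */
    sigaction(SIGUSR1, &sa, NULL);     /* catch the SGE warning signal */

    child_pid = fork();                /* stand-in for a launched rank */
    if (child_pid == 0) {
        execlp("sleep", "sleep", "600", (char *)NULL);
        _exit(127);
    }

    waitpid(child_pid, NULL, 0);       /* the launcher itself keeps running */
    return 0;
}

Sending USR1 to such a parent leaves it alive and puts the child into the
stopped (T) state, which is the behaviour one would like to see from
mpirun here.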

The same might be done for USR2 as a warning of an upcoming SIGKILL, but
that is not really necessary, as the final kill can also be performed by
SGE.

-- Reuti

>> -- Reuti
>>
>>
>>> avoid seeing any signals from your terminal. When you issue a signal,
>>> mpirun picks it up and forwards it to your application processes via
>>> the ORTE daemons - the ORTE daemons, however, do *not* look at that
>>> signal but just pass it along.
>>>
>>> As for timing, all we do is pass STOP to the OpenMPI application
>>> process - it's up to the local system as to what happens when a
>>> "kill -STOP" is issued. It was always my impression that the system
>>> stopped process execution immediately under that signal, but with
>>> some allowance for the old kernel vs user space issue.
>>>
>>> Once all the processes have terminated, mpirun tells the daemons to go
>>> ahead and exit. That's the only way the daemons get terminated in this
>>> procedure.
>>>
>>> Can you tell us something about your system? Is this running under
>>> Linux, what kind of OS, how was OpenMPI configured, etc?
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>>
>>> On 3/12/07 1:26 PM, "Reuti" <reuti_at_[hidden]> wrote:
>>>
>>>> On 12.03.2007 at 19:55, Ralph Castain wrote:
>>>>
>>>>> I'll have to look into it - I suspect this is simply an erroneous
>>>>> message and that no daemon is actually being started.
>>>>>
>>>>> I'm not entirely sure I understand what's happening, though, in your
>>>>> code. Are you saying that mpirun starts some number of application
>>>>> processes which run merrily along, and then qsub sends out USR1/2
>>>>> signals followed by STOP and then KILL in an effort to abort the job?
>>>>> So the application processes don't normally terminate, but instead
>>>>> are killed via these signals?
>>>>
>>>> If you specify -notify in SGE with the qsub, then jobs are warned by
>>>> the sge_shepherd (the parent of the job) during execution, so that
>>>> they can perform some proper shutdown action before they are really
>>>> stopped/killed:
>>>>
>>>> for suspend: USR1 -wait-defined-time- STOP
>>>> for kill: USR2 -wait-defined-time- KILL
>>>>
>>>> Worth noting: the signals are sent to the complete process group of
>>>> the job created by the job script and mpirun, but not to each daemon
>>>> created by the internal qrsh on any of the slave nodes! This should
>>>> be orte's duty.
>>>>
>>>> Another question: do Open MPI jobs survive a STOP for some time at
>>>> all, or will there be timing issues due to communication timeouts?
>>>>
>>>> HTH - Reuti
>>>>
>>>>
>>>>>
>>>>> Just want to ensure I understand the scenario here as that is
>>>>> something obviously unique to GE.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 3/12/07 9:42 AM, "Olesen, Mark" <Mark.Olesen_at_[hidden]> wrote:
>>>>>
>>>>>> I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into
>>>>>> interesting behaviour when using the qsub -notify option.
>>>>>> With -notify, USR1 and USR2 are sent X seconds before sending STOP
>>>>>> and KILL signals, respectively.
>>>>>>
>>>>>> When the USR2 signal is sent to the process group with the mpirun
>>>>>> process, I receive an error message about not being able to start
>>>>>> a daemon:
>>>>>>
>>>>>> mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
>>>>>> [dealc12:18212] ERROR: There may be more information available from
>>>>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>>>>> [dealc12:18212] ERROR: Grid Engine PE job
>>>>>> [dealc12:18212] The daemon received a signal 12.
>>>>>> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
>>>>>> [dealc12:18212] ERROR: There may be more information available from
>>>>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>>>>> [dealc12:18212] ERROR: Grid Engine PE job
>>>>>> [dealc12:18212] The daemon received a signal 12.
>>>>>>
>>>>>> The job eventually stops, but the mpirun process itself continues
>>>>>> to live (just the ppid changes).
>>>>>>
>>>>>> According to orte(1)/Signal Propagation, USR1 and USR2 should be
>>>>>> propagated to all processes in the job (which seems to be
>>>>>> happening), but why is a daemon start being attempted and the
>>>>>> mpirun not being stopped?
>>>>>>
>>>>>> /mark
>>>>>>