
Open MPI User's Mailing List Archives


From: Reuti (reuti_at_[hidden])
Date: 2007-03-12 16:18:14


On 12.03.2007, at 20:36, Ralph Castain wrote:

> ORTE propagates the signal to the application processes, but the ORTE
> daemons never actually look at the signal themselves (looks just like
> a message to them). So I'm a little puzzled by that error message
> about the "daemon received signal 12" - I suspect that's just a
> misleading message that was supposed to indicate that a daemon was
> given a signal to pass on.
>
> Just to clarify: the daemons are moved out of your initial process
> group to

Is this still the case in SGE mode as well? It was the reason why I
never wrote a howto for a Tight Integration under SGE; instead I was
waiting for the final 1.2 with full SGE support.

And: this might be a problem under SGE. I must admit that I haven't
had the time yet to play with the Open MPI 1.2 beta for the Tight
Integration, but it sounds to me like (under Linux) the orte daemons
could survive although the job was already killed (by process group),
as the final STOP/KILL can't be caught and forwarded.
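
To illustrate the mechanism I mean, here is a minimal POSIX sketch (my
own illustration, not ORTE code): a child that calls setsid() leaves
the job's process group, so a signal sent to the whole group - which
is what SGE does on the master node - never reaches it.

/* survive_group_kill.c - minimal POSIX sketch, not ORTE code.
 * The forked child calls setsid() and thereby leaves our process
 * group; a signal sent to the whole group afterwards misses it. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        setsid();          /* new session + process group: detached */
        sleep(3);          /* stand-in for a daemon doing its work */
        printf("detached child survived the group-wide signal\n");
        exit(0);
    }
    sleep(1);                  /* give the child time to detach */
    kill(-getpgrp(), SIGTERM); /* signal our entire process group; */
    return 0;                  /* the parent is killed by its own
                                * SIGTERM, the child lives on */
}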

I'll check this ASAP with the 1.2 beta. I only have access to Linux
clusters.

But now we are going beyond Mark's initial problem.

-- Reuti

> avoid seeing any signals from your terminal. When you issue a signal,
> mpirun picks it up and forwards it to your application processes via
> the ORTE daemons - the ORTE daemons, however, do *not* look at that
> signal but just pass it along.
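
(To make the forwarding concrete: a stripped-down sketch of the
pattern Ralph describes - my own illustration in plain POSIX C, not
mpirun's actual implementation. The launcher catches a signal and
relays it to its children instead of acting on it; sending e.g.
"kill -USR1 <launcher pid>" terminates the children via the default
USR1 action while the launcher survives to reap them.)

/* forwarder.c - simplified illustration of signal forwarding,
 * not mpirun's actual code. */
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCHILD 2
static pid_t children[NCHILD];

static void forward(int sig)
{
    /* don't act on the signal ourselves - just pass it along */
    for (int i = 0; i < NCHILD; i++)
        kill(children[i], sig);
}

int main(void)
{
    for (int i = 0; i < NCHILD; i++) {
        if ((children[i] = fork()) == 0) {
            pause();            /* stand-in for an application process */
            _exit(0);
        }
    }
    signal(SIGUSR1, forward);   /* relay USR1/USR2 to the children */
    signal(SIGUSR2, forward);
    for (int i = 0; i < NCHILD; i++)
        wait(NULL);             /* leave only after all children exited */
    return 0;
}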
>
> As for timing, all we do is pass STOP to the Open MPI application
> process - it's up to the local system as to what happens when a "kill
> -STOP" is issued. It was always my impression that the system stopped
> process execution immediately under that signal, but with some
> allowance for the old kernel vs. user space issue.
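
(And on the "kill -STOP" point: SIGSTOP is indeed handled entirely in
the kernel - a process can neither catch nor ignore it, which a tiny
demo shows; plain POSIX again, nothing Open MPI specific.)

/* sigstop_demo.c - SIGSTOP can be neither caught nor ignored;
 * sigaction() rejects the attempt with EINVAL. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static void handler(int sig) { (void)sig; }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    if (sigaction(SIGSTOP, &sa, NULL) == -1)
        printf("sigaction(SIGSTOP): %s\n", strerror(errno));
    /* prints "sigaction(SIGSTOP): Invalid argument" */
    return 0;
}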
>
> Once all the processes have terminated, mpirun tells the daemons to
> go ahead and exit. That's the only way the daemons get terminated in
> this procedure.
>
> Can you tell us something about your system? Is this running under
> Linux, what kind of OS, how was Open MPI configured, etc.?
>
> Thanks
> Ralph
>
>
>
> On 3/12/07 1:26 PM, "Reuti" <reuti_at_[hidden]> wrote:
>
>> On 12.03.2007, at 19:55, Ralph Castain wrote:
>>
>>> I'll have to look into it - I suspect this is simply an erroneous
>>> message and that no daemon is actually being started.
>>>
>>> I'm not entirely sure I understand what's happening, though, in
>>> your code. Are you saying that mpirun starts some number of
>>> application processes which run merrily along, and then qsub sends
>>> out USR1/2 signals followed by STOP and then KILL in an effort to
>>> abort the job? So the application processes don't normally
>>> terminate, but instead are killed via these signals?
>>
>> If you specify -notify in SGE with qsub, then jobs are warned by the
>> sge_shepherd (the parent of the job) during execution, so that they
>> can perform a proper shutdown before they are really stopped/killed:
>>
>> for suspend: USR1 - wait the defined time - STOP
>> for kill: USR2 - wait the defined time - KILL
>>
>> Worth noting: the signals are sent to the complete process group of
>> the job created by the job script and mpirun, but not to each daemon
>> created by the internal qrsh on the slave nodes! That should be
>> orte's duty.
>>
>> Another question is: do Open MPI jobs survive a STOP for some time
>> at all, or will there be timing issues due to communication
>> timeouts?
>>
>> HTH - Reuti
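
(To make the -notify sequence quoted above concrete: a minimal sketch
of a job that catches USR2 and uses the grace period to shut down
cleanly before the uncatchable KILL arrives. Plain POSIX C; nothing
SGE or Open MPI specific is assumed beyond the signal order described
above.)

/* notify_handler.c - sketch of a job using the -notify grace period:
 * catch USR2, clean up, and exit before the uncatchable KILL. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t shutting_down = 0;

static void on_usr2(int sig)
{
    (void)sig;
    shutting_down = 1;    /* KILL follows once the notify time is up */
}

int main(void)
{
    signal(SIGUSR2, on_usr2);
    while (!shutting_down)
        sleep(1);         /* stand-in for the real computation */
    /* grace period: flush buffers, write a checkpoint, close files */
    fprintf(stderr, "caught USR2, shutting down cleanly\n");
    return 0;
}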
>>
>>
>>>
>>> Just want to ensure I understand the scenario here, as that is
>>> something obviously unique to GE.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On 3/12/07 9:42 AM, "Olesen, Mark" <Mark.Olesen_at_[hidden]>
>>> wrote:
>>>
>>>> I'm testing Open MPI 1.2rc1 with GridEngine 6.0u9 and ran into
>>>> interesting behaviour when using the qsub -notify option. With
>>>> -notify, USR1 and USR2 are sent X seconds before the STOP and KILL
>>>> signals, respectively.
>>>>
>>>> When the USR2 signal is sent to the process group with the mpirun
>>>> process, I receive an error message about not being able to start
>>>> a daemon:
>>>>
>>>> mpirun: Forwarding signal 12 to job
>>>> [dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
>>>> [dealc12:18212] ERROR: There may be more information available from
>>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>>> [dealc12:18212] ERROR: Grid Engine PE job
>>>> [dealc12:18212] The daemon received a signal 12.
>>>> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
>>>> [dealc12:18212] ERROR: There may be more information available from
>>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>>> [dealc12:18212] ERROR: Grid Engine PE job
>>>> [dealc12:18212] The daemon received a signal 12.
>>>>
>>>> The job eventually stops, but the mpirun process itself continues
>>>> to live (just the ppid changes).
>>>>
>>>> According to orte(1), "Signal Propagation", USR1 and USR2 should
>>>> be propagated to all processes in the job (which seems to be
>>>> happening), but why is a daemon start being attempted, and why is
>>>> mpirun not being stopped?
>>>>
>>>> /mark
>>>>