Open MPI User's Mailing List Archives

From: Reuti (reuti_at_[hidden])
Date: 2007-03-13 11:11:51


On 13.03.2007, at 06:01, Ralph Castain wrote:

> I've been letting this rattle around in my head some more, and *may*
> have come up with an idea of what *might* be going on.
>
> In the GE environment, qsub only launches the daemons - the daemons
> are the ones that actually "launch" your local application processes.
> If qsub -notify uses qsub's knowledge of the processes being executed,
> then it *might* be tempted to send the USR1/2 signals directly to the
> daemons as well as mpirun.

Only the process group containing (jobscript + mpirun + children) on
the head node of the parallel job should get it - the same way SIGSTOP
is delivered. Otherwise, suspending parallel jobs would already be
built into SGE.
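
For illustration only - the layout below is hypothetical demo code, not
Open MPI's or SGE's actual implementation - a minimal C sketch of that
point: a signal delivered to the head-node process group reaches every
member still in the group, while a child that has detached with
setsid() (roughly what an ORTE daemon does when it separates from the
process group) is not touched:

/* Hypothetical sketch: a group-wide SIGUSR2, like the SGE notify
 * signal, hits everything still in the process group but not a child
 * that has called setsid(). */
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

static void on_usr2(int sig)
{
    (void)sig;
    write(STDOUT_FILENO, "got SIGUSR2\n", 12);
}

int main(void)
{
    setpgid(0, 0);              /* own group, so only the demo is signalled */
    signal(SIGUSR2, on_usr2);   /* inherited by the children below */

    pid_t member = fork();
    if (member == 0) {          /* stays in the group, like mpirun or a rank */
        pause();                /* woken by the group-wide SIGUSR2 */
        _exit(0);
    }

    pid_t detached = fork();
    if (detached == 0) {        /* detaches, roughly what an orted does */
        setsid();
        sleep(2);               /* never sees the group-wide SIGUSR2 */
        _exit(0);
    }

    sleep(1);
    kill(-getpgrp(), SIGUSR2);  /* negative pid: signal the whole group */

    waitpid(member, NULL, 0);
    waitpid(detached, NULL, 0);
    return 0;
}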

> In that case, it might be that our daemon's call to separate from the
> process group isn't adequate to break that qsub connection - we may be
> separating from the Linux/Solaris process group, but not from qsub's
> list of executing processes.
>
> IF that is true, then this could cause some strange behavior. I
> honestly have no idea what a USR1/2 signal hitting the daemon would do
> - we don't try to trap that signal in the daemon, so it likely would
> be ignored.

The default action for SIGUSR1/SIGUSR2 is to terminate the process, AFAIK.
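
A small sketch to back that up (assumed demo code, not from the
thread): a child that installs no handler is killed by SIGUSR1, which
is what an orted that does not trap USR1/2 would experience:

/* Hypothetical sketch: with no handler installed, SIGUSR1 terminates
 * the receiving process (its default disposition). */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        sleep(30);              /* no handler installed: default applies */
        _exit(0);
    }

    sleep(1);                   /* let the child reach sleep() */
    kill(child, SIGUSR1);

    int status;
    waitpid(child, &status, 0);
    if (WIFSIGNALED(status))
        printf("child terminated by signal %d (SIGUSR1=%d)\n",
               WTERMSIG(status), SIGUSR1);
    return 0;
}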

> However, it is possible that something unusual could occur (though why
> it would try to spawn another daemon is beyond me).
>
> I can assure you, though, that the daemon really won't like getting a
> STOP or KILL sent directly to it - this definitely would cause
> shutdown issues

They get a KILL for sure, but no STOP.
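
As an aside (a hedged sketch, not Open MPI code): a daemon can register
cleanup handlers for catchable signals such as SIGTERM or SIGUSR1, but
SIGKILL and SIGSTOP can never be caught, so a direct KILL or STOP
necessarily bypasses any cleanup logic:

/* Hypothetical sketch: sigaction() accepts handlers for SIGTERM and
 * SIGUSR1 but refuses them for SIGKILL and SIGSTOP. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static void cleanup(int sig)
{
    (void)sig;                  /* tear down local procs, session files, ... */
}

static void try_install(int sig, const char *name)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = cleanup;
    if (sigaction(sig, &sa, NULL) == 0)
        printf("%s: handler installed, cleanup possible\n", name);
    else
        printf("%s: cannot be caught (%s)\n", name, strerror(errno));
}

int main(void)
{
    try_install(SIGTERM, "SIGTERM");
    try_install(SIGUSR1, "SIGUSR1");
    try_install(SIGKILL, "SIGKILL");  /* fails with EINVAL */
    try_install(SIGSTOP, "SIGSTOP");  /* fails with EINVAL */
    return 0;
}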

Do you have access to an SGE cluster?

-- Reuti

> with respect to cleanup and possibly cause mpirun and/or your
> application to "hang". Again, we don't trap those signals in the
> daemon (only in mpirun itself). When mpirun traps them, it sends an
> "abort" message to the daemons so they can cleanly exit (terminating
> their local procs along the way), thus bringing the system down in an
> orderly fashion.
>
> Again, IF this is happening, then it could be that the application
> processes are getting signals from two sources: (a) as part of the
> daemon's local process group on the node (since the daemon fork/exec's
> the local procs), and (b) propagated via the daemons by comm from
> mpirun. This could cause some interesting race conditions.
>
> Anyway, I think someone more familiar with the peculiarities of qsub
> -notify will have to step in here. If my explanation is correct, then
> we likely have a problem that needs to be addressed for the GE
> environment. Otherwise, there may be something else at work here.
>
> Ralph
>
>
> On 3/12/07 9:42 AM, "Olesen, Mark" <Mark.Olesen_at_[hidden]> wrote:
>
>> I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into
>> interesting behaviour when using the qsub -notify option.
>> With -notify, USR1 and USR2 are sent X seconds before sending STOP
>> and KILL signals, respectively.
>>
>> When the USR2 signal is sent to the process group with the mpirun
>> process, I receive an error message about not being able to start a
>> daemon:
>>
>> mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
>> [dealc12:18212] ERROR: There may be more information available from
>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [dealc12:18212] ERROR: If the problem persists, please restart the
>> [dealc12:18212] ERROR: Grid Engine PE job
>> [dealc12:18212] The daemon received a signal 12.
>> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
>> [dealc12:18212] ERROR: There may be more information available from
>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [dealc12:18212] ERROR: If the problem persists, please restart the
>> [dealc12:18212] ERROR: Grid Engine PE job
>> [dealc12:18212] The daemon received a signal 12.
>>
>> The job eventually stops, but the mpirun process itself continues to
>> live (just the ppid changes).
>>
>> According to orte(1)/Signal Propagation, USR1 and USR2 should be
>> propagated to all processes in the job (which seems to be happening),
>> but why is a daemon start being attempted and the mpirun not being
>> stopped?
>>
>> /mark
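
To illustrate the forwarding behaviour that the orte(1) "Signal
Propagation" section describes, here is a hypothetical sketch of the
general technique - not mpirun's actual implementation: a launcher
traps the notify signal (SIGUSR2 here) and relays it to its child
processes instead of acting on it itself:

/* Hypothetical sketch: a launcher-like parent forwards SIGUSR2 to its
 * children rather than terminating on it. */
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

#define NCHILD 2
static pid_t children[NCHILD];

static void forward(int sig)
{
    for (int i = 0; i < NCHILD; i++)
        if (children[i] > 0)
            kill(children[i], sig);   /* relay, do not die ourselves */
}

static void child_handler(int sig)
{
    (void)sig;
    write(STDOUT_FILENO, "child: notified\n", 16);
}

int main(void)
{
    for (int i = 0; i < NCHILD; i++) {
        children[i] = fork();
        if (children[i] == 0) {
            signal(SIGUSR2, child_handler);
            pause();                  /* wait for the relayed signal */
            _exit(0);
        }
    }

    signal(SIGUSR2, forward);         /* launcher forwards instead of dying */

    sleep(1);
    kill(getpid(), SIGUSR2);          /* simulate qsub -notify hitting the launcher */

    for (int i = 0; i < NCHILD; i++)
        waitpid(children[i], NULL, 0);
    return 0;
}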