Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-03-12 14:55:04


I'll have to look into it - I suspect this is simply an erroneous message
and that no daemon is actually being started.

I'm not entirely sure I understand what's happening in your job, though.
Are you saying that mpirun starts some number of application processes which
run merrily along, and then qsub sends out USR1/2 signals followed by STOP
and then KILL in an effort to abort the job? So the application processes
don't normally terminate, but instead are killed via these signals?

I just want to make sure I understand the scenario, since this behaviour
is specific to Grid Engine (GE).

Thanks
Ralph

On 3/12/07 9:42 AM, "Olesen, Mark" <Mark.Olesen_at_[hidden]> wrote:

> I'm testing Open MPI 1.2rc1 with Grid Engine 6.0u9 and ran into interesting
> behaviour when using the qsub -notify option.
> With -notify, USR1 and USR2 are sent X seconds before sending STOP and KILL
> signals, respectively.
>
> When the USR2 signal is sent to the process group with the mpirun process, I
> receive an error message about not being able to start a daemon:
>
> mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node
> dealc12 failed to start as expected.
> [dealc12:18212] ERROR: There may be more information available from
> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [dealc12:18212] ERROR: If the problem persists, please restart the
> [dealc12:18212] ERROR: Grid Engine PE job
> [dealc12:18212] The daemon received a signal 12.
> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
> [dealc12:18212] ERROR: There may be more information available from
> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [dealc12:18212] ERROR: If the problem persists, please restart the
> [dealc12:18212] ERROR: Grid Engine PE job
> [dealc12:18212] The daemon received a signal 12.
>
> The job eventually stops, but the mpirun process itself continues to live
> (just the ppid changes).
>
> According to orterun(1)/Signal Propagation, USR1 and USR2 should be propagated
> to all processes in the job (which seems to be happening), but why is a
> daemon start being attempted and the mpirun not being stopped?
>
> /mark
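
A note on the "signal 12" in the output above: on Linux (x86) that number is SIGUSR2, which matches the -notify behaviour described, but signal numbers are platform-specific. The small sketch below looks up the symbolic name on whatever system it runs on; signal_name is a hypothetical helper, not part of Open MPI or Grid Engine.

```python
import signal


def signal_name(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known here.
        return f"unknown({number})"


if __name__ == "__main__":
    # 12 is the number reported in the mpirun output.
    print(signal_name(12))
```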