Hi Reuti (and others),
> And now the odd thing: the jobscript (with the mpirun) is gone on the
> head node of this parallel job, but all the spawned qrsh processes
> are still there:
I'm glad that someone else can almost reproduce my problem.
On the suspicion that my application was not ignoring USR1/USR2, I added a
signal handler that simply outputs "ignoring SIGUSR*". The shell script now
contains:
trap 'echo script usr1' USR1
trap 'echo script usr2' USR2
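
For anyone who wants to check the traps outside SGE, here is a minimal sketch.
It assumes a POSIX sh; check_traps is a hypothetical helper name, not part of
my job script. It starts a child shell with the same two traps, has that shell
signal itself, and then prints a final line to show that it survived.

```shell
#!/bin/sh
# Hypothetical helper: run a child shell that installs the same traps
# as the job script, sends itself USR1 and USR2, and then reports
# that it is still running.
check_traps() {
    sh -c '
        trap "echo script usr1" USR1
        trap "echo script usr2" USR2
        kill -USR1 $$    # $$ is the PID of this child shell
        kill -USR2 $$
        echo "still alive"
    '
}

check_traps
```

If the final line is printed, the traps are keeping the shell alive, so a
script that still disappears under SGE is probably being killed by something
other than the default USR1/USR2 action.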
> So in the SGE case: usr1 should be caught by the mpirun (and not
> terminate it), which will notify the daemons to stop each ones child
> processes. This would simulate a real suspend, performed by OpenMPI.
Using qmod -sj to suspend the job (sending the usr1 warning signal), I have
the same behaviour as before. Interestingly enough, I get two messages:
mpirun: Forwarding signal 10 to job
The daemon received a signal 10.
After these messages, only the sge-shepherd and mpirun are alive - the job
and qrsh processes are gone. Some time later, the following message also
appears:
mpirun: Forwarding signal 12 to job
after which no processes are left, *except* the mpirun, which I need to
kill by hand.
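
The signal numbers in the mpirun messages can be translated into names with
kill -l. This is only a sketch: signal numbering is platform-specific, and I am
assuming Linux on x86, where 10 and 12 should be USR1 and USR2, i.e. the two
SGE warning signals.

```shell
#!/bin/sh
# Translate the signal numbers from the mpirun messages into names.
# Numbering differs between platforms; Linux on x86 is assumed here.
for signum in 10 12; do
    printf '%s -> %s\n' "$signum" "$(kill -l "$signum")"
done
```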
In case the configuration is a factor, the cluster machines are running with
a stock SuSE 9.2 (Linux 2.6.8-24-smp and/or 2.6.8-24.16-smp).
The openmpi configuration: