Open MPI User's Mailing List Archives

From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-07-23 15:29:41


Hi Henk,

SLIM H.A. wrote:
> Dear Pak Lui
>
> I can delete the (sge) job with qdel -f such that it disappears from the
> job list but the application processes keep running, including the
> shepherds. I have to kill them with -15
>
> For some reason the kill -15 does not reach mpirun. (We use such a
> parameter to mpirun on our myrinet mx nodes with mpich, that's why I
> asked).

I believe qdel sends a SIGKILL to mpirun rather than a SIGTERM (-15),
which is why you don't see the signal reach mpirun. Since SIGKILL cannot
be caught, that may be why the orted and the user processes keep
running.
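
To illustrate the difference between the two signals, here is a minimal
sketch in plain sh. It is not Open MPI or SGE code: the function name
signal_demo and the marker file are made up for this example. It starts
a child shell that installs a SIGTERM handler, sends it the requested
signal, and reports whether the handler got a chance to run.

```shell
#!/bin/sh
# Hypothetical demo, not part of Open MPI or SGE.
# Usage: signal_demo TERM   or   signal_demo KILL
signal_demo() {
    sig=$1
    marker=$(mktemp)
    rm -f "$marker"          # the handler recreates this file if it runs

    # Child: install a SIGTERM handler that writes the marker, then idle.
    sh -c 'trap "echo caught > \"$1\"; exit 0" TERM
           while :; do sleep 1; done' sh "$marker" >/dev/null 2>&1 &
    pid=$!

    sleep 1                  # give the child time to install its trap
    kill "-$sig" "$pid"
    wait "$pid" 2>/dev/null || true

    # A non-empty marker means the handler ran before the child exited.
    if [ -s "$marker" ]; then
        echo "handled"
    else
        echo "not handled"
    fi
    rm -f "$marker"
}
```

Calling signal_demo TERM should report that the handler ran, while
signal_demo KILL should not, since a process never gets to react to
SIGKILL. The same reasoning applies to orted: it can only clean up the
user processes on a signal it is able to catch.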

Hmm, this actually reminds me of a related problem: the qsub -notify
option does not work as intended under ORTE. The -notify option is
supposed to send a SIGUSR2 to mpirun and the processes as a warning,
N seconds before an impending SIGKILL actually happens. However, we
don't catch the SIGUSR2 signal in ORTE specifically for SGE (or in the
gridengine modules), so the user would see mpirun and orted exit before
the user apps can catch the SIGUSR2 signal. I should file a trac bug
against this SGE feature that we don't yet support, and fix it sometime
in the future.

So, back to your problem. Although this is unintended, maybe you can try
running the job with qsub -notify for the meantime, until we change the
behavior described above. It will send a SIGUSR2 to mpirun, which should
terminate mpirun, orted and the user processes more gracefully than
qdel (or SIGKILL) does, because SIGKILL would not allow orted to kill
off the user processes, as SIGTERM or SIGUSR1/2 would.
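
If you want the job script itself to react to the warning, something
along these lines might work. This is only a sketch under assumptions I
have not verified on a cluster: that the shell running the job script
receives the SIGUSR2 that -notify produces, and that passing a SIGTERM
on to mpirun is what you want. The function and variable names are
invented for this example.

```shell
#!/bin/sh
#$ -notify
# Sketch only: forward a catchable SIGTERM to the launcher when the
# job script receives SIGUSR2 (assumed to arrive ahead of SIGKILL).
notify_child=
notify_seen=0

run_with_notify_forwarding() {
    "$@" &                       # start the launcher in the background
    notify_child=$!

    # On SIGUSR2, remember it and pass SIGTERM on to the launcher.
    trap 'notify_seen=1; kill -TERM "$notify_child" 2>/dev/null || true' USR2

    status=0
    wait "$notify_child" || status=$?

    # A trapped signal interrupts wait, so wait once more to collect
    # the launcher's real exit status.
    if [ "$notify_seen" -eq 1 ] && [ "$status" -gt 128 ]; then
        status=0
        wait "$notify_child" || status=$?
    fi

    trap - USR2
    return "$status"
}

# Intended use in the job script (mpirun line from the original post):
#   run_with_notify_forwarding mpirun -np "$NSLOTS" ./exe
```

The launcher is run in the background because a shell only runs its
trap handlers once the foreground command has finished, so a plain
foreground mpirun would delay the handler until it is too late.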

>
> Just to confirm, there is no configure directive specific to gridengine
> when building openmpi?

Right, there aren't any configure directives currently.

>
> Thanks
>
> henk
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden]
>> [mailto:users-bounces_at_[hidden]] On Behalf Of Pak Lui
>> Sent: 23 July 2007 15:16
>> To: Open MPI Users
>> Subject: Re: [OMPI users] sge qdel fails
>>
>> Hi Henk,
>>
>> The sge script should not require any extra parameter. The
>> qdel command should send the kill signal to mpirun and also
>> remove the SGE allocated tmp directory (in something like
>> /tmp/174.1.all.q/) which contains the OMPI session dir for
>> the running job, which in turn would cause orted and the user
>> processes to exit.
>>
>> Maybe you could try qdel -f <jid> to force delete from the
>> sge_qmaster, in case when sge_execd does not respond to the
>> delete request by the sge_qmaster?
>>
>> SLIM H.A. wrote:
>>> I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2),
>>> following the recommendation in the OpenMPI FAQ
>>>
>>> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
>>>
>>> The job runs but when the user wants to delete the job with
>> the qdel
>>> command, this fails. Does the mpirun command
>>>
>>> mpirun -np $NSLOTS ./exe
>>>
>>> in the sge script require extra parameters?
>>>
>>> Thanks for any advice
>>>
>>> Henk
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>>
>> - Pak Lui
>> pak.lui_at_[hidden]
>>
>

-- 
- Pak Lui
pak.lui_at_[hidden]