Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Openmpi with SGE
From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2008-02-21 11:48:49


I am not quite sure. Your AR (advance reservation) snapshot3 build is
fairly new, and the problem may be coming from it. I am not familiar
with this new SGE feature. I'd ping the gridengine list to check on
that error message coming from execd.
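
In the meantime, one possible stopgap is to retry a failed mpirun after a
short pause instead of letting the loop fall through. The following is only
a sketch, not a fix for the underlying execd error: the helper name
run_with_retry and the variables RETRY_MAX and RETRY_DELAY are made up for
illustration, and it assumes the failure is transient, which has not been
confirmed.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI or SGE): run a command and
# retry it after a pause if it exits non-zero.
# RETRY_MAX and RETRY_DELAY are illustrative names chosen for this sketch.
run_with_retry() {
    local max=${RETRY_MAX:-3}       # maximum number of attempts
    local delay=${RETRY_DELAY:-10}  # seconds to wait between attempts
    local attempt=1
    local status=0
    while [ "$attempt" -le "$max" ]; do
        status=0
        "$@" || status=$?           # capture the exit status of the command
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt of $max failed with status $status" >&2
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return "$status"                # status of the last failed attempt
}

# Usage inside the loop of the job script (NP and TMP as in the original):
#   run_with_retry /usr/local/openmpi-1.2.4/bin/mpirun -np "$NP" \
#       -hostfile "$TMP/machines" send || break
```

If the retries also fail, the messages on stderr at least record which
iteration and attempt hit the problem, which may be useful when asking on
the gridengine list.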

Neeraj Chourasia wrote:
> Hello everyone,
>
> I am facing a problem when calling mpirun in a loop under SGE. My SGE
> version is SGE6.1AR_snapshot3. The script I am submitting via SGE is
>
> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> let i=0
>
> while [ $i -lt 100 ]
> do
> echo "############################################################################################"
> echo "Iteration :$i"
> /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
> let "i+=1"
> echo "############################################################################################"
> done
> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> The above script runs well for 15-20 iterations and then fails with
> the following message
>
> -------------------------Error Message-------------------------------------------------------------------
> error: executing task of job 3869 failed: execution daemon on host
> "n101" didn't accept task
> [n199:11989] ERROR: A daemon on node n101 failed to start as expected.
> [n199:11989] ERROR: There may be more information available from
> [n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [n199:11989] ERROR: If the problem persists, please restart the
> [n199:11989] ERROR: Grid Engine PE job
> [n199:11989] ERROR: The daemon exited unexpectedly with status 1.
> -----------------------------------------------------------------------------------------------------------
>
> When I ssh to n101, neither orted nor qrsh_starter is running. While
> checking its spool file, I came across the following message
> -----------------------------------------------Execd spool Error Message---------------------------------
> |execd|n101|E|no free queue for job 3869 of user neeraj_at_n199 (localhost = n101)
> -----------------------------------------------------------------------------------------------------------------------
>
> What could be the reason for it?
> While checking the mailing list, I came across the following link
> http://www.open-mpi.org/community/lists/users/2007/03/2771.php
> but I don't think it's the same problem. Any help is appreciated.
>
> Regards
> Neeraj
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
- Pak Lui
pak.lui_at_[hidden]