Open MPI User's Mailing List Archives

Subject: [OMPI users] Openmpi with SGE
From: Neeraj Chourasia (neeraj_ch1_at_[hidden])
Date: 2008-02-19 06:49:17


Hello everyone,

I am facing a problem when calling mpirun in a loop under SGE. My SGE version is SGE6.1AR_snapshot3. The script I am submitting via SGE is:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
let i=0
while [ $i -lt 100 ]
do
        echo "############################################################################################"
        echo "Iteration :$i"
        /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
        let "i+=1"
        echo "############################################################################################"
done
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The above script runs well for 15-20 iterations and then fails with the following message:

-------------------------Error Message-------------------------
error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task
[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
---------------------------------------------------------------

When I ssh to n101, there is no orted or qrsh_starter running. While checking its spool file, I came across the following message:

-------------------------Execd spool Error Message-------------------------
|execd|n101|E|no free queue for job 3869 of user neeraj_at_n199 (localhost = n101)
---------------------------------------------------------------------------

What could be the reason for this? While checking the mailing list, I came across the following link:

        http://www.open-mpi.org/community/lists/users/2007/03/2771.php

but I don't think it's the same problem. Any help is appreciated.

Regards,
Neeraj