Hi,
I may have overlooked something in the archives or on Google; if so, I apologize, but
I have been unable to find information on this particular problem.

OpenMPI+SGE tight integration works on E6600 Core 2 Duo systems but not on Q9550 quad-core systems.
I could use some troubleshooting assistance. Thanks.

I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.

OpenMPI was compiled with SGE support, and the required components are present:

[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)


The parallel execution environment for OpenMPI is as follows:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

A trivial OpenMPI job using this PE runs correctly on the queue for the Intel E6600 Core 2 Duo machines:

[flengyel@nept OPENMPI]$ cat sum2.sh

#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4

#$ -cwd

export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2  --prefix /home/nept/apps64/openmpi -v  ./sum

Here are the results:

[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted

[flengyel@nept OPENMPI]$ qstat -r -u flengyel

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23194 0.25007 sum        flengyel     r     07/07/2009 14:14:40 x86_64.q@m49.gc.cuny.edu           4      
       Full jobname:     sum
       Master queue:     x86_64.q@m49.gc.cuny.edu
       Requested PE:     ompi 4
       Granted PE:       ompi 4
       Hard Resources: 
       Soft Resources: 
       Hard requested queues: x86_64.q


[flengyel@nept OPENMPI]$ more sum.o23194

The sum from 1 to 1000 is: 500500
[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But when the same job is submitted to quad.q, the queue for the Q9550 quad-core machines,
the daemons fail to start:


[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23196 0.25000 sum        flengyel     r     07/07/2009 14:26:21 quad.q@m09.gc.cuny.edu             2       
       Full jobname:     sum
       Master queue:     quad.q@m09.gc.cuny.edu
       Requested PE:     ompi 2
       Granted PE:       ompi 2
       Hard Resources:  
       Soft Resources:  
       Hard requested queues: quad.q
[flengyel@nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
[flengyel@nept OPENMPI]$
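
For reference, here is a minimal sketch of how the status values in the log above could be read. It assumes the usual shell convention that a status above 128 means 128 plus a signal number; I have not confirmed that the SGE shepherd reports status this way, and decode_status is just a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: interpret an exit status using the common
# shell convention "status > 128 means 128 + signal number".
# Assumption: the status in the log follows this convention.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the name of signal number N
        echo "status $status: 128 + signal $sig ($(kill -l "$sig"))"
    else
        echo "status $status: not a signal-style status"
    fi
}

# The value reported for the failed daemons above
decode_status 129
```

Under that assumption, 129 would correspond to signal 1 (HUP), while the rsh sessions themselves are reported as ending on signal 13 (PIPE).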

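Since the failure appears when the job goes to quad.q rather than x86_64.q, one check that might narrow this down is a diff of the two queue definitions. This is only a sketch: compare_queues and the QCONF override are hypothetical names, and it assumes "qconf -sq <queue>" prints the queue attributes one per line.

```shell
#!/bin/bash
# Hypothetical helper: show which attributes differ between two SGE queues.
# Assumption: "qconf -sq NAME" prints the configuration of queue NAME.
# QCONF can be set to point at a different qconf binary.
compare_queues() {
    qconf_cmd=${QCONF:-qconf}
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    "$qconf_cmd" -sq "$1" > "$a"
    "$qconf_cmd" -sq "$2" > "$b"
    # diff exits with 1 when the configurations differ
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Usage (on a host with SGE admin or submit access):
#   compare_queues x86_64.q quad.q
```

Differences in attributes such as shell, starter_method, or pe_list would be the first things to look at, though that is a guess rather than a diagnosis.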

-FL
