Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads
From: Reuti (reuti_at_[hidden])
Date: 2009-07-07 17:05:07


Hi,

Am 07.07.2009 um 22:12 schrieb Lengyel, Florian:

> Hi,
> I may have overlooked something in the archives (not to mention
> Googling)--if so I apologize, however
> I have been unable to find info on this particular problem.
>
> OpenMPI+SGE tight integration works on E6600 core duo systems but
> not on Q9550 quads.
> Could use some troubleshooting assistance. Thanks.
>
Is this something you found reported somewhere, or your own observation?

I'm not aware of this issue. What could be the cause of it? Do you have
a link - was it on the SGE list?

-- Reuti

>
> I'm running SGE 6.0u10 on a linux cluster running OpenSuse 11.
>
> OpenMPI was compiled with SGE, and the required components are
> present:
>
> [flengyel_at_nept OPENMPI]$ ompi_info | grep gridengine
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>
>
> The parallel execution environment for OpenMPI is as follows:
>
> [flengyel_at_nept OPENMPI]$ qconf -sp ompi
> pe_name ompi
> slots 999
> user_lists Research
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
>
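As a side note on the `allocation_rule $fill_up` setting above: SGE packs the granted slots onto as few hosts as possible, so the same slot request can span a different number of hosts depending on per-host slot counts. A hypothetical illustration (the host counts are assumptions, not taken from the cluster config):

```
# allocation_rule $fill_up, hypothetical request of 4 slots:
#   2-core E6600 nodes (2 slots/host): granted as 2 + 2 -> two hosts
#   4-core Q9550 nodes (4 slots/host): granted as 4     -> a single host
# Note the failing quad.q job below requested only 2 slots yet still
# landed on two hosts, which may hint that quad.q grants fewer slots
# per host than the core count.
```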
> A trivial OpenMPI job using this pe will run on a queue for Intel
> E6600 core duo machines:
>
> [flengyel_at_nept OPENMPI]$ cat sum2.sh
>
> #!/bin/bash
> #$ -S /bin/bash
> #$ -q x86_64.q
> #$ -N sum
> #$ -pe ompi 4
>
> #$ -cwd
>
> export PATH=/home/nept/apps64/openmpi/bin:$PATH
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
> . /usr/local/sge/default/common/settings.sh
> mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum
>
> Here are the results:
>
> [flengyel_at_nept OPENMPI]$ qsub sum2.sh
> Your job 23194 ("sum") has been submitted
>
> [flengyel_at_nept OPENMPI]$ qstat -r -u flengyel
>
> job-ID prior name user state submit/start at
> queue slots ja-task-ID
> -------------------------------------------------------------------------------------------------
> 23194 0.25007 sum flengyel r 07/07/2009 14:14:40 x86_64.q_at_[hidden] 4
> Full jobname: sum
> Master queue: x86_64.q_at_[hidden]
> Requested PE: ompi 4
> Granted PE: ompi 4
> Hard Resources:
> Soft Resources:
> Hard requested queues: x86_64.q
>
>
> [flengyel_at_nept OPENMPI]$ more sum.o23194
>
> The sum from 1 to 1000 is: 500500
> [flengyel_at_nept OPENMPI]$ more sum.e23194
> Starting server daemon at host "m49.gc.cuny.edu"
> Starting server daemon at host "m33.gc.cuny.edu"
> Server daemon successfully started with task id "1.m49"
> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
> Server daemon successfully started with task id "1.m33"
> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
> /usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
> reading exit code from shepherd ...
>
> But the same job with the queue set to quad.q for the Q9550 quad
> core machines
> has daemon trouble:
>
>
> [flengyel_at_nept OPENMPI]$ !qstat
> qstat -r -u flengyel
> job-ID prior name user state submit/start at
> queue slots ja-task-ID
> -------------------------------------------------------------------------------------------------
> 23196 0.25000 sum flengyel r 07/07/2009 14:26:21 quad.q_at_[hidden] 2
> Full jobname: sum
> Master queue: quad.q_at_[hidden]
> Requested PE: ompi 2
> Granted PE: ompi 2
> Hard Resources:
> Soft Resources:
> Hard requested queues: quad.q
> [flengyel_at_nept OPENMPI]$ more sum.e23196
> Starting server daemon at host "m15.gc.cuny.edu"
> Starting server daemon at host "m09.gc.cuny.edu"
> Server daemon successfully started with task id "1.m15"
> Server daemon successfully started with task id "1.m09"
> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
> reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
> reading exit code from shepherd ... 129
> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
> 129
> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
> [flengyel_at_nept OPENMPI]$
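A note on reading the exit codes in the log above: POSIX shells report a process killed by signal N as exit status 128 + N, so the "status 129" reported for the daemon corresponds to SIGHUP (1), while the rsh helper's death on signal 13 (PIPE) would surface as 141. A minimal demonstration of the convention (using SIGTERM for the example; the signal choice is illustrative):

```shell
# Shells encode "killed by signal N" as exit status 128 + N.
sh -c 'kill -15 $$'   # the child terminates itself with SIGTERM (15)
echo $?               # prints 143 = 128 + 15
# By the same rule, "status 129" in the log is 128 + 1 (SIGHUP), and the
# rsh helper's "signal 13 (PIPE)" would map to exit status 141.
```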
>
>
> -FL
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?
> dsForumId=38&dsMessageId=206057
>