
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem w sge 6.2 & openmpi
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2009-08-05 16:52:38


I assume it is working with np=8 because the 8 processes are getting
launched on the same node as mpirun and therefore there is no call to
qrsh to start up any remote processes. When you go beyond 8, mpirun
calls qrsh to start up processes on some of the remote nodes.
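
One way to check whether the qrsh leg itself is healthy, independent of Open MPI, is to exercise it by hand from inside a multi-node parallel job. This is only a sketch; the node name below is a placeholder for one of the nodes SGE actually allocated to your job:

```shell
# From within a running SGE parallel job, imitate what Open MPI's
# rsh/qrsh launcher does for a remote daemon: tight integration uses
# `qrsh -inherit` to start processes on nodes in the job's allocation.
# Replace compute-0-6 with a node listed in $PE_HOSTFILE.
qrsh -inherit compute-0-6 hostname
```

If that command hangs or fails with a qlogin_starter error, the problem is on the SGE side rather than in Open MPI.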

I would suggest first that you replace your MPI program with just
hostname to simplify debugging. Then maybe you can forward along your qsub
script as well as what your PE environment looks like (qconf -sp PE_NAME,
where PE_NAME is the name of your parallel environment).
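
For reference, a minimal debug job script along those lines might look like the following. This is a sketch only: the PE name "orte" and the slot count are placeholders for whatever your site uses (qconf -spl lists the available PEs):

```shell
#!/bin/bash
# Hypothetical SGE job script: launch plain `hostname` instead of the
# MPI application, to separate launch problems from application problems.
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -o hostname-test.out
#$ -pe orte 16    # placeholder PE name and slot count; use your own

/opt/openmpi/bin/mpirun --debug-daemons -np $NSLOTS hostname
```

Submit with qsub; if every allocated node prints its hostname, the launch path is fine and the problem lies elsewhere.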

Rolf

Eli Morris wrote:
> Hi guys,
>
> I'm trying to run an example program, mpi-ring, on a rocks cluster.
> When launched via SGE with 8 processors (we have 8 procs per node),
> the program works fine, but with any more processors the program fails.
> I'm using Open MPI 1.3.2; included below, at the end of the post, is
> the output of ompi_info -all.
>
> Any help with this vexing problem is appreciated.
>
> thanks,
>
> Eli
>
> [emorris_at_nimbus ~/test]$ echo $LD_LIBRARY_PATH
> /opt/openmpi/lib:/lib:/usr/lib:/share/apps/sunstudio/rtlibs
> [emorris_at_nimbus ~/test]$ echo $PATH
> /opt/openmpi/bin:/share/apps/sunstudio/bin:/opt/ncl/bin:/home/tobrien/scripts:/usr/java/latest/bin:/opt/local/grads/bin:/share/apps/openmpilib/bin:/opt/local/ncl/ncl/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/opt/gridengine/bin/lx26-amd64:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin
>
> [emorris_at_nimbus ~/test]$
>
> Here is the mpirun command from the script:
>
> /opt/openmpi/bin/mpirun --debug-daemons --mca plm_base_verbose 40 -mca
> plm_rsh_agent ssh -np $NSLOTS $HOME/test/mpi-ring
>
> Here is the verbose output of a successful program start and failure:
>
>
>
> Success:
>
> [root_at_nimbus test]# more mpi-ring.qsub.o246
> [compute-0-11.local:32126] mca: base: components_open: Looking for plm
> components
> [compute-0-11.local:32126] mca: base: components_open: opening plm
> components
> [compute-0-11.local:32126] mca: base: components_open: found loaded
> component rsh
> [compute-0-11.local:32126] mca: base: components_open: component rsh
> has no register function
> [compute-0-11.local:32126] mca: base: components_open: component rsh
> open function successful
> [compute-0-11.local:32126] mca: base: components_open: found loaded
> component slurm
> [compute-0-11.local:32126] mca: base: components_open: component slurm
> has no register function
> [compute-0-11.local:32126] mca: base: components_open: component slurm
> open function successful
> [compute-0-11.local:32126] mca:base:select: Auto-selecting plm components
> [compute-0-11.local:32126] mca:base:select:( plm) Querying component
> [rsh]
> [compute-0-11.local:32126] [[INVALID],INVALID] plm:rsh: using
> /opt/gridengine/bin/lx26-amd64/qrsh for launching
> [compute-0-11.local:32126] mca:base:select:( plm) Query of component
> [rsh] set priority to 10
> [compute-0-11.local:32126] mca:base:select:( plm) Querying component
> [slurm]
> [compute-0-11.local:32126] mca:base:select:( plm) Skipping component
> [slurm]. Query failed to return a module
> [compute-0-11.local:32126] mca:base:select:( plm) Selected component
> [rsh]
> [compute-0-11.local:32126] mca: base: close: component slurm closed
> [compute-0-11.local:32126] mca: base: close: unloading component slurm
> [compute-0-11.local:32126] [[22715,0],0] node[0].name compute-0-11
> daemon 0 arch ffc91200
> [compute-0-11.local:32126] [[22715,0],0] orted_cmd: received
> add_local_procs
> [compute-0-11.local:32126] [[22715,0],0] orted_recv: received
> sync+nidmap from local proc [[22715,1],1]
> [compute-0-11.local:32126] [[22715,0],0] orted_recv: received
> sync+nidmap from local proc [[22715,1],0]
> [compute-0-11.local:32126] [[22715,0],0] orted_cmd: received
> collective data cmd
> [compute-0-11.local:32126] [[22715,0],0] orted_cmd: received
> collective data cmd
> .
> .
> .
>
> failure:
>
> [root_at_nimbus test]# more mpi-ring.qsub.o244
> [compute-0-14.local:31175] mca:base:select:( plm) Querying component
> [rsh]
> [compute-0-14.local:31175] [[INVALID],INVALID] plm:rsh: using
> /opt/gridengine/bin/lx26-amd64/qrsh for launching
> [compute-0-14.local:31175] mca:base:select:( plm) Query of component
> [rsh] set priority to 10
> [compute-0-14.local:31175] mca:base:select:( plm) Querying component
> [slurm]
> [compute-0-14.local:31175] mca:base:select:( plm) Skipping component
> [slurm]. Query failed to return a module
> [compute-0-14.local:31175] mca:base:select:( plm) Selected component
> [rsh]
> Starting server daemon at host "compute-0-6.local"
> Server daemon successfully started with task id "1.compute-0-6"
> error: error: ending connection before all data received
> error:
> error reading job context from "qlogin_starter"
> --------------------------------------------------------------------------
>
> A daemon (pid 31176) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
>
>
> [...snip...]

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================