Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Reuti (reuti_at_[hidden])
Date: 2009-03-18 03:45:33


Hi,

It shouldn't be necessary to supply a machinefile, as the one
generated by SGE is taken automatically (i.e. the granted nodes are
honored). Did you submit the job requesting a PE?
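
For example, a minimal submit script along these lines should be enough
(the PE name "orte", the slot count, and the job name below are only
placeholders for whatever is set up on your cluster):

   #!/bin/sh
   #$ -N cpi_test
   #$ -pe orte 2
   #$ -cwd
   # no machinefile needed: mpiexec picks up the hosts granted by SGE
   mpiexec -np $NSLOTS ./a.out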

-- Reuti

On 18.03.2009 at 04:51, Salmon, Rene wrote:

>
> Hi,
>
> I have looked through the list archives and Google but could not
> find anything related to what I am seeing. I am simply trying to
> run the basic cpi.c code using SGE with tight integration.
>
> If I run outside SGE, my jobs run just fine:
> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> Process 0 on hpcp7781
> Process 1 on hpcp7782
> pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> wall clock time = 0.032325
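>
> (For reference, the machinefile "x" is just a plain list of hosts, one
> per line; for the run above it would look something like:
>
>     hpcp7781
>     hpcp7782
>
> with no SGE involvement.)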
>
>
> If I submit to SGE I get this:
>
> [hpcp7781:08527] mca: base: components_open: Looking for plm components
> [hpcp7781:08527] mca: base: components_open: opening plm components
> [hpcp7781:08527] mca: base: components_open: found loaded component rsh
> [hpcp7781:08527] mca: base: components_open: component rsh has no register function
> [hpcp7781:08527] mca: base: components_open: component rsh open function successful
> [hpcp7781:08527] mca: base: components_open: found loaded component slurm
> [hpcp7781:08527] mca: base: components_open: component slurm has no register function
> [hpcp7781:08527] mca: base: components_open: component slurm open function successful
> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
> [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
> [hpcp7781:08527] mca: base: close: component slurm closed
> [hpcp7781:08527] mca: base: close: unloading component slurm
> Starting server daemon at host "hpcp7782"
> error: executing task of job 1702026 failed:
> --------------------------------------------------------------------------
> A daemon (pid 8528) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> [hpcp7781:08527] mca: base: close: component rsh closed
> [hpcp7781:08527] mca: base: close: unloading component rsh
>
>
>
>
> It seems to me that orted is not starting on the remote node. I have
> LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
> orted I see this:
>
> hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
>         libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
>         libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
>         libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
>
>
> It looks like gridengine is using qrsh to start orted on the remote
> nodes, and qrsh might not be reading my shell startup files, so
> LD_LIBRARY_PATH never gets set there.
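>
> As a workaround I could try pointing mpirun at the installation and
> exporting the variable explicitly, something along the lines of (the
> prefix here is just the install path from the ldd output above):
>
>     mpiexec --prefix /bphpc7/vol0/salmr0/ompi -x LD_LIBRARY_PATH a.out
>
> since --prefix and -x should not depend on what qrsh sources on the
> remote side.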
>
> Thanks for any help with this.
>
> Rene
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users