
Open MPI User's Mailing List Archives


Subject: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Salmon, Rene (salmr0_at_[hidden])
Date: 2009-03-17 23:51:28


Hi,

I have looked through the list archives and Google but could not find anything related to what I am seeing. I am simply trying to run the basic cpi.c example using SGE with tight integration.
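For reference, the submission script is essentially the standard tight-integration form; the job name, PE name, and slot count below are just placeholders for what is configured on our cluster:

        #!/bin/sh
        #$ -N cpi_test
        #$ -pe orte 2       # parallel environment set up for tight integration (name is a placeholder)
        #$ -cwd
        # under tight integration mpiexec should get the host list from SGE, so no machinefile
        mpiexec -np $NSLOTS ./a.out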

If I run outside of SGE, my jobs run just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325

If I submit through SGE, I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh

It seems to me that orted is not starting on the remote node. I have LD_LIBRARY_PATH set in my shell startup files. If I run ldd on orted, I see this:

hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)

It looks like gridengine is using qrsh to start orted on the remote nodes, and qrsh might not be reading my shell startup files, so LD_LIBRARY_PATH never gets set in the daemon's environment.
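To confirm that, I suppose I can compare what a normal remote login sees with what a command started through SGE's qrsh sees, something along these lines (illustrative; the -l hostname= request is just to pin the job to that node):

        # environment a remote login/ssh shell picks up
        ssh hpcp7782 env | grep LD_LIBRARY_PATH
        # environment a command launched through SGE's qrsh picks up
        /hpc/SGE/bin/lx24-amd64/qrsh -l hostname=hpcp7782 env | grep LD_LIBRARY_PATH

If the qrsh environment really is missing the library path, I am guessing (from the mpirun man page) that I could work around it by forwarding things explicitly, e.g. adding "-x LD_LIBRARY_PATH" and/or "--prefix /bphpc7/vol0/salmr0/ompi" to the mpiexec line, but I would rather understand why tight integration is not handling this on its own.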

Thanks for any help with this.

Rene