
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
From: Rene Salmon (salmr0_at_[hidden])
Date: 2009-03-18 09:25:14


Hi,

Thanks for the help. I only use the machine file when running outside of
SGE, just to prove that things work there.

When I run within SGE, here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
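
As an aside, one thing I have been meaning to try (just an idea on my
part, not something I have tested yet): mpirun's -x option exports an
environment variable from the submit shell to the launched processes,
instead of relying on the remote startup files. A sketch of the same
job script with it, using the same paths as above:

```csh
#!/bin/csh
#$ -S /bin/csh
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
# -x forwards LD_LIBRARY_PATH from this shell to the remote ranks,
# so orted's environment no longer depends on remote startup files.
mpirun -x LD_LIBRARY_PATH --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
```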

We are using PEs. Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name            pavtest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
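
One thing I am not sure about in this PE (just my speculation, nothing
confirmed yet): my understanding is that tight integration needs SGE's
permission for the parallel library to start tasks on the slave nodes.
A sketch of the change, assuming the rest of pavtest stays the same:

```
# Speculative tweak for tight integration: with control_slaves FALSE,
# SGE refuses "qrsh -inherit" launches on the slave nodes, which would
# keep orted from starting there.
control_slaves     TRUE
```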

Here is the qsub line to submit the job:

>>qsub -pe pavtest 16 simple-job.sh

The job runs fine within SGE as long as it stays on a single node. As
soon as the job has to span more than one node, things stop working
with the LD_LIBRARY_PATH message I posted, and orted does not seem to
start on the remote nodes.
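
My hunch about why that happens (just my reasoning, not verified on
this cluster): a daemon started through a remote launcher gets a
non-interactive shell with a scrubbed environment, so variables set in
my startup files never reach it. A minimal illustration of the effect:

```shell
#!/bin/sh
# env -i starts the child with an empty environment, roughly like a
# remote launcher that does not run the user's startup files.
env -i /bin/sh -c 'echo "LD_LIBRARY_PATH=[$LD_LIBRARY_PATH]"'
# prints: LD_LIBRARY_PATH=[]
```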

Thanks
Rene

On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
> Hi,
>
> it shouldn't be necessary to supply a machinefile, as the one
> generated by SGE is taken automatically (i.e. the granted nodes are
> honored). You submitted the job requesting a PE?
>
> -- Reuti
>
>
> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
>
> >
> > Hi,
> >
> > I have looked through the list archives and google but could not
> > find anything related to what I am seeing. I am simply trying to
> > run the basic cpi.c code using SGE and tight integration.
> >
> > If I run outside SGE, I can run my jobs just fine:
> > hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> > Process 0 on hpcp7781
> > Process 1 on hpcp7782
> > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > wall clock time = 0.032325
> >
> >
> > If I submit to SGE I get this:
> >
> > [hpcp7781:08527] mca: base: components_open: Looking for plm components
> > [hpcp7781:08527] mca: base: components_open: opening plm components
> > [hpcp7781:08527] mca: base: components_open: found loaded component rsh
> > [hpcp7781:08527] mca: base: components_open: component rsh has no register function
> > [hpcp7781:08527] mca: base: components_open: component rsh open function successful
> > [hpcp7781:08527] mca: base: components_open: found loaded component slurm
> > [hpcp7781:08527] mca: base: components_open: component slurm has no register function
> > [hpcp7781:08527] mca: base: components_open: component slurm open function successful
> > [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> > [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
> > [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> > [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
> > [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> > [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
> > [hpcp7781:08527] mca: base: close: component slurm closed
> > [hpcp7781:08527] mca: base: close: unloading component slurm
> > Starting server daemon at host "hpcp7782"
> > error: executing task of job 1702026 failed:
> >
> > --------------------------------------------------------------------------
> > A daemon (pid 8528) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed
> > shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
> > have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> >
> > --------------------------------------------------------------------------
> >
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> >
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> >
> > [hpcp7781:08527] mca: base: close: component rsh closed
> > [hpcp7781:08527] mca: base: close: unloading component rsh
> >
> >
> >
> >
> > Seems to me orted is not starting on the remote node. I have
> > LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
> > orted, I see this:
> >
> > hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> > libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
> > libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
> > libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
> > libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
> > libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
> > libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
> > libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
> > libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
> > /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
> >
> >
> > Looks like gridengine is using qrsh to start orted on the remote
> > nodes. qrsh might not be reading my shell startup file and setting
> > LD_LIBRARY_PATH.
> >
> > Thanks for any help with this.
> >
> > Rene
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>