
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Rene Salmon (salmr0_at_[hidden])
Date: 2009-03-18 10:02:02


>
> aha. Did you compile Open MPI 1.3 with the SGE option?
>

Yes I did.

hpcp7781(salmr0)142:ompi_info |grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

>
> > setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
>
> Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's
> known automatically on the nodes.
>

Yes. I also have "setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib"
in my .cshrc; I just wanted to make doubly sure it was there.

I even tried putting "/bphpc7/vol0/salmr0/ompi/lib" in /etc/ld.so.conf
system-wide, just to see if that would help, but got the same results.
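Since qrsh-launched shells are where the variable actually has to be visible, it can help to verify membership explicitly rather than trust the rc files. The snippet below is just an illustration (plain sh syntax, not the csh used above; the lib path is the one from this thread):

```shell
# Check whether the Open MPI lib dir is on LD_LIBRARY_PATH (sh syntax;
# the path is the one from this thread and is only an example).
libdir=/bphpc7/vol0/salmr0/ompi/lib
LD_LIBRARY_PATH="$libdir:$LD_LIBRARY_PATH"
case ":$LD_LIBRARY_PATH:" in
  *":$libdir:"*) echo "lib dir present" ;;
  *)             echo "lib dir missing" ;;
esac
```

Running the same check through qrsh on a remote node (rather than locally) would show whether the setting survives the non-interactive launch.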

> > mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
>
> Do you use --mca... only for debugging or why is it added here?
>

I only put that there for debugging. Is there a different flag I should
use to get more debug info?
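For what it's worth, plm_base_verbose is the usual knob for daemon-launch debugging, and ras_base_verbose additionally traces the allocation read from SGE (both are standard Open MPI MCA parameters; the level, paths, and process count below are just the values from this thread). A sketch of the fuller command, assembled as a string here since it needs the cluster to actually run:

```shell
# Sketch only: the command is not executed here, just built and printed.
cmd="mpirun --mca plm_base_verbose 20 --mca ras_base_verbose 20 \
--prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out"
echo "$cmd"
```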

Thanks
Rene

> -- Reuti
>
>
> >
> > We are using PEs. Here is what the PE looks like:
> >
> > hpcp7781(salmr0)129:qconf -sp pavtest
> > pe_name pavtest
> > slots 16
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args /bin/true
> > stop_proc_args /bin/true
> > allocation_rule 8
> > control_slaves FALSE
> > job_is_first_task FALSE
> > urgency_slots min
> >
> >
> > here is the qsub line to submit the job:
> >
> >>> qsub -pe pavtest 16 simple-job.sh
> >
> >
> > The job seems to run fine with no problems within SGE if I contain
> > the job within one node. As soon as the job has to use more than one
> > node, things stop working with the message I posted about
> > LD_LIBRARY_PATH, and orted seems not to start on the remote nodes.
> >
> > Thanks
> > Rene
> >
> >
> >
> >
> > On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
> >> Hi,
> >>
> >> it shouldn't be necessary to supply a machinefile, as the one
> >> generated by SGE is taken automatically (i.e. the granted nodes are
> >> honored). You submitted the job requesting a PE?
> >>
> >> -- Reuti
> >>
> >>
> >> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
> >>
> >>>
> >>> Hi,
> >>>
> >>> I have looked through the list archives and google but could not
> >>> find anything related to what I am seeing. I am simply trying to
> >>> run the basic cpi.c code using SGE and tight integration.
> >>>
> >>> If run outside SGE I can run my jobs just fine:
> >>> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> >>> Process 0 on hpcp7781
> >>> Process 1 on hpcp7782
> >>> pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> >>> wall clock time = 0.032325
> >>>
> >>>
> >>> If I submit to SGE I get this:
> >>>
> >>> [hpcp7781:08527] mca: base: components_open: Looking for plm components
> >>> [hpcp7781:08527] mca: base: components_open: opening plm components
> >>> [hpcp7781:08527] mca: base: components_open: found loaded component rsh
> >>> [hpcp7781:08527] mca: base: components_open: component rsh has no register function
> >>> [hpcp7781:08527] mca: base: components_open: component rsh open function successful
> >>> [hpcp7781:08527] mca: base: components_open: found loaded component slurm
> >>> [hpcp7781:08527] mca: base: components_open: component slurm has no register function
> >>> [hpcp7781:08527] mca: base: components_open: component slurm open function successful
> >>> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> >>> [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
> >>> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> >>> [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>> [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
> >>> [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> >>> [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
> >>> [hpcp7781:08527] mca: base: close: component slurm closed
> >>> [hpcp7781:08527] mca: base: close: unloading component slurm
> >>> Starting server daemon at host "hpcp7782"
> >>> error: executing task of job 1702026 failed:
> >>>
> >>> --------------------------------------------------------------------------
> >>> A daemon (pid 8528) died unexpectedly with status 1 while attempting
> >>> to launch so we are aborting.
> >>>
> >>> There may be more information reported by the environment (see above).
> >>>
> >>> This may be because the daemon was unable to find all the needed shared
> >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> >>> the location of the shared libraries on the remote nodes and this will
> >>> automatically be forwarded to the remote nodes.
> >>> --------------------------------------------------------------------------
> >>> --------------------------------------------------------------------------
> >>> mpirun noticed that the job aborted, but has no info as to the process
> >>> that caused that situation.
> >>> --------------------------------------------------------------------------
> >>> mpirun: clean termination accomplished
> >>>
> >>> [hpcp7781:08527] mca: base: close: component rsh closed
> >>> [hpcp7781:08527] mca: base: close: unloading component rsh
> >>>
> >>>
> >>>
> >>>
> >>> Seems to me orted is not starting on the remote node. I have
> >>> LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
> >>> orted I see this:
> >>>
> >>> hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> >>>     libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
> >>>     libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
> >>>     libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
> >>>     libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
> >>>     libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
> >>>     libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
> >>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
> >>>     libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
> >>>     /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
> >>>
> >>>
> >>> Looks like gridengine is using qrsh to start orted on the remote
> >>> nodes. qrsh might not be reading my shell startup file and setting
> >>> LD_LIBRARY_PATH.
> >>>
> >>> Thanks for any help with this.
> >>>
> >>> Rene
> >>>
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users