
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Reuti (reuti_at_[hidden])
Date: 2009-03-18 09:52:00


Hi,

Am 18.03.2009 um 14:25 schrieb Rene Salmon:

> Thanks for the help. I only use the machine file to run outside of
> SGE
> just to test/prove that things work outside of SGE.

Aha. Did you compile Open MPI 1.3 with the SGE option?
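(For reference: with Open MPI 1.3 the gridengine support is no longer built in by default, so it has to be requested at configure time. A build sketch, using the prefix from your job script; adjust paths as needed:)

```shell
# Sketch: building Open MPI 1.3 with SGE support.
# The prefix below is taken from the job script; adjust as needed.
./configure --prefix=/bphpc7/vol0/salmr0/ompi --with-sge
make all install
```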

> When I run within SGE, here is what the job script looks like:
>
> hpcp7781(salmr0)128:cat simple-job.sh
> #!/bin/csh
> #
> #$ -S /bin/csh

-S will only be honored if the queue's shell_start_mode is set to
posix_compliant. If it's set to unix_behavior, the first line of the
script is already sufficient.
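You can check this in the queue configuration (a command sketch; the queue name all.q is a placeholder):

```shell
# Sketch: show the queue's shell_start_mode (queue name is a placeholder).
qconf -sq all.q | grep shell_start_mode
# posix_compliant => -S from the job script is honored
# unix_behavior   => the #!-line of the script is used
```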

> setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib

Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's
known automatically on the nodes.
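For example (a sketch for ~/.cshrc, using the path from your job script; the guard avoids a csh error when the variable is unset):

```csh
# Sketch: make the Open MPI libraries known to non-interactive shells, too.
if ( $?LD_LIBRARY_PATH ) then
    setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib:${LD_LIBRARY_PATH}
else
    setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
endif
```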

> mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np
> 16 /bphpc7/vol0/salmr0/SGE/a.out

Do you use --mca plm_base_verbose only for debugging, or why is it added here?
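Besides that, you can verify whether your installation was built with gridengine support at all (a command sketch; with SGE support compiled in, a gridengine component should show up in the output):

```shell
# Sketch: check for compiled-in gridengine support.
ompi_info | grep gridengine
```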

-- Reuti

>
> We are using PEs. Here is what the PE looks like:
>
> hpcp7781(salmr0)129:qconf -sp pavtest
> pe_name pavtest
> slots 16
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule 8
> control_slaves FALSE
> job_is_first_task FALSE
> urgency_slots min
>
>
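One remark on the PE (assuming it is meant for a tight integration): with control_slaves FALSE, SGE will reject the qrsh -inherit calls that Open MPI uses to start orted on the slave nodes, which would match the "executing task of job ... failed" error below. A PE for tight integration would look more like this sketch:

```
pe_name            pavtest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```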
> Here is the qsub line to submit the job:
>
>>> qsub -pe pavtest 16 simple-job.sh
>
>
> The job seems to run fine with no problems within SGE if I contain
> the job within one node. As soon as the job has to use more than one
> node, things stop working with the message I posted about
> LD_LIBRARY_PATH, and orted seems not to start on the remote nodes.
>
> Thanks
> Rene
>
>
>
>
> On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
>> Hi,
>>
>> it shouldn't be necessary to supply a machinefile, as the one
>> generated by SGE is taken automatically (i.e. the granted nodes are
>> honored). You submitted the job requesting a PE?
>>
>> -- Reuti
>>
>>
>> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
>>
>>>
>>> Hi,
>>>
>>> I have looked through the list archives and google but could not
>>> find anything related to what I am seeing. I am simply trying to
>>> run the basic cpi.c code using SGE and tight integration.
>>>
>>> If I run outside SGE, I can run my jobs just fine:
>>> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
>>> Process 0 on hpcp7781
>>> Process 1 on hpcp7782
>>> pi is approximately 3.1416009869231241, Error is 0.0000083333333309
>>> wall clock time = 0.032325
>>>
>>>
>>> If I submit to SGE I get this:
>>>
>>> [hpcp7781:08527] mca: base: components_open: Looking for plm
>>> components
>>> [hpcp7781:08527] mca: base: components_open: opening plm components
>>> [hpcp7781:08527] mca: base: components_open: found loaded component
>>> rsh
>>> [hpcp7781:08527] mca: base: components_open: component rsh has no
>>> register function
>>> [hpcp7781:08527] mca: base: components_open: component rsh open
>>> function successful
>>> [hpcp7781:08527] mca: base: components_open: found loaded component
>>> slurm
>>> [hpcp7781:08527] mca: base: components_open: component slurm has no
>>> register function
>>> [hpcp7781:08527] mca: base: components_open: component slurm open
>>> function successful
>>> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
>>> [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
>>> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/
>>> lx24-amd64/qrsh for launching
>>> [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh]
>>> set priority to 10
>>> [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
>>> [hpcp7781:08527] mca:base:select:( plm) Skipping component
>>> [slurm]. Query failed to return a module
>>> [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
>>> [hpcp7781:08527] mca: base: close: component slurm closed
>>> [hpcp7781:08527] mca: base: close: unloading component slurm
>>> Starting server daemon at host "hpcp7782"
>>> error: executing task of job 1702026 failed:
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8528) died unexpectedly with status 1 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed
>>> shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>> have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> [hpcp7781:08527] mca: base: close: component rsh closed
>>> [hpcp7781:08527] mca: base: close: unloading component rsh
>>>
>>>
>>>
>>>
>>> Seems to me orted is not starting on the remote node. I have
>>> LD_LIBRARY_PATH set on my shell startup files. If I do an ldd on
>>> orted, I see this:
>>>
>>> hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
>>> libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-
>>> rte.so.0 (0x00002ac5b14e2000)
>>> libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-
>>> pal.so.0 (0x00002ac5b1628000)
>>> libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
>>> libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
>>> libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
>>> libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
>>> /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
>>>
>>>
>>> Looks like gridengine is using qrsh to start orted on the remote
>>> nodes. qrsh might not be reading my shell startup file and setting
>>> LD_LIBRARY_PATH.
>>>
>>> Thanks for any help with this.
>>>
>>> Rene
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>