Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2009-03-18 10:05:43


On 03/18/09 09:52, Reuti wrote:
> Hi,
>
> On 18.03.2009 at 14:25, Rene Salmon wrote:
>
>> Thanks for the help. I only use the machine file to run outside of SGE
>> just to test/prove that things work outside of SGE.
>
> aha. Did you compile Open MPI 1.3 with the SGE option?
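(If you are not sure whether that build has SGE support compiled in, a quick
check against the install prefix from your script is something like:

   /bphpc7/vol0/salmr0/ompi/bin/ompi_info | grep gridengine

If the build was configured with --with-sge you should see a gridengine
component listed there; if nothing comes back, reconfiguring with --with-sge
is the first thing to fix.)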
>
>
>> When I run within SGE, here is what the job script looks like:
>>
>> hpcp7781(salmr0)128:cat simple-job.sh
>> #!/bin/csh
>> #
>> #$ -S /bin/csh
>
> -S will only work if the queue configuration is set to posix_compliant.
> If it's set to unix_behavior, the first line of the script is already
> sufficient.
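(To see which mode a queue is using, something along these lines should work;
"all.q" is just a placeholder for the actual queue name:

   qconf -sq all.q | grep shell_start_mode
)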
>
>
>> setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
>
> Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's known
> automatically on the nodes.
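(A minimal sketch of what that could look like in ~/.cshrc on the nodes,
reusing the lib path from your script; qrsh starts a non-interactive shell,
which reads .cshrc but not .login:

   # prepend the Open MPI lib dir so the remote orted can find libopen-rte/libopen-pal
   if ( $?LD_LIBRARY_PATH ) then
       setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib:${LD_LIBRARY_PATH}
   else
       setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
   endif
)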
>
>> mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
>
> Do you use --mca... only for debugging or why is it added here?
>
> -- Reuti
>
>
>>
>> We are using PEs. Here is what the PE looks like:
>>
>> hpcp7781(salmr0)129:qconf -sp pavtest
>> pe_name pavtest
>> slots 16
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule 8
>> control_slaves FALSE
>> job_is_first_task FALSE
>> urgency_slots min

In this FAQ entry, we show an example of a parallel environment setup:
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

I am wondering if control_slaves needs to be TRUE.
And double check that the PE (pavtest) is in the queue's pe_list
(also mentioned in the FAQ). And perhaps start by trying to run hostname
first.
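As a rough sketch (everything except control_slaves is just carried over from
your output; control_slaves TRUE is what lets mpirun start the remote orteds
through SGE's qrsh -inherit mechanism):

   qconf -mp pavtest      # edit the PE and set:
   control_slaves     TRUE

Then check that pavtest appears in the queue's pe_list, for example (the queue
name here is only a placeholder):

   qconf -sq all.q | grep pe_list

and, as a first test, swap a.out for hostname in the job script:

   mpirun --prefix /bphpc7/vol0/salmr0/ompi -np $NSLOTS hostname

If that prints both node names, the launching side of the tight integration is
working.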

Rolf

>>
>>
>> here is the qsub line to submit the job:
>>
>>>> qsub -pe pavtest 16 simple-job.sh
>>
>>
>> The job seems to run fine with no problems within SGE if I contain the
>> job within one node. As soon as the job has to use more than one node,
>> things stop working with the message I posted about LD_LIBRARY_PATH, and
>> orted seems not to start on the remote nodes.
>>
>> Thanks
>> Rene
>>
>>
>>
>>
>> On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
>>> Hi,
>>>
>>> it shouldn't be necessary to supply a machinefile, as the one
>>> generated by SGE is taken automatically (i.e. the granted nodes are
>>> honored). You submitted the job requesting a PE?
>>>
>>> -- Reuti
>>>
>>>
>>> On 18.03.2009 at 04:51, Salmon, Rene wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I have looked through the list archives and Google but could not
>>>> find anything related to what I am seeing. I am simply trying to
>>>> run the basic cpi.c code using SGE and tight integration.
>>>>
>>>> If I run outside SGE, I can run my jobs just fine:
>>>> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
>>>> Process 0 on hpcp7781
>>>> Process 1 on hpcp7782
>>>> pi is approximately 3.1416009869231241, Error is 0.0000083333333309
>>>> wall clock time = 0.032325
>>>>
>>>>
>>>> If I submit to SGE I get this:
>>>>
>>>> [hpcp7781:08527] mca: base: components_open: Looking for plm components
>>>> [hpcp7781:08527] mca: base: components_open: opening plm components
>>>> [hpcp7781:08527] mca: base: components_open: found loaded component rsh
>>>> [hpcp7781:08527] mca: base: components_open: component rsh has no register function
>>>> [hpcp7781:08527] mca: base: components_open: component rsh open function successful
>>>> [hpcp7781:08527] mca: base: components_open: found loaded component slurm
>>>> [hpcp7781:08527] mca: base: components_open: component slurm has no register function
>>>> [hpcp7781:08527] mca: base: components_open: component slurm open function successful
>>>> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
>>>> [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
>>>> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
>>>> [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
>>>> [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>>> [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
>>>> [hpcp7781:08527] mca: base: close: component slurm closed
>>>> [hpcp7781:08527] mca: base: close: unloading component slurm
>>>> Starting server daemon at host "hpcp7782"
>>>> error: executing task of job 1702026 failed:
>>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 8528) died unexpectedly with status 1 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> [hpcp7781:08527] mca: base: close: component rsh closed
>>>> [hpcp7781:08527] mca: base: close: unloading component rsh
>>>>
>>>>
>>>>
>>>>
>>>> Seems to me orted is not starting on the remote node. I have
>>>> LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
>>>> orted, I see this:
>>>>
>>>> hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
>>>> libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
>>>> libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
>>>> libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
>>>> libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
>>>> libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
>>>> /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
>>>>
>>>>
>>>> Looks like gridengine is using qrsh to start orted on the remote
>>>> nodes. qrsh might not be reading my shell startup file and setting
>>>> LD_LIBRARY_PATH.
>>>>
>>>> Thanks for any help with this.
>>>>
>>>> Rene
>>>>
>>>>

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================