Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-01-24 15:52:01


On 24.01.2009, at 17:12, Jeremy Stout wrote:

> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
> Engine. You can find more information and several remedies here:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> I usually resolve this problem by adding "ulimit -l unlimited" near
> the top of the SGE startup script on the computation nodes and
> restarting SGE on every node.
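For illustration, a minimal sketch of that change (the init script path and name are assumptions and vary with the SGE version/cell; adjust to your installation):

    # near the top of the execd startup script on each compute node,
    # e.g. /etc/init.d/sgeexecd:
    ulimit -l unlimited

    # then restart the execution daemon on every node:
    /etc/init.d/sgeexecd stop
    /etc/init.d/sgeexecd start

Raising the limit in the startup script means sge_execd itself, and therefore every job process it spawns, inherits the unlimited locked-memory limit.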

Did you request/set any limits with SGE's h_vmem/h_stack resource
request?
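For example (the queue name and job id below are just placeholders), the limits in effect for a job can be inspected with:

    # resources requested for a particular job:
    qstat -j <jobid> | grep resource_list

    # queue-level h_vmem/h_stack settings:
    qconf -sq all.q | egrep 'h_vmem|h_stack'

A restrictive h_vmem or h_stack set there is enforced on the job's processes in addition to any system-wide ulimits.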

-- Reuti

> Jeremy Stout
>
> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]>
> wrote:
>> Hello all,
>>
>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>> support, i.e. built using --with-sge.
>> But ompi_info shows only one gridengine component:
>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>
>> Is this right? During the Open MPI installation the SGE qmaster
>> daemon was not running.
>>
>> Now the problem is that Open MPI parallel jobs submitted through
>> gridengine fail (when run on multiple nodes) with this error:
>>
>> $ cat err.26.Helloworld-PRL
>> ssh_exchange_identification: Connection closed by remote host
>> --------------------------------------------------------------------------
>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> When the job runs on a single node, it runs fine and produces the
>> expected output, but with this error:
>> $ cat err.23.Helloworld-PRL
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: node-0-4.local
>> Local device: mthca0
>> --------------------------------------------------------------------------
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> [node-0-4.local:07869] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / error in device init
>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
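An MCA parameter like this can also be passed directly on the mpirun command line, e.g. (process count and program name are placeholders):

    mpirun --mca orte_base_help_aggregate 0 -np 8 ./hello

so that every process's help/error message is shown instead of being aggregated.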
>>
>> What could be causing this behavior?
>>
>> Thanks,
>> Sangamesh