
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Jeremy Stout (stout.jeremy_at_[hidden])
Date: 2009-01-24 11:12:30


The RLIMIT error is very common when using Open MPI with OFED and Sun Grid
Engine. You can find more information and several remedies here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

I usually resolve this problem by adding "ulimit -l unlimited" near
the top of the SGE startup script on the compute nodes and then
restarting SGE on every node, so that sge_execd (and every job it
spawns) inherits the raised locked-memory limit.
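For reference, the change looks roughly like the following. Treat the exact
script path and name as assumptions; on many installs the execd init script is
/etc/init.d/sgeexecd, but it varies by distribution and SGE version:

```shell
# Hypothetical excerpt of the SGE execd startup script on a compute node
# (e.g. /etc/init.d/sgeexecd -- exact path/name varies by install).
# Raise the locked-memory limit before sge_execd starts, so every job
# it launches inherits an unlimited RLIMIT_MEMLOCK for InfiniBand
# memory registration.
ulimit -l unlimited

# ... rest of the original startup script follows unchanged ...
```

After editing the script, restart the execd on each node so the running
daemon picks up the new limit; you can then verify from inside a submitted
job with "ulimit -l", which should report "unlimited".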

Jeremy Stout

On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]> wrote:
> Hello all,
>
> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
> support, i.e. configured using --with-sge.
> But the ompi_info shows only one component:
> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> Is this right? I ask because the SGE qmaster daemon was not running
> during the Open MPI installation.
>
> Now the problem is that Open MPI parallel jobs submitted through
> gridengine fail (when run on multiple nodes) with the error:
>
> $ cat err.26.Helloworld-PRL
> ssh_exchange_identification: Connection closed by remote host
> --------------------------------------------------------------------------
> A daemon (pid 8462) died unexpectedly with status 129 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> When the job runs on a single node, it completes and produces the
> expected output, but with an error:
> $ cat err.23.Helloworld-PRL
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host: node-0-4.local
> Local device: mthca0
> --------------------------------------------------------------------------
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> [node-0-4.local:07869] 7 more processes have sent help message
> help-mpi-btl-openib.txt / error in device init
> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
>
> What could be causing this behavior?
>
> Thanks,
> Sangamesh
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>