Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-01-24 15:51:13


On 24.01.2009, at 19:23, Sangamesh B wrote:

> Thanks for your suggestion.
>
> In my case that is also not working.
>
> It works when the job runs on a single node, without any errors.
>
> But when it runs on multiple nodes, it gives:
>
> ssh_exchange_identification: Connection closed by remote host

Can you also ssh between the nodes? It seems you are using ssh for Open
MPI and have adjusted SGE to do the same in its qrsh wrapper. The
difference between running mpirun standalone and inside an SGE job is
that a plain mpirun on the headnode makes its connections directly to
the nodes, whereas inside SGE the mpirun runs on one of the compute
nodes and makes the connections to the other nodes from there. Is that
also possible in your setup? Do you really need ssh at all?
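
A quick way to check this (the node names below are only placeholders
for two of your compute nodes) is to try a password-less hop from one
compute node to another, and to see what SGE's qrsh wrapper is
configured to call:

   $ ssh node-0-3 ssh node-0-4 hostname   # should print the hostname without a password prompt
   $ qconf -sconf | grep rsh              # rsh/qrsh command settings (names vary between SGE versions)

If that second hop asks for a password or is refused, an mpirun started
inside an SGE job will fail in just this way.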

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 8462) died unexpectedly with status 129 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> What could be the way out of this?
>
> Thanks,
> Sangamesh
>
> On Sat, Jan 24, 2009 at 9:42 PM, Jeremy Stout
> <stout.jeremy_at_[hidden]> wrote:
>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>> Engine. You can find more information and several remedies here:
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> I usually resolve this problem by adding "ulimit -l unlimited" near
>> the top of the SGE startup script on the computation nodes and
>> restarting SGE on every node.
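>>
>> For example (the exact paths differ between installations, so treat
>> this only as a sketch): on many clusters the execd startup script is
>> something like /etc/init.d/sgeexecd, and the limit can also be raised
>> system-wide on the compute nodes:
>>
>>     # near the top of the SGE execd startup script on each node
>>     ulimit -l unlimited
>>
>>     # or in /etc/security/limits.conf on each compute node
>>     *   soft   memlock   unlimited
>>     *   hard   memlock   unlimited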
>>
>> Jeremy Stout
>>
>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]>
>> wrote:
>>> Hello all,
>>>
>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with support
>>> for SGE, i.e. using --with-sge.
>>> But ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>
>>> Is this right? Because during the Open MPI installation the SGE
>>> qmaster daemon was not working.
>>>
>>> Now the problem is that the Open MPI parallel jobs submitted through
>>> gridengine fail (when run on multiple nodes) with the error:
>>>
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> When the job runs on a single node, it runs well and produces the
>>> output, but with an error:
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: node-0-4.local
>>> Local device: mthca0
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>
>>> What may be the cause of this behavior?
>>>
>>> Thanks,
>>> Sangamesh