
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Fwd: [GE users] Open MPI job fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-01-30 11:50:22


On 30.01.2009, at 15:02, Sangamesh B wrote:

> Dear Open MPI,
>
> Do you have a solution for the following problem with Open MPI (1.3)
> when it is run through Grid Engine?
>
> I changed the global execd_params to H_MEMORYLOCKED=infinity and
> restarted sgeexecd on all nodes.
>
> But still the problem persists:
>
> $cat err.77.CPMD-OMPI
> ssh_exchange_identification: Connection closed by remote host

I think this might already be the reason why it's not working. Does a
simple mpihello program run fine through SGE?
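
For such a test, a minimal submit script might look like the following (the
lib path, slot count and program name are assumptions to adapt; the bin path
and the "orte" PE name are taken from your mails):

#!/bin/bash
#$ -N mpihello-test
#$ -S /bin/bash
#$ -pe orte 8
#$ -cwd
#$ -j y
# Assumed install prefix (the bin path matches the ompi_info path quoted
# further down); adjust both lines to your site.
export PATH=/opt/mpi/openmpi/1.3/intel/bin:$PATH
export LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3/intel/lib:$LD_LIBRARY_PATH
# With SGE support built in, mpirun takes the host list and slot count
# from the parallel environment, so no -np or hostfile is needed.
mpirun ./mpihello

If even this fails with the same ssh_exchange_identification message, the
problem is in starting the daemons on the remote nodes, not in your
application.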

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 31947) died unexpectedly with status 129 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> ssh_exchange_identification: Connection closed by remote host
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node-0-19.local - daemon did not report back when launched
> node-0-20.local - daemon did not report back when launched
> node-0-21.local - daemon did not report back when launched
> node-0-22.local - daemon did not report back when launched
>
> The hostnames for the InfiniBand interfaces are ibc0, ibc1, ibc2 .. ibc23.
> Maybe Open MPI is not able to identify the hosts, as it is using the
> node-0-.. names. Is this causing Open MPI to fail?
>
> Thanks,
> Sangamesh
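
One quick way to see what SGE actually hands to Open MPI is to dump the PE
hostfile and test the connections from inside a job script (a rough sketch,
assuming bash and that the daemons really are started via ssh):

# $PE_HOSTFILE lists the granted hosts; the first column is the hostname.
cat $PE_HOSTFILE
for h in $(awk '{print $1}' $PE_HOSTFILE); do
    # Does the name resolve, and does a non-interactive ssh succeed?
    getent hosts "$h"
    ssh -o BatchMode=yes "$h" true && echo "$h: ssh ok" || echo "$h: ssh FAILED"
done

If the node-0-* names resolve and ssh works non-interactively, the
node-0-*/ibc* naming is probably not the cause; the
ssh_exchange_identification message points at the connection being refused
on the remote side.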
>
>
> On Mon, Jan 26, 2009 at 5:09 PM, mihlon <vaclam1_at_[hidden]> wrote:
>> Hi,
>>
>>> Hello SGE users,
>>>
>>> The cluster is installed with Rocks-4.3, SGE 6.0 & Open MPI 1.3.
>>> Open MPI is configured with "--with-sge".
>>> ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>
>>> Is this acceptable?
>> maybe yes
>>
>> see: http://www.open-mpi.org/faq/?category=building#build-rte-sge
>>
>> shell$ ompi_info | grep gridengine
>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.3)
>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.3)
>>
>> (Specific frameworks and version numbers may vary, depending on your
>> version of Open MPI.)
>>
>>> The Open MPI parallel jobs run successfully from the command line,
>>> but fail when run through SGE (with -pe orte <slots>).
>>>
>>> The error is:
>>>
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> But the same job runs fine when it runs on a single node, though
>>> with an error:
>>>
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: node-0-4.local
>>> Local device: mthca0
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message
>>> help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0
>>> to see all help / error messages
>>>
>>> The following link explains the same problem:
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=72398
>>>
>>> With this reference, I put 'ulimit -l unlimited' into /etc/init.d/sgeexecd
>>> on all nodes and restarted the services.
>>
>> Do not set 'ulimit -l unlimited' in /etc/init.d/sgeexecd;
>> set it in SGE itself instead:
>>
>> Run qconf -mconf and set execd_params; afterwards qconf -sconf should show:
>>
>>
>> frontend$> qconf -sconf
>> ...
>> execd_params H_MEMORYLOCKED=infinity
>> ...
>>
>>
>> Then restart sgeexecd on all your execution hosts.
>>
>>
>> Milan
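
To confirm that the new limit really reaches the job environment, one quick
check (a sketch, using the "orte" PE named in this thread) is to submit a
small job that prints the limit on an execution host:

$ echo 'ulimit -l; hostname' > memlock-check.sh
$ qsub -cwd -S /bin/bash -pe orte 2 -N memlock-check memlock-check.sh
$ cat memlock-check.o*    # should now report "unlimited"

Note that checking with an interactive ssh login does not tell you anything
here, because the execd_params setting only applies to processes started by
SGE.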
>>
>>> But still the problem persists.
>>>
>>> What could be a way out of this?
>>>
>>> Thanks,
>>> Sangamesh
>>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users