
Open MPI User's Mailing List Archives


Subject: [OMPI users] Fwd: [GE users] Open MPI job fails when run thru SGE
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-01-30 09:02:39


Dear Open MPI,

Do you have a solution for the following problem with Open MPI (1.3)
when it is run through Grid Engine?

I set H_MEMORYLOCKED=infinity in the global execd_params and
restarted sgeexecd on all nodes.
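To confirm whether the new limit actually reached job processes, one can check it from inside an SGE job (a sketch; execd_params only affects processes started by sge_execd, so a login shell will not necessarily show the new value):

```shell
# Print the max locked-memory limit; most shells report it in kilobytes.
# Run this inside an SGE job (e.g. submitted via qsub or qrsh), not in a
# login shell, because execd_params only applies to processes that
# sge_execd spawns.
ulimit -l
```

With H_MEMORYLOCKED=infinity in effect this should print "unlimited"; a small value such as 32 means the libibverbs RLIMIT_MEMLOCK warning below will come back.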

But still the problem persists:

 $ cat err.77.CPMD-OMPI
ssh_exchange_identification: Connection closed by remote host
--------------------------------------------------------------------------
A daemon (pid 31947) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
ssh_exchange_identification: Connection closed by remote host
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
       node-0-19.local - daemon did not report back when launched
       node-0-20.local - daemon did not report back when launched
       node-0-21.local - daemon did not report back when launched
       node-0-22.local - daemon did not report back when launched

The hostnames for the InfiniBand interfaces are ibc0, ibc1, ibc2 .. ibc23.
Maybe Open MPI is not able to identify the hosts, since it is using the
node-0-.. names. Is this causing Open MPI to fail?
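One way to narrow this down is to try a bare ssh to each node named in the error, since "ssh_exchange_identification: Connection closed by remote host" normally comes from sshd itself (e.g. MaxStartups connection throttling, or tcp-wrappers rules in /etc/hosts.deny) rather than from Open MPI. A diagnostic sketch, with the node names taken from the error output above:

```shell
# Try a non-interactive ssh to each node that failed to report back.
# BatchMode avoids hanging on a password prompt; ConnectTimeout keeps
# an unreachable node from stalling the loop.
for h in node-0-19.local node-0-20.local node-0-21.local node-0-22.local
do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true 2>/dev/null
    then
        echo "$h: ssh ok"
    else
        echo "$h: ssh failed"
    fi
done
```

If some of these fail under load but succeed one at a time, sshd connection throttling on the nodes is a likely suspect.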

Thanks,
Sangamesh

On Mon, Jan 26, 2009 at 5:09 PM, mihlon <vaclam1_at_[hidden]> wrote:
> Hi,
>
>> Hello SGE users,
>>
>> The cluster is installed with Rocks-4.3, SGE 6.0 & Open MPI 1.3.
>> Open MPI is configured with "--with-sge".
>> ompi_info shows only one component:
>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>
>> Is this acceptable?
> maybe yes
>
> see: http://www.open-mpi.org/faq/?category=building#build-rte-sge
>
> shell$ ompi_info | grep gridengine
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.3)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.3)
>
> (Specific frameworks and version numbers may vary, depending on your
> version of Open MPI.)
>
>> The Open MPI parallel jobs run successfully from the command line, but
>> fail when run through SGE (with -pe orte <slots>).
>>
>> The error is:
>>
>> $ cat err.26.Helloworld-PRL
>> ssh_exchange_identification: Connection closed by remote host
>> --------------------------------------------------------------------------
>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> But the same job runs fine on a single node, though still with warnings:
>>
>> $ cat err.23.Helloworld-PRL
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>> Local host: node-0-4.local
>> Local device: mthca0
>> --------------------------------------------------------------------------
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>> This will severely limit memory registrations.
>> [node-0-4.local:07869] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / error in device init
>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>> 0 to see all help / error messages
>>
>> The following link explains the same problem:
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=72398
>>
>> With this reference, I put 'ulimit -l unlimited' into
>> /etc/init.d/sgeexecd in all nodes. Restarted the services.
>
> Do not set 'ulimit -l unlimited' in /etc/init.d/sgeexecd;
> set it in SGE itself instead:
>
> Run qconf -mconf and set execd_params
>
>
> frontend$> qconf -sconf
> ...
> execd_params H_MEMORYLOCKED=infinity
> ...
>
>
> Then restart all your sgeexecd hosts.
>
>
> Milan
>
>> But still the problem persists.
>>
>> What could be the way out for this?
>>
>> Thanks,
>> Sangamesh
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99133
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe_at_[hidden]].
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99461
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe_at_[hidden]].
>