Subject: Re: [OMPI users] Cluster with IB hosts and Ethernet hosts
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-01-23 09:28:15


On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>
>> We have a cluster with 23 nodes connected to an IB switch and 8 nodes
>> connected to an ethernet switch. The master node is also connected to
>> the IB switch. SGE (with tight integration, -pe orte) is used for
>> parallel/serial job submission.
>>
>> Open MPI 1.3 is installed on the master node with IB support
>> (--with-openib=/usr). The same folder is copied to the remaining 23 IB
>> nodes.
>
> Sounds good.
>
>> Now, what should I do for the remaining 8 ethernet nodes?
>> (1) Copy the same (IB-enabled) folder to these nodes.
>> (2) Install Open MPI on one of the 8 ethernet nodes and copy that
>> installation to the other 7 nodes.
>> (3) Install an ethernet-only build of Open MPI on the master node and
>> copy it to the 8 nodes.
>
> Either 1 or 2 is your best bet.
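
For options 1 or 2, a minimal sketch of replicating the install tree,
assuming passwordless ssh and placeholder node names eth1 ... eth8:

# for n in eth1 eth2 eth3 eth4 eth5 eth6 eth7 eth8; do rsync -a /opt/mpi/openmpi/1.3/intel/ $n:/opt/mpi/openmpi/1.3/intel/; done
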
>
> Do you have OFED installed on all nodes (either explicitly, or included in
> your Linux distro)?
No, we do not have OFED installed.
>
> If so, I believe that at least some users with configurations like this
> install OMPI with OFED support (--with-openib=/usr, as you mentioned above)
> on all nodes. OMPI will notice that there is no OpenFabrics-capable
> hardware on the ethernet-only nodes and will simply not use the openib BTL
> plugin.
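
As a hedged aside, BTL selection can also be forced explicitly via MCA
parameters, e.g. to restrict a run on the ethernet-only nodes to the TCP
transport (the hostfile name and binary here are placeholders):

# mpirun --mca btl tcp,self,sm -np 8 -hostfile eth_hosts ./app

Without such a flag, Open MPI auto-selects the fastest usable transport
on each node.
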
>
> Note that OMPI v1.3 got better about being silent about the lack of
> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a
> warning about this).
>
> How you intend to use this setup is up to you; you may want to restrict jobs
> to 100% IB or 100% ethernet via SGE, or you may want to let them mix,
> realizing that the overall parallel job may be slowed down to the speed of
> the slowest network (e.g., ethernet).
>
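
One possible way to enforce such a split under SGE, sketched with
hypothetical host group and queue names (each qconf command opens an
editor in which the host lists are filled in):

# qconf -ahgrp @ibhosts     <- host group for the 23 IB nodes
# qconf -ahgrp @ethhosts    <- host group for the 8 ethernet nodes
# qconf -aq ib.q            <- queue whose hostlist is @ibhosts
# qconf -aq eth.q           <- queue whose hostlist is @ethhosts
$ qsub -q ib.q -pe orte 16 job.sh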

Now I have two basic problems:

(1) Open MPI 1.3 is configured as:
# ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge
--with-openib=/usr | tee config_out

But,

 /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

shows only one component. Is this ok?
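
If it helps to verify what was built, ompi_info can also dump that
component's parameters, e.g.:

# /opt/mpi/openmpi/1.3/intel/bin/ompi_info --param ras gridengine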

(2) Open MPI itself is not working:
ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
A daemon (pid 31343) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
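
Since mpirun launches its daemons over ssh, a first sanity check is that
every host involved is reachable non-interactively; a minimal test,
using the host named in the ssh error above:

# ssh chemfist.iitk.ac.in hostname

If this also times out, the failure is network/ssh reachability rather
than Open MPI itself.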

On two nodes:

# /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 31184) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        ibc0 - daemon did not report back when launched
        ibc1 - daemon did not report back when launched

# cat ih
ibc0
ibc1

The hostfile looks fine, and these IB interfaces can be pinged from the
master node.

# echo $LD_LIBRARY_PATH
/opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6

IB tests are also working fine.
Please help us to resolve this.
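
A hedged diagnostic sketch for the "orted: No such file or directory"
error above: first confirm the install tree really exists on each remote
node, then point mpirun at the remote prefix explicitly (note that an
LD_LIBRARY_PATH set in an interactive shell is not necessarily seen by
non-interactive ssh logins):

# ssh ibc0 ls /opt/mpi/openmpi/1.3/intel/bin/orted
# /opt/mpi/openmpi/1.3/intel/bin/mpirun --prefix /opt/mpi/openmpi/1.3/intel -np 2 -hostfile ih hostname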

> Make sense?
>
> --
> Jeff Squyres
> Cisco Systems