Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Cluster with IB hosts and Ethernet hosts
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-01-23 10:38:07


Any solution for the following problem?

On Fri, Jan 23, 2009 at 7:58 PM, Sangamesh B <forum.san_at_[hidden]> wrote:
> On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>>
>>> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
>>> connected to an ethernet switch. The master node is also connected to the IB
>>> switch. SGE (with tight integration, -pe orte) is used for
>>> parallel/serial job submission.
>>>
>>> Open MPI-1.3 is installed on master node with IB support
>>> (--with-openib=/usr). The same folder is copied to the remaining 23 IB
>>> nodes.
>>
>> Sounds good.
>>
>>> Now what shall I do for the remaining 8 ethernet nodes:
>>> (1) Copy the same folder (IB) to these nodes
>>> (2) Install Open MPI on one of the 8 ethernet nodes. Copy the
>>> same to the other 7 nodes.
>>> (3) Install an ethernet version of Open MPI on the master node and copy to 8
>>> nodes.
>>
>> Either 1 or 2 is your best bet.
>>
>> Do you have OFED installed on all nodes (either explicitly, or included in
>> your Linux distro)?
> No
>>
>> If so, I believe that at least some users with configurations like this
>> install OMPI with OFED support (--with-openib=/usr, as you mentioned above)
>> on all nodes. OMPI will notice that there is no OpenFabrics-capable
>> hardware on the ethernet-only nodes and will simply not use the openib BTL
>> plugin.
>>
>> Note that OMPI v1.3 got better about being silent about the lack of
>> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a
>> warning about this).
>>
>> How you intend to use this setup is up to you; you may want to restrict jobs
>> to 100% IB or 100% ethernet via SGE, or you may want to let them mix,
>> realizing that the overall parallel job may be slowed down to the speed of
>> the slowest network (e.g., ethernet).
>>
>
> Now I've two basic problems:
>
> (1) Open MPI 1.3 is configured as:
> # ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge --with-openib=/usr | tee config_out
>
> But,
>
> /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> shows only one component. Is this ok?
>
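
A check like the grep above can be wrapped in a small helper so the same test works for any framework/component pair, for example the gridengine RAS component or the openib BTL. This is only a sketch: the function name has_component is hypothetical, and it assumes ompi_info prints lines in the "MCA ras: gridengine (...)" form shown above.

```shell
#!/bin/sh
# has_component FRAMEWORK NAME
# Reads ompi_info output on stdin and succeeds (exit 0) if a line of the
# form "MCA <framework>: <name> (...)" is present. The helper name is
# hypothetical; only the line format is taken from the output above.
has_component() {
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+$2([[:space:]]|\$)"
}

# Example use on a node (install path as in this thread):
#   /opt/mpi/openmpi/1.3/intel/bin/ompi_info | has_component ras gridengine \
#       && echo "gridengine RAS component is built"
#   /opt/mpi/openmpi/1.3/intel/bin/ompi_info | has_component btl openib \
#       && echo "openib BTL is built"
```

Running it on both an IB node and an ethernet node would show whether the copied installation reports the same components everywhere.
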
> (2) Open MPI itself is not working:
> ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
> A daemon (pid 31343) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> On two nodes:
>
> # /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
> bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 31184) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> ibc0 - daemon did not report back when launched
> ibc1 - daemon did not report back when launched
>
>
> # cat ih
> ibc0
> ibc1
>
> Everything is fine.
> These IB interfaces can be pinged from the master.
>
> # echo $LD_LIBRARY_PATH
> /opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6
>
> IB tests are also working fine.
> Please help us to resolve this.
>
>> Make sense?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>
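
The "orted: No such file or directory" message with exit status 127 in the two-node run suggests that the remote shell could not find orted under the same prefix on ibc0 and ibc1. A minimal sketch for checking this on every host listed in a hostfile follows. The prefix and the hostfile name "ih" come from the thread; the function name check_orted and the REMOTE_SH override are hypothetical, and the sketch assumes password-less ssh from the master to each node.

```shell
#!/bin/sh
# Sketch: verify that orted is present and executable under the same
# prefix on every host in a hostfile.
# PREFIX defaults to the install path used in this thread.
# REMOTE_SH is a hypothetical hook so another remote shell can be swapped in.
PREFIX="${PREFIX:-/opt/mpi/openmpi/1.3/intel}"
REMOTE_SH="${REMOTE_SH:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

check_orted() {
    hostfile="$1"
    status=0
    # Read the hostfile on fd 3 so the remote command cannot consume it.
    while read -r host rest <&3; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        if $REMOTE_SH "$host" "test -x '$PREFIX/bin/orted'" </dev/null; then
            echo "$host: OK"
        else
            echo "$host: orted missing or host unreachable"
            status=1
        fi
    done 3<"$hostfile"
    return "$status"
}

# Example use from the master node:
#   check_orted ih
```

A host reported as missing or unreachable would point either at the installation directory not being present on that node or at an ssh problem such as the connection timeout in the first error message.
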