
Open MPI User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-05-11 05:54:44


On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:

> My ultimate goal is to get Open MPI working with openIB stack.
> First, I had
> installed lam-mpi , I know it doesn't have support for openIB but
> it's still
> relevant to some of my questions I will ask.. Here is the set up
> I have:

Yes. Keep in mind throughout that while Open MPI supports the Open IB
stack natively (along with mVAPI), LAM/MPI will fall back to using IP
over IB for all communication.

> I have two machines, pe830-01 and pe830-02 .. Both have ethernet
> interface and
> HCA interface. The IP addresses follow:
>              eth0          ib0
> pe830-01     10.12.4.32    192.168.1.32
> pe830-02     10.12.4.34    192.168.1.34
>
> So this has worked even though it lamhosts file is configured to
> use ib0
> interfaces. I further verified with tcpdump command that none of
> this went
> to eth0 ..
>
> Anyhow, if i change the lamhosts file to use the eth0 IPs,
> things work just
> as the same with no issues . And in that case i see some traffic
> on eth0
> with tcpdump.

Ok, so at least it sounds like your TCP network is sanely configured.

> Now, when i installed and used Open MPI, things didn't work as
> easy.. Here is
> what happens. After recompiling the sources with the mpicc that
> comes with
> open-mpi:
>
> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca
> pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32
> /path/to/hello_world
> Hello, world, I am 0 of 2 and this is on : pe830-02.
> Hello, world, I am 1 of 2 and this is on: pe830-01.
>
> So far so good, using eth0 interfaces.. hello_world works just
> fine. Now,
> when i try the broadcast program:

When you specify BTLs explicitly, you always need to list at least
two: the one you want to use for communication between processes
(mvapi, openib, tcp, etc.) and "self", which handles a process
sending messages to itself. You can run into problems otherwise.
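A minimal sketch of the corrected invocation, reusing the hosts and
paths from this thread (adjust for your own setup):

```shell
# Always pair the transport BTL with "self" so a process can
# deliver messages to itself:
/usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl tcp,self \
    -np 2 --host 10.12.4.34,10.12.4.32 /path/to/hello_world
```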

> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca
> pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32
> /path/to/broadcast
>
> It just hangs there, it doesn't prompt me the "Enter the vector
> length:"
> string . So i just enter a number anyway since i know the
> behavior of the
> program:
>
> 10
> Enter the vector length: i am: 0 , and i have 5 vector elements
> i am: 1 , and i have 5 vector elements
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
> [0] 10.000000
>
> So, that's the first bump with the openmpi.. Now , if i try to
> use ib0
> interfaces instead of eth0 ones, i get:

I'm actually surprised this worked in LAM/MPI, to be honest. There
should be an fflush() after the printf() to make sure that the output
is actually sent out of the application.

> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca
> pls_rsh_agent ssh --mca btl openib -np 2 --host
> 192.168.1.34,192.168.1.32
> /path/to/hello_world
>
> ----------------------------------------------------------------------
> ----
> No available btl components were found!
>
> This means that there are no components of this type installed
> on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort.
> Check the
> output of the "ompi_info" command and ensure that components of
> this
> type are available on your system. You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it
> has at
> least one directory that contains valid MCA components.
>
>
> ----------------------------------------------------------------------
> ----
> [pe830-01.domain.com:05942]
>
> I know, it thinks that it doesn't have openib components
> installed, however,
> ompi_info on both machines say otherwise:
>
> $ ompi_info | grep openib
> MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0.2)
> MCA btl: openib (MCA v1.0, API v1.0, Component v1.0.2)

I don't think it will help, but can you try again with --mca btl
openib,self? For some reason, it appears that the openib component
is saying that it can't run.

> Now the questions are...
> 1- In the case of using lam/mpi over ib0 interfaces.. Does lam/mpi
> automatically just use IPoIB ?

Yes. LAM has no idea what Open IB is -- when you point it at the ib0
addresses, it just treats them as ordinary IP interfaces, so the
traffic goes over IP over IB.

> 2 - Is there a tcpdump-like utility to dump the traffic on
> Infiniband HCAs?

I'm not aware of one, but such a tool may well exist.

> 3 - In the case of Open MPI, does --mca btl arg option have to
> be passed
> everytime? For example,
>
> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca
> pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32
> /path/to/hello_world
>
> works just fine, but the same command without the "--mca btl
> tcp" bit gives
> the:
>
>
> ----------------------------------------------------------------------
> ----
> It looks like MPI_INIT failed for some reason; your parallel
> process is
> likely to abort. There are many reasons that a parallel process
> can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems. This failure appears to be an internal failure;
> here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned value -2 instead of OMPI_SUCCESS
>
> ----------------------------------------------------------------------
> ----
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> error ...

This makes it sound like Open IB is failing to set up properly. I'm a
bit out of my league on this one -- is there any non-MPI Open IB test
application you can run to verify that the stack itself is working?

> 4 - How come the behavior of broadcast.c was different on Open MPI
> than it is
> on lam/mpi?

I think I answered this one already.

> 5 - Any ideas as to why i am getting no btl component error when
> i want to
> use openib even though ompi_info shows it? If it help any
> further , I have
> the following openib modules :

This usually (but not always) indicates that something is going wrong
with initializing the hardware interface. ompi_info only tries to
load the component; it does not initialize the network device.
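One way to dig further is to ask the btl framework to explain its
component selection. This is a sketch: the verbosity parameter is a
standard MCA knob in Open MPI, but I haven't verified its exact
behavior on a 1.0.2 install.

```shell
# Have the btl framework report which components it opened,
# which it excluded, and why:
/usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl openib,self \
    --mca btl_base_verbose 100 \
    -np 2 --host 192.168.1.34,192.168.1.32 /path/to/hello_world

# List the openib component's parameters to confirm it loads:
ompi_info --param btl openib
```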

Brian

-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/