Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI scaling > 512 cores
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-04 15:47:32

First and foremost: is it possible to upgrade your version of Open
MPI? The version you are using (1.2.2) is rather ancient -- many bug
fixes have occurred since then (including TCP wireup issues). Note
that oob_tcp_in|exclude were renamed to oob_tcp_if_in|exclude in
1.2.3 to be symmetric with the <foo>_if_in|exclude params in other
components.

More below.

On Jun 3, 2008, at 1:07 PM, Scott Shaw wrote:

> Hi, I hope this is the right forum for my questions. I am running
> into a problem when scaling >512 cores on an infiniband cluster which
> has 14,336 cores. I am new to openmpi and trying to figure out the
> right -mca options to pass to avoid the
> "mca_oob_tcp_peer_complete_connect: connection failed:" error on a
> cluster which has infiniband HCAs and the OFED v1.3GA release. Other
> MPI implementations like Intel MPI and mvapich work fine using uDAPL
> or VERBs IB layers for MPI communications.

The OMPI v1.2 series is a bit inefficient in its TCP wireup for
control messages -- it creates TCP sockets between all MPI processes.
Do you allow enough fd's per process to allow this to occur?

(this situation is considerably better in the upcoming v1.3 series)
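
As a rough sanity check of the fd question above, here is a minimal
sketch. It assumes one TCP socket per peer process plus a reserve of 64
descriptors for stdio, libraries, and application files. Both the
reserve value and the helper names (fds_needed, fd_limit_ok) are made
up for illustration; they are not something Open MPI provides.

```shell
# Estimate descriptors per process: one socket per peer (assumption)
# plus a reserve for stdio, libraries, and files (hypothetical default).
fds_needed() {
    nprocs=$1
    reserve=${2:-64}
    echo $(( nprocs - 1 + reserve ))
}

# Succeeds if the given limit covers the estimate.
# "unlimited" (as printed by ulimit -n on some systems) always passes.
fd_limit_ok() {
    limit=$1
    needed=$2
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]
}

nprocs=1024                      # job size from the original question
needed=$(fds_needed "$nprocs")
limit=$(ulimit -n)               # soft limit of the current shell

if fd_limit_ok "$limit" "$needed"; then
    echo "fd limit $limit looks sufficient for $nprocs processes (estimate: $needed)"
else
    echo "fd limit $limit is below the estimated $needed descriptors for $nprocs processes"
fi
```

Run it on a compute node, in the same environment the MPI processes
get, since the limit in your login shell may differ.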

> I find it difficult to understand which network interface or IB layer
> is being used. When I explicitly state not to use eth0,lo,ib1, or
> ib1:0 interfaces with the cmdline option "-mca oob_tcp_exclude"
> openmpi will continue to probe these interfaces. For all MPI traffic
> openmpi should use IB0 which is the 10.148 network. But with debugging
> enabled I see references trying the 10.149 network which is IB1.
> Below is the ifconfig network device output for a compute node.

Just curious: does the oob_tcp_include parameter not work?

> Questions:
> 1. Is there a way to determine which network device is being used and
> not have openmpi fall back to another device? With Intel MPI or HP MPI
> you can state not to use a fallback device. I thought "-mca
> oob_tcp_exclude" would be the correct option to pass but I may be
> wrong.

oob_tcp_in|exclude should be suitable for this purpose. If they're
not working, I'd be surprised (but it could have been a bug that was
fixed in a later version...?). Keep in mind that the "oob" traffic is
just control messages -- it's not the actual MPI communication. That
will go over the verbs interfaces.
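
To make the include variant concrete, here is a small sketch that picks
the parameter name from the Open MPI version (per the 1.2.3 rename
mentioned above) and prints a command line that pins the oob traffic to
one interface while leaving the btl selection alone. The helper name
oob_include_param is made up, it assumes plain x.y.z version strings,
and "ib0" and "./test_ompi" are placeholders based on your description.
It only prints the command; "ompi_info --param oob tcp" should show
which names your build actually accepts.

```shell
# Hypothetical helper: choose the oob include parameter name for a given
# Open MPI version string (assumed to look like "1.2.2").
oob_include_param() {
    old_ifs=$IFS
    IFS=.
    set -- $1                    # split the version on dots
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    patch=${3:-0}
    patch=${patch%%[!0-9]*}      # drop suffixes such as "rc1"
    patch=${patch:-0}
    if [ "$major" -gt 1 ] ||
       { [ "$major" -eq 1 ] && [ "$minor" -gt 2 ]; } ||
       { [ "$major" -eq 1 ] && [ "$minor" -eq 2 ] && [ "$patch" -ge 3 ]; }; then
        echo oob_tcp_if_include  # name used from 1.2.3 on
    else
        echo oob_tcp_include     # name used up to 1.2.2
    fi
}

param=$(oob_include_param 1.2.2)

# oob parameter: control messages only. btl parameter: MPI traffic.
echo "mpiexec -mca $param ib0 -mca btl openib,sm,self -np 1024 ./test_ompi"
```

Using include with the one interface you want is usually less fragile
than listing everything to exclude, since an alias such as ib1:0 is
easy to miss.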

> 2. How can I determine whether the infiniband openib device is
> actually being used? When running an MPI app I continue to see
> counters for in/out packets at a tcp level increasing when it should
> be using the IB RDMA device for all MPI comms over the IB0 or mthca0
> device. OpenMPI was bundled with OFED v1.3 so I am assuming the openib
> interface should work. Running ompi_info shows btl_open_* references.
>
> /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
>   -mca btl_openib_warn_default_gid_prefix 0 \
>   -mca oob_tcp_exclude eth0,lo,ib1,ib1:0 \
>   -mca btl openib,sm,self -machinefile mpd.hosts.$$ \
>   -np 1024 ~/bin/test_ompi < input1

The "btl" is the component that controls point-to-point communication
in Open MPI, so if you specify "openib,sm,self", then Open MPI is
definitely using the verbs stack for MPI communication (not a TCP
stack).
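
If you want to confirm that the openib btl was actually built into your
installation, something like the sketch below filters the ompi_info
output. It assumes component lines of the form "MCA btl: openib (...)",
which is what I would expect ompi_info to print; the helper name
has_btl is made up. It only tells you the component is available. It is
the "-mca btl openib,sm,self" selection that keeps the TCP btl out of
the MPI data path.

```shell
# Hypothetical helper: read ompi_info output on stdin and succeed if the
# named btl component is listed (assumed line format: "MCA btl: NAME (").
has_btl() {
    grep -q "MCA btl: $1 "
}

if command -v ompi_info >/dev/null 2>&1; then
    for name in openib sm self tcp; do
        if ompi_info | has_btl "$name"; then
            echo "btl $name: available"
        else
            echo "btl $name: not found"
        fi
    done
else
    echo "ompi_info not found in PATH"
fi
```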

> 3. When trying to avoid the "mca_oob_tcp_peer_complete_connect:
> connection failed:" message I tried using "-mca btl openib,sm,self"
> and "-mca btl ^tcp" but I still get these error messages.

Unfortunately, these are two different issues -- OMPI always uses TCP
for wireup and out-of-band control messages. That's where you're
getting the errors from. Specifically: giving values for the btl MCA
parameter won't affect these messages / errors.

> In cases using the "-mca btl openib,sm,self" option, openmpi will
> retry to use the IB1 (10.149 net) fabric to establish a connection
> with a node. What are my options to avoid these connection failed
> messages? I suspect openmpi is overflowing the tcp buffer on the
> clients based on the large core count of this job since I see lots of
> tcp buffer errors based on netstat -s output. I reviewed all of the
> online FAQs and I am not sure what options to pass to get around this
> issue.

I think we made this much better in 1.2.5 -- I see notes about this
issue in the NEWS file under the 1.2.5 release.

Jeff Squyres
Cisco Systems