Hi, I hope this is the right forum for my questions. I am running into
a problem when scaling beyond 512 cores on an InfiniBand cluster that has
14,336 cores. I am new to Open MPI and am trying to figure out the right
-mca options to pass to avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" error on a cluster that has InfiniBand HCAs and the OFED
v1.3GA release. Other MPI implementations, such as Intel MPI and MVAPICH,
work fine using the uDAPL or verbs IB layers for MPI communication.
I find it difficult to tell which network interface or IB layer is
being used. When I explicitly exclude the eth0, lo, ib1, and ib1:0
interfaces with the command-line option "-mca oob_tcp_exclude", Open MPI
continues to probe these interfaces. For all MPI traffic, Open MPI should
use ib0, which is on the 10.148 network. But with debugging enabled I see
references to the 10.149 network, which is ib1. Below is the
ifconfig network device output for a compute node.
Questions:
1. Is there a way to determine which network device is being used and to
prevent Open MPI from falling back to another device? With Intel MPI or
HP MPI you can state that no fallback device should be used. I thought "-mca
oob_tcp_exclude" would be the correct option to pass, but I may be wrong.
2. How can I determine whether the InfiniBand openib device is actually
being used? When running an MPI app I continue to see the in/out packet
counters at the TCP level increasing, when it should be using the IB RDMA
device for all MPI communication over the ib0 or mthca0 device. Open MPI was
bundled with OFED v1.3, so I am assuming the openib interface should work.
Running ompi_info shows btl_open_* references.
/usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
    -mca btl_openib_warn_default_gid_prefix 0 \
    -mca oob_tcp_exclude eth0,lo,ib1,ib1:0 \
    -mca btl openib,sm,self \
    -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1
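
One way to check question 2 at the TCP/IP level, as a minimal sketch rather than a definitive method: take two snapshots of /proc/net/dev around a run and compare the per-interface byte and packet deltas. The sketch assumes that native verbs/RDMA traffic bypasses the kernel IP stack and so does not appear in these per-interface counters, while IPoIB and other TCP traffic does. The function names are made up for illustration, and the sample snapshots are hypothetical numbers, not measurements from this cluster.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {iface: (rx_bytes, rx_pkts, tx_bytes, tx_pkts)}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines use '|' and contain no ':'
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue  # not a counter line in the expected layout
        # Receive columns come first (bytes, packets, ...), transmit starts at index 8.
        rx_bytes, rx_pkts = int(fields[0]), int(fields[1])
        tx_bytes, tx_pkts = int(fields[8]), int(fields[9])
        counters[name.strip()] = (rx_bytes, rx_pkts, tx_bytes, tx_pkts)
    return counters


def counter_deltas(before, after):
    """Subtract two snapshots, keeping only interfaces present in both."""
    return {
        name: tuple(a - b for a, b in zip(after[name], before[name]))
        for name in after
        if name in before
    }


def read_net_dev(path="/proc/net/dev"):
    """Read the live counters on a Linux node."""
    with open(path) as handle:
        return parse_net_dev(handle.read())


if __name__ == "__main__":
    # Hypothetical snapshots for illustration only.
    before = parse_net_dev(
        "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
        "   ib0: 5000 50 0 0 0 0 0 0 6000 60 0 0 0 0 0 0\n"
    )
    after = parse_net_dev(
        "  eth0: 1500 15 0 0 0 0 0 0 2600 26 0 0 0 0 0 0\n"
        "   ib0: 5000 50 0 0 0 0 0 0 6000 60 0 0 0 0 0 0\n"
    )
    for name, delta in sorted(counter_deltas(before, after).items()):
        print(name, "rx_bytes/rx_pkts/tx_bytes/tx_pkts delta:", delta)
```

On a compute node, calling read_net_dev() before and after the job and passing both results to counter_deltas() would show which IP interfaces carried traffic during the run. Alias interfaces such as ib1:0 do not get their own row in /proc/net/dev, so their traffic is counted under ib1.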
3. To avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" message I tried "-mca btl openib,sm,self" and
"-mca btl ^tcp", but I still get these error messages. With
"-mca btl openib,sm,self", Open MPI still retries over the ib1
(10.149 net) fabric to establish a connection with a node. What are my
options to avoid these connection-failed messages? I suspect Open MPI is
overflowing the TCP buffers on the clients because of the large core count
of this job, since I see lots of TCP buffer errors in the netstat -s
output. I reviewed all of the online FAQs and I am not sure which options
to pass to get around this issue.
OBTW, I did check the
/usr/mpi/openmpi-1.2-2/intel/etc/openmpi-mca-params.conf file and no
defaults are being specified.
----
Ompi_info:
Open MPI: 1.2.2
Open MPI SVN revision: r14613
Open RTE: 1.2.2
Open RTE SVN revision: r14613
OPAL: 1.2.2
OPAL SVN revision: r14613
Prefix: /usr/mpi/openmpi-1.2-2/intel
Configured architecture: x86_64-suse-linux-gnu
------
Following is the cluster configuration:
1792 nodes with 8 cores per node = 14336 cores
Ofed Rel: OFED-1.3-rc1
IB Device(s): mthca0 FW=1.2.0 Rate=20 Gb/sec (4X DDR)
              mthca1 FW=1.2.0 Rate=20 Gb/sec (4X DDR)
Processors: 2 x 4 Cores Intel(R) Xeon(R) CPU X5365 @ 3.00GHz 8192KB Cache FSB:1333MHz
Total Mem: 16342776 KB
OS Release: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 SP1
Kernel Ver: 2.6.16.54-0.2.5-smp
------
Ifconfig output:
eth0      Link encap:Ethernet  HWaddr 00:30:48:7B:A7:AC
          inet addr:192.168.159.41  Bcast:192.168.159.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe7b:a7ac/64 Scope:Link
          UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1215826 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1342035 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:787514337 (751.0 Mb)  TX bytes:170968505 (163.0 Mb)
          Base address:0x2000 Memory:dfa00000-dfa20000

ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.148.3.73  Bcast:10.148.255.255  Mask:255.255.0.0
          inet6 addr: fe80::230:487b:a7ac:1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:20823896 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19276836 errors:0 dropped:42 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:176581223103 (168400.9 Mb)  TX bytes:182691213682 (174227.9 Mb)

ib1       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.149.195.73  Bcast:10.149.255.255  Mask:255.255.192.0
          inet6 addr: fe80::230:487b:a7ad:1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:175609 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31175 errors:0 dropped:6 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:139196236 (132.7 Mb)  TX bytes:4515680 (4.3 Mb)

ib1:0     Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.149.3.73  Bcast:10.149.63.255  Mask:255.255.192.0
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:30554 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30554 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:54170543 (51.6 Mb)  TX bytes:54170543 (51.6 Mb)
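
Since ib1 and ib1:0 sit on different /18 slices of 10.149, it is easy to misread which interface a peer address in the debug output belongs to. The following is a minimal sketch that maps an address to the local interface whose subnet contains it. The interface table is copied from the ifconfig output above; the peer addresses in the demo are hypothetical stand-ins, not addresses taken from actual debug logs, and the function names are made up for illustration.

```python
import ipaddress

# Interface addresses and netmasks copied from the ifconfig output above.
INTERFACES = {
    "eth0": ("192.168.159.41", "255.255.255.0"),
    "ib0": ("10.148.3.73", "255.255.0.0"),
    "ib1": ("10.149.195.73", "255.255.192.0"),
    "ib1:0": ("10.149.3.73", "255.255.192.0"),
    "lo": ("127.0.0.1", "255.0.0.0"),
}


def interface_networks(interfaces):
    """Map each interface name to the IPv4 network it sits on."""
    return {
        name: ipaddress.ip_interface(f"{addr}/{mask}").network
        for name, (addr, mask) in interfaces.items()
    }


def owning_interfaces(peer_ip, interfaces=INTERFACES):
    """Return the interface names whose subnet contains peer_ip."""
    ip = ipaddress.ip_address(peer_ip)
    nets = interface_networks(interfaces)
    return [name for name, net in nets.items() if ip in net]


if __name__ == "__main__":
    # Hypothetical peer addresses, standing in for ones seen in debug output.
    for peer in ("10.148.7.12", "10.149.200.5", "10.149.7.12"):
        print(peer, "->", owning_interfaces(peer) or "no local subnet")
```

With these masks, ib1 covers 10.149.192.0/18 and ib1:0 covers 10.149.0.0/18, so a 10.149 address in the debug output could belong to either one depending on its third octet.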
--------
Ibstatus output:
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0030:487c:04b4:0001
        base lid:        0x4fb
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Infiniband device 'mthca1' port 1 status:
        default gid:     fe80:0000:0000:0000:0030:487c:04b5:0001
        base lid:        0x50c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
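
To see whether traffic is actually moving on the HCA ports themselves, one option is to compare the port counters that the Linux InfiniBand stack exposes under /sys/class/infiniband before and after a run. This is a minimal sketch under assumptions: it assumes the nodes expose the usual counters directory for mthca0 and mthca1, that port_xmit_data and port_rcv_data are reported in units of 4 bytes, and that the counters have not saturated (older HCAs may use 32-bit counters that stick at their maximum). The function names are made up for illustration.

```python
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")


def read_port_counters(device, port=1, base=SYSFS_IB):
    """Return {counter_name: int} for one HCA port, or {} if the path is missing."""
    counter_dir = Path(base) / device / "ports" / str(port) / "counters"
    counters = {}
    if not counter_dir.is_dir():
        return counters
    for entry in sorted(counter_dir.iterdir()):
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip anything unreadable or non-numeric
    return counters


def data_bytes_moved(before, after):
    """Return (bytes_sent, bytes_received) between two counter snapshots.

    Assumes port_xmit_data / port_rcv_data count 4-byte words.
    """
    xmit = after.get("port_xmit_data", 0) - before.get("port_xmit_data", 0)
    rcv = after.get("port_rcv_data", 0) - before.get("port_rcv_data", 0)
    return xmit * 4, rcv * 4


if __name__ == "__main__":
    for dev in ("mthca0", "mthca1"):
        snapshot = read_port_counters(dev)
        if snapshot:
            print(dev, "xmit words:", snapshot.get("port_xmit_data"),
                  "rcv words:", snapshot.get("port_rcv_data"))
        else:
            print(dev, "no counters found under", SYSFS_IB)
```

Taking one snapshot per device before the job and one after, then passing both to data_bytes_moved(), would indicate which HCA carried the data, independent of what the IP-level counters show.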
--------
Thanks in advance,
Scott