Subject: Re: [OMPI users] OpenMPI scaling > 512 cores
From: Scott Shaw (sshaw_at_[hidden])
Date: 2008-06-04 14:43:24


Hi, I was wondering if anyone had any comments regarding the questions I
posted below. Am I off base with these questions, or is this the wrong
forum for this type of question?

>
> Hi, I hope this is the right forum for my questions. I am running into a
> problem when scaling >512 cores on an InfiniBand cluster which has 14,336
> cores. I am new to Open MPI and trying to figure out the right -mca
> options to pass to avoid the "mca_oob_tcp_peer_complete_connect:
> connection failed:" errors on a cluster which has InfiniBand HCAs and the
> OFED v1.3 GA release. Other MPI implementations like Intel MPI and
> MVAPICH work fine using uDAPL or verbs IB layers for MPI communications.
>
> I find it difficult to determine which network interface or IB layer is
> being used. When I explicitly state not to use the eth0, lo, ib1, or
> ib1:0 interfaces with the command-line option "-mca oob_tcp_exclude",
> Open MPI continues to probe these interfaces. For all MPI traffic Open
> MPI should use ib0, which is the 10.148 network, but with debugging
> enabled I see references to connection attempts on the 10.149 network,
> which is ib1. Below is the ifconfig network device output for a compute
> node.
>
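> Is raising the framework verbosity the right way to see which interfaces
> and BTLs actually get selected? The following is only a sketch, and I am
> assuming the usual framework verbosity parameters (btl_base_verbose,
> oob_base_verbose) are available in this 1.2 build:
>
>   # sketch: print BTL/oob selection details at startup
>   /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
>       -mca btl_base_verbose 30 -mca oob_base_verbose 10 \
>       -mca btl openib,sm,self \
>       -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1
>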
> Questions:
>
> 1. Is there a way to determine which network device is being used, and to
> keep Open MPI from falling back to another device? With Intel MPI or HP
> MPI you can state not to use a fallback device. I thought "-mca
> oob_tcp_exclude" would be the correct option to pass, but I may be wrong.
>
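> For example, would pinning everything to ib0 along these lines be the
> right approach? This is only a sketch; I am assuming the
> btl_tcp_if_include and oob_tcp_include parameters exist in this 1.2 build
> (ompi_info --param btl tcp and ompi_info --param oob tcp should confirm
> that):
>
>   # sketch: restrict both the TCP BTL and the oob TCP channel to ib0
>   /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
>       -mca btl openib,sm,self \
>       -mca btl_tcp_if_include ib0 \
>       -mca oob_tcp_include ib0 \
>       -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1
>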
> 2. How can I determine whether the InfiniBand openib device is actually
> being used? When running an MPI app I continue to see the in/out packet
> counters increasing at the TCP level, when all MPI communication should
> be going over the IB RDMA device (ib0 / mthca0). Open MPI was bundled
> with OFED v1.3, so I am assuming the openib interface should work.
> Running ompi_info shows btl_openib_* references.
>
> /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec -mca
> btl_openib_warn_default_gid_prefix 0 -mca oob_tcp_exclude
> eth0,lo,ib1,ib1:0 -mca btl openib,sm,self -machinefile mpd.hosts.$$
> -np 1024 ~/bin/test_ompi < input1
>
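> To check this, would it be reasonable to grep the build for the openib
> BTL and compare the HCA port counters before and after a run? Just a
> sketch, assuming the usual OFED sysfs counter layout on these nodes:
>
>   # sketch: confirm the openib BTL is built into this installation
>   /usr/mpi/openmpi-1.2-2/intel/bin/ompi_info | grep openib
>
>   # sketch: watch HCA traffic counters while the job runs
>   cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data
>   cat /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data
>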
> 3. When trying to avoid the "mca_oob_tcp_peer_complete_connect:
> connection failed:" message I tried using "-mca btl openib,sm,self" and
> "-mca btl ^tcp", but I still get these error messages. Even with "-mca
> btl openib,sm,self", Open MPI will retry the IB1 (10.149) fabric to
> establish a connection with a node. What are my options to avoid these
> connection failed messages? I suspect Open MPI is overflowing the TCP
> buffers on the clients because of the large core count of this job,
> since I see lots of TCP buffer errors in netstat -s output. I reviewed
> all of the online FAQs and I am not sure which options to pass to get
> around this issue.
>
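> Since mca_oob_tcp_peer_complete_connect comes from the out-of-band (oob)
> TCP channel rather than the TCP BTL, I assume "-mca btl ^tcp" would not
> affect it. Would something along these lines be worth trying? This is
> only a sketch; I am assuming oob_tcp_include and oob_tcp_listen_mode are
> supported by this 1.2 build:
>
>   # sketch: keep the oob channel on the ib0 (10.148) network and use a
>   # dedicated listen thread to drain the connection backlog
>   /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
>       -mca btl openib,sm,self \
>       -mca oob_tcp_include ib0 \
>       -mca oob_tcp_listen_mode listen_thread \
>       -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1
>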
> OBTW, I did check the
> /usr/mpi/openmpi-1.2-2/intel/etc/openmpi-mca-params.conf file and no
> defaults are being specified there.
>
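> If it helps, I could also set site-wide defaults in that file instead of
> passing them on every command line; something like the following sketch,
> again assuming these parameter names are valid for this build:
>
>   # sketch: /usr/mpi/openmpi-1.2-2/intel/etc/openmpi-mca-params.conf
>   btl = openib,sm,self
>   btl_openib_warn_default_gid_prefix = 0
>   oob_tcp_include = ib0
>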
> ----
>
> Ompi_info:
> Open MPI: 1.2.2
> Open MPI SVN revision: r14613
> Open RTE: 1.2.2
> Open RTE SVN revision: r14613
> OPAL: 1.2.2
> OPAL SVN revision: r14613
> Prefix: /usr/mpi/openmpi-1.2-2/intel
> Configured architecture: x86_64-suse-linux-gnu
>
> ------
>
> Following is the cluster configuration:
> 1792 nodes with 8 cores per node = 14336 cores
> Ofed Rel: OFED-1.3-rc1
> IB Device(s): mthca0 FW=1.2.0 Rate=20 Gb/sec (4X DDR) mthca1 FW=1.2.0
> Rate=20 Gb/sec (4X DDR)
> Processors: 2 x 4-core Intel(R) Xeon(R) CPU X5365 @ 3.00GHz, 8192 KB
> cache, FSB 1333 MHz
> Total Mem: 16342776 KB
> OS Release: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 SP1
> Kernel Ver: 2.6.16.54-0.2.5-smp
>
> ------
>
> Ifconfig output:
> eth0 Link encap:Ethernet HWaddr 00:30:48:7B:A7:AC
> inet addr:192.168.159.41 Bcast:192.168.159.255
> Mask:255.255.255.0
> inet6 addr: fe80::230:48ff:fe7b:a7ac/64 Scope:Link
> UP BROADCAST NOTRAILERS RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:1215826 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1342035 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:787514337 (751.0 Mb) TX bytes:170968505 (163.0 Mb)
> Base address:0x2000 Memory:dfa00000-dfa20000
>
> ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> inet addr:10.148.3.73 Bcast:10.148.255.255 Mask:255.255.0.0
> inet6 addr: fe80::230:487b:a7ac:1/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
> RX packets:20823896 errors:0 dropped:0 overruns:0 frame:0
> TX packets:19276836 errors:0 dropped:42 overruns:0 carrier:0
> collisions:0 txqueuelen:256
> RX bytes:176581223103 (168400.9 Mb) TX bytes:182691213682
> (174227.9 Mb)
>
> ib1 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> inet addr:10.149.195.73 Bcast:10.149.255.255 Mask:255.255.192.0
> inet6 addr: fe80::230:487b:a7ad:1/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
> RX packets:175609 errors:0 dropped:0 overruns:0 frame:0
> TX packets:31175 errors:0 dropped:6 overruns:0 carrier:0
> collisions:0 txqueuelen:256
> RX bytes:139196236 (132.7 Mb) TX bytes:4515680 (4.3 Mb)
>
> ib1:0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> inet addr:10.149.3.73 Bcast:10.149.63.255 Mask:255.255.192.0
> UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:30554 errors:0 dropped:0 overruns:0 frame:0
> TX packets:30554 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:54170543 (51.6 Mb) TX bytes:54170543 (51.6 Mb)
>
> --------
>
> Ibstatus output:
> Infiniband device 'mthca0' port 1 status:
> default gid: fe80:0000:0000:0000:0030:487c:04b4:0001
> base lid: 0x4fb
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 20 Gb/sec (4X DDR)
>
> Infiniband device 'mthca1' port 1 status:
> default gid: fe80:0000:0000:0000:0030:487c:04b5:0001
> base lid: 0x50c
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 20 Gb/sec (4X DDR)
>
> --------
>
> Thanks in advance,
> Scott