Dear Jeff,
I reorganized my cluster and ran the following test with 15 nodes:
[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[allan@a1 bench]$
It spewed out the above errors, but I continued the test for two and a half hours while monitoring HPL.out. It gives a maximum of 21.77 GFlops for 15 nodes, which is not bad. I think it printed those errors because on the four x86-64 machines, a13-16, the NIC connected to the gigabit LAN is eth0 and not eth1 as on the rest. The head node also uses eth0. I removed one NIC from the head node to make things simpler to troubleshoot.
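To illustrate the suspected cause, here is a minimal Python sketch of the check I have in mind: compare the interface names passed to btl_tcp_if_include against the names a node actually has. The helper names missing_interfaces and local_interface_names are my own, purely for illustration, and this is not how Open MPI itself performs the check.

```python
import socket


def missing_interfaces(requested, available):
    """Return the requested interface names that are absent on this host."""
    present = set(available)
    return [name for name in requested if name not in present]


def local_interface_names():
    """Names of the network interfaces the local OS reports (Unix only)."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Same list as given to --mca btl_tcp_if_include on the command line.
    requested = "eth1".split(",")
    print(missing_interfaces(requested, local_interface_names()))
```

Run on each node, this would presumably report eth1 as missing on a13-16 and nothing on the others, which would match the four error lines above.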
Here is HPL.out
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 25920
NB : 120
PMAP : Row-major process mapping
P : 3
Q : 5
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L2 25920 120 3 5 534.25 2.173e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L4 25920 120 3 5 536.98 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C2 25920 120 3 5 540.73 2.147e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C4 25920 120 3 5 533.76 2.175e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2R2 25920 120 3 5 537.28 2.161e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2R4 25920 120 3 5 533.38 2.177e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2L2 25920 120 3 5 540.45 2.148e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2L4 25920 120 3 5 536.87 2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2C2 25920 120 3 5 533.98 2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2C4 25920 120 3 5 535.31 2.169e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2R2 25920 120 3 5 536.65 2.164e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2R4 25920 120 3 5 536.97 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2L2 25920 120 3 5 534.09 2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2L4 25920 120 3 5 534.96 2.170e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2C2 25920 120 3 5 536.73 2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2C4 25920 120 3 5 536.91 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2R2 25920 120 3 5 535.96 2.166e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2R4 25920 120 3 5 536.16 2.165e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
Finished 18 tests with the following results:
18 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
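As a sanity check on the Gflops column, the figures can be recomputed from N and Time. The sketch below assumes the operation count that, as far as I know, HPL uses for an LU solve, (2/3)N^3 + (3/2)N^2; the function name hpl_gflops is mine.

```python
def hpl_gflops(n, seconds):
    """Gflops for solving an n x n system in the given wall time.

    Assumes the operation count (2/3)*n^3 + (3/2)*n^2.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Fastest and slowest runs from the table above.
    for seconds in (533.38, 540.73):
        print(seconds, round(hpl_gflops(25920, seconds), 2))
```

For the runs above, this agrees with the reported values to the precision HPL prints.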
Here is the result of the test carried out with --mca btl tcp --mca btl_tcp_if_include eth1,eth0 which hangs.
[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1,eth0 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,1][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,6][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,7][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,2][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,3][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,8][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,4][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,5][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,10][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,9][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
Tell me: if I connect all the 10/100 Mbps cards to a 10/100 Mbps switch alongside the gigabit network and specify --mca btl_tcp_if_include eth1,eth0 as before, will the problem go away, and might I get increased bandwidth?
I will also try the same test with the -mca pml teg switch to see if there is a difference!
Thank you,
Allan
Message: 2
Date: Sun, 13 Nov 2005 15:51:30 -0500
From: Jeff Squyres
<jsquyres@open-mpi.org>
Subject: Re: [O-MPI users] HPL anf TCP
To: Open MPI Users
<users@open-mpi.org>
Message-ID:
<f143e44670c59a2f345708e6e0fad549@open-mpi.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed
On Nov 3, 2005, at 8:35 PM, Allan Menezes wrote:
> 1. No, I have 4 NICs on the head node and two on each of the 15 other
> compute nodes. I use the realtek 8169 gigabit ethernet cards on the
> compute nodes as eth1 or eth0(one only) connected to a gigabit
> ethernet switch with bisection bandwidth of 32Gbps and a sk98lin
> driver 3Com built in gigabit ethernet NIC card on the head node(eth3).
> The other ethernet cards 10/100M on the head node handle a network
> laser printer(eth0) and eth2 (10/100M) internet access. Eth1 is a
> spare 10/100M which I can remove. The compute nodes each have two
> ethernet cards one 10/100Mbps ethernet not connected to anything(built
> in to M/B) and a PCI realtek 8169 gigabit ethernet connected to the
> TCP network LAN(Gigabit). When I tried it without the switches -mca
> pml teg the maximum performance I would get with it was 9 GFlops for P=4
> Q=4 N=approx 12-16 thousand and NB ridiculously low at 10 block size.
> If I tried bigger block sizes it would run for a long time for large N
> ~ 16,000 unless I killed xhpl. I use atlas BLAS 3.7.11 libs compiled
> for each node and linked to HPL when creating xhpl. I also use open
> mpi mpicc in the hpl make file for compile and link both. Maybe I
> should according to the new faq use the TCP switch to use eth3 on the
> head node?
So if I'm reading that right, there's only one network that connects
the head node and the compute nodes, right?
> 2. I have 512MB of memory per node which is 8 GB total, so I can
> safely go up to N=22,000 to 24,000. I used sizes of 22000 for TCP teg and
> did not run into problems. But if I do not specify the switches
> suggested by Tim I get bad performance for N = 12000.
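As an aside on the sizing quoted above, the memory taken by the HPL matrix can be estimated as 8 bytes per double times N squared. The sketch below assumes 16 nodes with 512 MiB each, as described in the quote, and ignores workspace and the operating system's own use; the function names are illustrative only.

```python
def hpl_matrix_bytes(n):
    """Bytes needed for an n x n matrix of 8-byte doubles."""
    return 8 * n * n


def memory_fraction(n, nodes, mib_per_node):
    """Fraction of the cluster's total RAM taken by the matrix alone."""
    total_bytes = nodes * mib_per_node * 1024 * 1024
    return hpl_matrix_bytes(n) / total_bytes


if __name__ == "__main__":
    for n in (22000, 24000):
        print(n, round(memory_fraction(n, nodes=16, mib_per_node=512), 2))
```

Both sizes come out at roughly half of the total memory, so they should fit comfortably if the assumptions hold.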
I must admit that I'm still befuddled by this -- we are absolutely
unable to duplicate this behavior. It *sounds* like there is some
network mismatching going on in here -- that the tcp btl is somehow
routing information differently than the tcp ptl (and therefore taking
longer -- timing out and the like).
We did make some improvements to the tcp subnet mask matching code for
rc5; I hate to ask again, but could you try with the latest nightly
snapshot tarball?
http://www.open-mpi.org/nightly/v1.0/
> 4. My cluster is an experimental Basement Cluster [BSquared = Brampton
> Beowulf] built out of x86 Durons(6), 2 Athlons, 2 Semprons, two P4s, 2
> 64 bit x86_64 AMD64 Athlons and two AMD x86_64 Semprons(754 pin) for a
> total of 16 machines running FC3 and Oscar beta cluster software. I
> have not tried it with the latest open mpi snapshot yet but I will
> tonight. I think I should reinstall FC3 on the head node P4 2.8GHz and
> reinstall all the compute nodes with Oscar beta Nov 3, 2005 and open
> mpi of today's Nov 3, 2005 1.0 snapshot and try again. I could have
> made an error somewhere before. It should not take me long. But I
> doubt it as MPICH2 and open mpi with the switches pml teg give good
> comparable performance. I was not using jumbo MTU frames either just
> 1500 bytes. It is not homogeneous (BSquared) but a good test set up.
> If you have any advice, please tell me and I could try it out.
> Thank you and good luck!
> Allan
>
>
>
>
>
> On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:
>
>
>> > On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:
>> >
>> >
>>
>>> >> We've done linpack runs recently w/ Infiniband, which result in
>>> >> performance comparable to mvapich, but not w/ the tcp port. Can you
>>> >> try running w/ an earlier version, specify on the command line:
>>> >>
>>> >> -mca pml teg
>>> >>
>>> >> Hi Tim,
>>> >> I tried the same cluster (16 node x86) with the switches -mca pml
>>> >> teg and I get good performance of 24.52 Gflops at N=22500
>>> >> and block size NB=120.
>>> >> My command line now looks like:
>>> >> a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml teg -hostfile aa -np 16 ./xhpl
>>> >> hostfile = aa, containing the addresses of the 16 machines.
>>> >> I am using a GS116 16 port netgear Gigabit ethernet switch with Gnet
>>> >> realtek gig ethernet cards.
>>> >> Why, PLEASE, do these switches pml teg make such a difference? It's
>>> >> 2.6 times more performance in GFlops than what I was getting without
>>> >> them.
>>> >> I tried version rc3 and not an earlier version.
>>> >> Thank you very much for your assistance!
>>> >>
>> >
>> > Sorry for the delay in replying to this...
>> >
>> > The "pml teg" switch tells Open MPI to use the 2nd generation TCP
>> > implementation rather than the 3rd generation TCP. More specifically,
>> > the "PML" is the point-to-point management layer. There are 2
>> > different components for this -- teg (2nd generation) and ob1 (3rd
>> > generation). "ob1" is the default; specifying "--mca pml teg" tells
>> > Open MPI to use the "teg" component instead of ob1.
>> >
>> > Note, however, that teg and ob1 know nothing about TCP -- it's the
>> > 2nd order implications that make the difference here. teg and ob1 use
>> > different back-end components to talk across networks:
>> >
>> > - teg uses PTL components (point-to-point transport layer -- 2nd gen)
>> > - ob1 uses BTL components (byte transfer layer -- 3rd gen)
>> >
>> > We obviously have TCP implementations for both the PTL and BTL.
>> > Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
>> > Unfortunately, as yet, little time has been spent optimizing the TCP
>> > BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
>> >
>> > We have spent the majority of our time, so far, optimizing the
>> > Myrinet and Infiniband BTLs (therefore showing that excellent
>> > performance is achievable in the BTLs). However, I'm quite
>> > disappointed by the TCP BTL performance -- it sounds like we have a
>> > protocol mismatch that is arbitrarily slowing everything down, and
>> > something that needs to be fixed before 1.0 (it's not a problem with
>> > the BTL design, since IB and Myrinet performance is quite good --
>> > just a problem/bug in the TCP implementation of the BTL). That much
>> > performance degradation is clearly unacceptable.
>> >
>> > --
>> > {+} Jeff Squyres
>> > {+} The Open MPI Project
>> > {+} http://www.open-mpi.org/
>> >
>> > _______________________________________________
>> > users mailing list
>> > users@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>>