Dear Jeff,

I reorganized my cluster and ran the following test with 15 nodes:

[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[allan@a1 bench]$

It spewed out the above errors, but I let the test run for two and a half hours while monitoring HPL.out. It gives a maximum of 21.77 GFlops for 15 nodes, which is not bad. I think the reason it spewed out those errors is that on the four x86_64 machines (a13-a16) the NIC connected to the gigabit LAN is eth0, not eth1 as on the rest. On the head node it is eth0 as well. I removed one NIC from the head node to make things simpler to troubleshoot.

Here is HPL.out:

============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 25920
NB     : 120
PMAP   : Row-major process mapping
P      : 3
Q      : 5
PFACT  : Left    Crout    Right
NBMIN  : 2       4
NDIV   : 2
RFACT  : Left    Crout    Right
BCAST  : 1ring
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L2       25920   120     3     5             534.25          2.173e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L4       25920   120     3     5             536.98          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C2       25920   120     3     5             540.73          2.147e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C4       25920   120     3     5             533.76          2.175e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2R2       25920   120     3     5             537.28          2.161e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2R4       25920   120     3     5             533.38          2.177e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2L2       25920   120     3     5             540.45          2.148e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2L4       25920   120     3     5             536.87          2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2C2       25920   120     3     5             533.98          2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2C4       25920   120     3     5             535.31          2.169e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2R2       25920   120     3     5             536.65          2.164e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2R4       25920   120     3     5             536.97          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2L2       25920   120     3     5             534.09          2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2L4       25920   120     3     5             534.96          2.170e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2C2       25920   120     3     5             536.73          2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2C4       25920   120     3     5             536.91          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R2       25920   120     3     5             535.96          2.166e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R4       25920   120     3     5             536.16          2.165e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================

Finished     18 tests with the following results:
             18 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

Here is the result of the test carried out with --mca btl tcp --mca btl_tcp_if_include eth1,eth0, which hangs:

[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1,eth0 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,1][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,6][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,7][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,2][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,3][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,8][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,4][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,5][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,10][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,9][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"

Can you tell me: if I connect all the 10/100 Mbps cards to a 10/100 Mbps switch alongside the gigabit network and specify --mca btl_tcp_if_include eth1,eth0 as before, will the problem go away, and might I get increased bandwidth? I will be trying the same with the pml teg switches to see if there is a difference!
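One workaround I am thinking of trying for the mismatched interface names, since the gigabit NIC is eth1 on some nodes and eth0 on others, is to set the MCA parameter per node rather than on the mpirun command line. This is only a sketch: I am assuming Open MPI honours btl_tcp_if_include from a per-node $HOME/.openmpi/mca-params.conf file the same way it does from the command line, and that the file can actually differ per node (i.e. home directories are not NFS-shared):

    # on the nodes whose gigabit NIC is eth1:
    echo "btl_tcp_if_include = eth1" >> $HOME/.openmpi/mca-params.conf

    # on the four x86_64 nodes (a13-a16) whose gigabit NIC is eth0:
    echo "btl_tcp_if_include = eth0" >> $HOME/.openmpi/mca-params.conf

    # then run without forcing one interface name on every node:
    mpirun -mca btl tcp --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl

If the unused built-in 10/100 ports really have no address configured, it may even be enough to drop btl_tcp_if_include entirely and let the TCP component pick whatever interfaces are up on each node.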
Thank you,
Allan

Date: Sun, 13 Nov 2005 15:51:30 -0500
From: Jeff Squyres <jsquyres@open-mpi.org>
Subject: Re: [O-MPI users] HPL and TCP
To: Open MPI Users <users@open-mpi.org>

On Nov 3, 2005, at 8:35 PM, Allan Menezes wrote:
> 1. No, I have 4 NICs on the head node and two on each of the 15 other
> compute nodes. On the compute nodes I use Realtek 8169 gigabit ethernet
> cards, as eth1 or eth0 (one only), connected to a gigabit ethernet switch
> with a bisection bandwidth of 32 Gbps; the head node has a built-in 3Com
> gigabit NIC (sk98lin driver) as eth3. The other head-node cards are
> 10/100M: eth0 drives a network laser printer and eth2 provides internet
> access. Eth1 is a spare 10/100M card which I can remove. Each compute
> node has two ethernet cards: a built-in 10/100 Mbps port not connected to
> anything, and a PCI Realtek 8169 gigabit card connected to the gigabit
> TCP LAN. When I tried it without the -mca pml teg switches, the maximum
> performance I could get was 9 GFlops for P=4, Q=4, N = approx. 12-16
> thousand, and a ridiculously low block size of NB=10. If I tried bigger
> block sizes it would run for a long time at large N (~16,000) unless I
> killed xhpl. I use ATLAS BLAS 3.7.11 libraries compiled for each node and
> linked into HPL when building xhpl. I also use Open MPI's mpicc in the
> HPL makefile for both compiling and linking. Maybe, according to the new
> FAQ, I should use the TCP MCA switch to make the head node use eth3?
  

So if I'm reading that right, there's only one network that connects 
the head node and the compute nodes, right?
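(If it helps to double-check, something along these lines should show which interface on each node actually carries the gigabit LAN address -- just a sketch, assuming password-less ssh to every host, that your hostfile aa contains one bare hostname per line, and that ifconfig lives in /sbin:)

    for h in $(cat aa); do
        echo "== $h =="
        ssh $h /sbin/ifconfig | grep -A1 "^eth"
    done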

> 2. I have 512 MB of memory per node, which is 8 GB total, so I can
> safely go up to N = 22,000-24,000. I used sizes of 22,000 for TCP with
> teg and did not run into problems. But if I do not specify the switches
> suggested by Tim, I get bad performance for N = 12,000.
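(As a sanity check on the problem size: a common rule of thumb is to pick N so that the N x N double-precision matrix fills roughly 80% of total memory. With 8 GB aggregate that works out to N of roughly 29,000, so 22,000-24,000 leaves comfortable headroom -- a quick back-of-the-envelope, assuming bc is installed:)

    # largest N whose N x N matrix of 8-byte doubles fits in ~80% of 8 GB
    echo 'sqrt(0.80 * 8 * 1024^3 / 8)' | bc -l     # ~29308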
  

I must admit that I'm still befuddled by this -- we are absolutely 
unable to duplicate this behavior.  It *sounds* like there is some 
network mismatching going on here -- that the tcp btl is somehow 
routing traffic differently than the tcp ptl (and therefore taking 
longer -- timing out and the like).
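(For what it's worth, the most direct way to see the mismatch is to run the identical HPL problem over both stacks and compare the Gflops column in HPL.out -- a sketch reusing your command lines, with ompi_info only there to confirm which components your build actually contains:)

    # 3rd-generation path: ob1 PML over the tcp BTL (the default)
    mpirun -mca btl tcp --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl

    # 2nd-generation path: teg PML over the tcp PTL
    mpirun -mca pml teg --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl

    # list the PML/PTL/BTL components present in this installation
    ompi_info | grep -i -E "pml|ptl|btl"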

We did make some improvements to the TCP subnet mask matching code for 
rc5; I hate to ask again, but could you try the latest nightly snapshot 
tarball?

	http://www.open-mpi.org/nightly/v1.0/
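Roughly, grabbing and building a snapshot looks like the following. The tarball name is only a placeholder -- use whatever the newest file in that directory is -- and the prefix assumes you keep installing into /home/allan/openmpi:

    wget http://www.open-mpi.org/nightly/v1.0/openmpi-1.0rcX-SNAPSHOT.tar.gz   # placeholder name
    tar xzf openmpi-1.0rcX-SNAPSHOT.tar.gz
    cd openmpi-1.0rcX-SNAPSHOT
    ./configure --prefix=/home/allan/openmpi
    make all install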

> 4. My cluster is an experimental basement cluster [BSquared = Brampton
> Beowulf] built out of 6 x86 Durons, 2 Athlons, 2 Semprons, two P4s, 2
> 64-bit x86_64 AMD64 Athlons and two x86_64 AMD Semprons (754 pin), for a
> total of 16 machines running FC3 and the OSCAR beta cluster software. I
> have not tried it with the latest Open MPI snapshot yet, but I will
> tonight. I think I should reinstall FC3 on the head node (a P4 2.8 GHz),
> reinstall all the compute nodes with the Nov 3, 2005 OSCAR beta and
> today's (Nov 3, 2005) Open MPI 1.0 snapshot, and try again. I could have
> made an error somewhere before. It should not take me long. But I doubt
> it, as MPICH2 and Open MPI with the pml teg switches give good,
> comparable performance. I was not using jumbo MTU frames either, just
> 1500 bytes. It is not homogeneous (BSquared) but it is a good test setup.
> If you have any advice, please tell me and I will try it out.
> Thank you and good luck!
> Allan
>
> On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:
>
>> > On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:
>> >
>>> >> We've done linpack runs recently w/ Infiniband, which result in
>>> >> performance comparable to mvapich, but not w/ the tcp port. Can you
>>> >> try running w/ an earlier version, specifying on the command line:
>>> >>
>>> >> -mca pml teg
>>> >>
>>> >> Hi Tim,
>>> >>   I tried the same cluster (16 node x86) with the switches -mca pml
>>> >> teg and I get good performance of 24.52 GFlops at N=22500 and block
>>> >> size NB=120.
>>> >> My command line now looks like:
>>> >> a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml
>>> >> teg -hostfile aa -np 16 ./xhpl
>>> >> hostfile = aa, containing the addresses of the 16 machines.
>>> >> I am using a GS116 16-port Netgear gigabit ethernet switch with Gnet
>>> >> Realtek gig ethernet cards.
>>> >> Why, PLEASE, do these switches pml teg make such a difference? It's
>>> >> 2.6 times more performance in GFlops than what I was getting without
>>> >> them.
>>> >> I tried version rc3 and not an earlier version.
>>> >> Thank you very much for your assistance!
>>> >>
>> >
>> > Sorry for the delay in replying to this...
>> >
>> > The "pml teg" switch tells Open MPI to use the 2nd generation TCP
>> > implementation rather than the 3rd generation TCP.  More specifically,
>> > the "PML" is the point-to-point management layer.  There are 2
>> > different components for this -- teg (2nd generation) and ob1 (3rd
>> > generation).  "ob1" is the default; specifying "--mca pml teg" tells
>> > Open MPI to use the "teg" component instead of ob1.
>> >
>> > Note, however, that teg and ob1 know nothing about TCP -- it's the 2nd
>> > order implications that make the difference here.  teg and ob1 use
>> > different back-end components to talk across networks:
>> >
>> > - teg uses PTL components (point-to-point transport layer -- 2nd gen)
>> > - ob1 uses BTL components (byte transfer layer -- 3rd gen)
>> >
>> > We obviously have TCP implementations for both the PTL and BTL.
>> > Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
>> > Unfortunately, as yet, little time has been spent optimizing the TCP
>> > BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
>> >
>> > We have spent the majority of our time, so far, optimizing the Myrinet
>> > and Infiniband BTLs (therefore showing that excellent performance is
>> > achievable in the BTLs).  However, I'm quite disappointed by the TCP
>> > BTL performance -- it sounds like we have a protocol mismatch that is
>> > arbitrarily slowing everything down, and something that needs to be
>> > fixed before 1.0 (it's not a problem with the BTL design, since IB and
>> > Myrinet performance is quite good -- just a problem/bug in the TCP
>> > implementation of the BTL).  That much performance degradation is
>> > clearly unacceptable.
>> >
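(The components described above can also be inspected directly; something like the following should list the two TCP transports and their tunable parameters, assuming this build's ompi_info supports the --param option:)

    ompi_info --param ptl tcp    # 2nd-generation TCP transport, used by "pml teg"
    ompi_info --param btl tcp    # 3rd-generation TCP transport, used by the default "pml ob1"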
>> > --
>> > {+} Jeff Squyres
>> > {+} The Open MPI Project
>> > {+} http://www.open-mpi.org/
>> >

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/