Open MPI User's Mailing List Archives

From: Allan Menezes (amenezes007_at_[hidden])
Date: 2005-11-14 22:49:37


Dear Jeff,

I reorganized my cluster and ran the following test with 15 nodes:

[allan_at_a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[allan_at_a1 bench]$

It spewed out the above errors, but I continued the test for two and a
half hours, monitoring HPL.out. It gives a maximum of 21.77 GFlops for 15
nodes, which is not bad. I think the reason for those errors is that on
the four x86-64 machines, a13-16, the NIC connected to the gigabit LAN is
eth0 and not eth1 as on the rest. On the head node it is also eth0. I
removed one NIC from the head node to make things simpler to
troubleshoot. Here is HPL.out:

============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs.,
UTK
============================================================================
An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 25920
NB     : 120
PMAP   : Row-major process mapping
P      : 3
Q      : 5
PFACT  : Left Crout Right
NBMIN  : 2 4
NDIV   : 2
RFACT  : Left Crout Right
BCAST  : 1ring
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L2 25920 120 3 5 534.25 2.173e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L4 25920 120 3 5 536.98 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C2 25920 120 3 5 540.73 2.147e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C4 25920 120 3 5 533.76 2.175e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2R2 25920 120 3 5 537.28 2.161e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2R4 25920 120 3 5 533.38 2.177e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2L2 25920 120 3 5 540.45 2.148e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2L4 25920 120 3 5 536.87 2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2C2 25920 120 3 5 533.98 2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2C4 25920 120 3 5 535.31 2.169e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2R2 25920 120 3 5 536.65 2.164e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00C2R4 25920 120 3 5 536.97 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2L2 25920 120 3 5 534.09 2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2L4 25920 120 3 5 534.96 2.170e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2C2 25920 120 3 5 536.73 2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2C4 25920 120 3 5 536.91 2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2R2 25920 120 3 5 535.96 2.166e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034557 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00R2R4 25920 120 3 5 536.16 2.165e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0032195 ...... PASSED
============================================================================
Finished 18 tests with the following results:
   18 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
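
The "invalid interface" messages can be grouped by interface name to check the explanation given for them. The following is only a sketch: it assumes the third number in the leading [..] tag identifies the process, and the function name summarize_invalid_if is made up for illustration. It accepts the messages either on one line or wrapped over two lines as they appear in this mail.

```shell
#!/bin/sh
# Group "invalid interface" complaints from captured mpirun output.
# Assumption: in a tag such as [0,1,11] the third number is the process index.
# summarize_invalid_if is a hypothetical helper name, not an Open MPI tool.
summarize_invalid_if() {
    awk '
    {
        # A process tag such as [0,1,11] at the start of a line sets the index.
        if (match($0, /^\[[0-9]+,[0-9]+,[0-9]+\]/)) {
            tag = substr($0, RSTART + 1, RLENGTH - 2)
            split(tag, field, ",")
            proc = field[3]
        }
        # The complaint may be on the same line or on the following one.
        if (match($0, /invalid interface "[^"]+"/)) {
            iface = substr($0, RSTART + 19, RLENGTH - 20)
            print iface, proc
        }
    }' "$@" | sort -k1,1 -k2,2n
}
```

Usage, assuming the mpirun output was saved to a file called run.log: `summarize_invalid_if run.log`. Each output line is an interface name followed by the index of a process that rejected it.
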

Here is the result of the test carried out with --mca btl tcp --mca
btl_tcp_if_include eth1,eth0, which hangs:

[allan_at_a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1,eth0 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,1][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,6][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,7][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,2][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,3][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,8][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,4][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,5][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,10][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth1"
[0,1,9][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances]
invalid interface "eth0"

Tell me: if I connect all the 10/100 Mbps cards to a 10/100 Mbps switch
alongside the gigabit network and specify --mca btl_tcp_if_include
eth1,eth0 as before, will the problem go away, and might I get increased
bandwidth? I will be trying the same with the -mca pml teg switches to
see if there is a difference!

Thank you,
Allan

Message: 2
Date: Sun, 13 Nov 2005 15:51:30 -0500
From: Jeff Squyres <jsquyres_at_[hidden]>
Subject: Re: [O-MPI users] HPL anf TCP
To: Open MPI Users <users_at_[hidden]>
Message-ID: <f143e44670c59a2f345708e6e0fad549_at_[hidden]>
Content-Type: text/plain; charset=US-ASCII; format=flowed

On Nov 3, 2005, at 8:35 PM, Allan Menezes wrote:

>> 1. No, I have 4 NICs on the head node and two on each of the 15 other
>> compute nodes. I use Realtek 8169 gigabit ethernet cards on the
>> compute nodes as eth1 or eth0 (one only), connected to a gigabit
>> ethernet switch with a bisection bandwidth of 32 Gbps, and a built-in
>> 3Com gigabit ethernet NIC with the sk98lin driver on the head node
>> (eth3). The other ethernet cards on the head node are 10/100M: eth0
>> handles a network laser printer and eth2 internet access. Eth1 is a
>> spare 10/100M which I can remove. The compute nodes each have two
>> ethernet cards: one 10/100 Mbps ethernet not connected to anything
>> (built in to the M/B) and a PCI Realtek 8169 gigabit ethernet
>> connected to the TCP network LAN (gigabit). When I tried it without
>> the switches -mca pml teg, the maximum performance I would get was
>> 9 GFlops for P=4, Q=4, N = approx. 12-16 thousand, and NB ridiculously
>> low at a block size of 10. If I tried bigger block sizes it would run
>> for a long time for large N ~ 16,000 unless I killed xhpl. I use ATLAS
>> BLAS 3.7.11 libs compiled for each node and linked to HPL when
>> creating xhpl. I also use the Open MPI mpicc in the HPL makefile for
>> both compile and link. Maybe I should, according to the new FAQ, use
>> the TCP switch to use eth3 on the head node?
>
>

So if I'm reading that right, there's only one network that connects
the head node and the compute nodes, right?

>> 2. I have 512MB of memory per node, which is 8 GB total, so I can
>> safely go up to N = 22,000-24,000. I used sizes of 22000 for TCP teg
>> and did not run into problems. But if I do not specify the switches
>> suggested by Tim I get bad performance for N = 12000.
>
>

I must admit that I'm still befuddled by this -- we are absolutely
unable to duplicate this behavior. It *sounds* like there is some
network mismatching going on in here -- that the tcp btl is somehow
routing information differently than the tcp ptl (and therefore taking
longer -- timing out and the like).

We did make some improvements to the tcp subnet mask matching code for
rc5; I hate to ask again, but could you try with the latest nightly
snapshot tarball?

        http://www.open-mpi.org/nightly/v1.0/
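
On the problem sizes quoted in point 2: HPL factors an N x N matrix of double-precision values, so the matrix alone takes 8*N^2 bytes. A rough sketch of that arithmetic follows. The function name hpl_mem_fraction is invented for illustration, and 1 MB is assumed to mean 1024*1024 bytes.

```shell
#!/bin/sh
# Percentage of aggregate RAM taken by the N x N double-precision HPL matrix.
# usage: hpl_mem_fraction N NODES MB_PER_NODE
# hpl_mem_fraction is a hypothetical helper; 1 MB is taken as 1024*1024 bytes.
hpl_mem_fraction() {
    awk -v n="$1" -v nodes="$2" -v mb="$3" 'BEGIN {
        matrix = 8 * n * n                  # bytes: 8 per double
        total  = nodes * mb * 1024 * 1024   # bytes of RAM across all nodes
        printf "%.1f\n", 100 * matrix / total
    }'
}

hpl_mem_fraction 22000 16 512   # the 16-node sizes mentioned in point 2
hpl_mem_fraction 24000 16 512
hpl_mem_fraction 25920 15 512   # the 15-node run reported at the top
```

This only counts the matrix itself. The operating system, MPI buffers and HPL workspace need room as well, so the percentage should stay well below 100.
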

>> 4. My cluster is an experimental Basement Cluster [BSquared = Brampton
>> Beowulf] built out of x86 Durons (6), 2 Athlons, 2 Semprons, two P4s, 2
>> 64-bit x86_64 AMD64 Athlons and two AMD x86_64 Semprons (754 pin) for a
>> total of 16 machines running FC3 and Oscar beta cluster software. I
>> have not tried it with the latest Open MPI snapshot yet but I will
>> tonight. I think I should reinstall FC3 on the head node (P4 2.8GHz)
>> and reinstall all the compute nodes with the Oscar beta of Nov 3, 2005
>> and today's (Nov 3, 2005) Open MPI 1.0 snapshot and try again. I could
>> have made an error somewhere before. It should not take me long. But I
>> doubt it, as MPICH2 and Open MPI with the switches pml teg give good,
>> comparable performance. I was not using jumbo MTU frames either, just
>> 1500 bytes. It is not homogeneous (BSquared) but a good test setup.
>> If you have any advice, please tell me and I could try it out.
>> Thank you and good luck!
>> Allan
>>
>> On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:
>>
>>>>> > On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:
>>>>> >
>>>>>>>> >> We've done linpack runs recently w/ Infiniband, which result in
>>>>>>>> >> performance comparable to mvapich, but not w/ the tcp port.
>>>>>>>> >> Can you try running w/ an earlier version, specify on the
>>>>>>>> >> command line:
>>>>>>>> >>
>>>>>>>> >> -mca pml teg
>>>>>>>> >>
>>>>>>>> >> Hi Tim,
>>>>>>>> >> I tried the same cluster (16 node x86) with the switches -mca
>>>>>>>> >> pml teg and I get good performance of 24.52Gflops at N=22500
>>>>>>>> >> and Block size NB=120.
>>>>>>>> >> My command line now looks like :
>>>>>>>> >> a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml teg -hostfile aa -np 16 ./xhpl
>>>>>>>> >> hostfile = aa, containing the addresses of the 16 machines.
>>>>>>>> >> I am using a GS116 16 port netgear Gigabit ethernet switch with
>>>>>>>> >> Gnet realtek gig ethernet cards
>>>>>>>> >> Why, PLEASE, do these switches pml teg make such a difference?
>>>>>>>> >> It's 2.6 times more performance in GFlops than what I was
>>>>>>>> >> getting without them.
>>>>>>>> >> I tried version rc3 and not an earlier version.
>>>>>>>> >> Thank you very much for your assistance!
>>>>>>>> >>
>>>>> >
>>>>> > Sorry for the delay in replying to this...
>>>>> >
>>>>> > The "pml teg" switch tells Open MPI to use the 2nd generation TCP
>>>>> > implementation rather than the 3rd generation TCP. More
>>>>> > specifically, the "PML" is the point-to-point management layer.
>>>>> > There are 2 different components for this -- teg (2nd generation)
>>>>> > and ob1 (3rd generation). "ob1" is the default; specifying
>>>>> > "--mca pml teg" tells Open MPI to use the "teg" component instead
>>>>> > of ob1.
>>>>> >
>>>>> > Note, however, that teg and ob1 know nothing about TCP -- it's the
>>>>> > 2nd order implications that make the difference here. teg and ob1
>>>>> > use different back-end components to talk across networks:
>>>>> >
>>>>> > - teg uses PTL components (point-to-point transport layer -- 2nd gen)
>>>>> > - ob1 uses BTL components (byte transfer layer -- 3rd gen)
>>>>> >
>>>>> > We obviously have TCP implementations for both the PTL and BTL.
>>>>> > Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
>>>>> > Unfortunately, as yet, little time has been spent optimizing the TCP
>>>>> > BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
>>>>> >
>>>>> > We have spent the majority of our time, so far, optimizing the
>>>>> > Myrinet and Infiniband BTLs (therefore showing that excellent
>>>>> > performance is achievable in the BTLs). However, I'm quite
>>>>> > disappointed by the TCP BTL performance -- it sounds like we have a
>>>>> > protocol mismatch that is arbitrarily slowing everything down, and
>>>>> > something that needs to be fixed before 1.0 (it's not a problem
>>>>> > with the BTL design, since IB and Myrinet performance is quite good
>>>>> > -- just a problem/bug in the TCP implementation of the BTL). That
>>>>> > much performance degradation is clearly unacceptable.
>>>>> >
>>>>> > --
>>>>> > {+} Jeff Squyres
>>>>> > {+} The Open MPI Project
>>>>> > {+} http://www.open-mpi.org/
>>>>> >
>>>>> > _______________________________________________
>>>>> > users mailing list
>>>>> > users_at_[hidden]
>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
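
For comparing the two PML components described in the quoted explanation (teg and ob1), the command line differs only in the value given to --mca pml. The sketch below prints the two command lines rather than running them. The hostfile name, process count and binary are copied from the command quoted in the thread, and hpl_cmd is a made-up helper name.

```shell
#!/bin/sh
# Dry run: print the mpirun command line for a given PML component.
# "teg" and "ob1" are the two components named in the thread; hpl_cmd is a
# hypothetical helper, and the remaining arguments come from the quoted command.
hpl_cmd() {
    pml="$1"
    echo "mpirun --mca pml $pml -hostfile aa -np 16 ./xhpl"
}

for pml in teg ob1; do
    hpl_cmd "$pml"
done
```

Running each printed command with the same HPL.dat and comparing the Gflops column in HPL.out gives a like-for-like comparison of the two code paths.
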