
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Infiniband performance Problem and stalling
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2012-09-09 23:32:15


See my comments in line...

________________________________
From: Yevgeny Kliteynik <kliteyn_at_[hidden]>
To: Randolph Pullen <randolph_pullen_at_[hidden]>
Cc: OpenMPI Users <users_at_[hidden]>
Sent: Sunday, 9 September 2012 6:18 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling

Randolph,

On 9/7/2012 7:43 AM, Randolph Pullen wrote:
> Yevgeny,
> The ibstat results:
> CA 'mthca0'
> CA type: MT25208 (MT23108 compat mode)

What you have is an InfiniHost III HCA, which is a 4x SDR card. This card
has a theoretical peak of 10 Gb/s, which is 1 GB/s after IB bit encoding.

> And more interestingly, ib_write_bw:
> Conflicting CPU frequency values detected: 1600.000000 != 3301.000000
>
> What does "Conflicting CPU frequency values" mean?
>
> Examining the /proc/cpuinfo file however shows:
> processor : 0
> cpu MHz   : 3301.000
> processor : 1
> cpu MHz   : 3301.000
> processor : 2
> cpu MHz   : 1600.000
> processor : 3
> cpu MHz   : 1600.000
>
> Which seems oddly weird to me...

You need to have all the cores running at the highest clock to get better
numbers. Maybe the power governor is not set to optimal performance on these
machines. Google "Linux CPU scaling governor" for more info on this subject,
or ask your system admin to take care of the CPU frequencies.

Once this is done, check all the pairs of your machines - make sure that you
get a good number with ib_write_bw. Note that if you have a slower machine in
the cluster, overall application performance will suffer from it.
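(Archive note: one way to apply Yevgeny's governor suggestion - a minimal sketch assuming the standard Linux cpufreq sysfs layout; writing these files needs root, and the paths/governor names are the stock kernel ones, not from the thread.)

```shell
# Sketch, assuming the standard Linux cpufreq sysfs layout (root required).
# Pin every core's scaling governor to "performance" so it stays at max clock.
for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done

# Confirm: every core should now report its maximum frequency.
cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_cur_freq
```

On distros of that era the same effect could come from the `cpuspeed` or `cpufrequtils` service; the sysfs write is just the most direct route.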
I have anchored the clock speeds to:

[root_at_vh1 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
3600000
3600000
3600000
3600000
3600000
3600000
3600000
3600000

[root_at_vh2 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
3200000
3200000
3200000
3200000

However, /proc/cpuinfo still reports them incorrectly:

[deepcloud_at_vh2 c]$ grep MHz /proc/cpuinfo
cpu MHz         : 3300.000
cpu MHz         : 1600.000
cpu MHz         : 1600.000
cpu MHz         : 1600.000

I don't think this is the problem, so I used the -F option in ib_write_bw to push ahead, i.e.:

[deepcloud_at_vh2 c]$ ib_write_bw -F vh1
------------------------------------------------------------------
                    RDMA_Write BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x04 QPN 0xaa0408 PSN 0xf9c072 RKey 0x59260052 VAddr 0x002b03a8af3000
 remote address: LID 0x03 QPN 0x8b0408 PSN 0xe4890d RKey 0x4a62003c VAddr 0x002b8e44297000
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Warning: measured timestamp frequency 3092.95 differs from nominal 3300 MHz
 65536      5000           937.61             937.60
------------------------------------------------------------------

> > On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> > > (reposted with consolidated information)
> > > I have a test rig comprising 2 i7 systems, 8GB RAM, with Mellanox InfiniHost III HCA 10G cards
> > > running Centos 5.7 Kernel 2.6.18-274
> > > Open MPI 1.4.3
> > > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
> > > On a Cisco 24-port switch
> > > Normal performance is:
> > > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> > > results in:
> > > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
> > > and:
> > > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> > > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec

These numbers look fine - 958 MB/s on IB is close to the theoretical limit.
654 MB/s for IPoIB looks fine too.

> > > My problem is I see better performance under IPoIB than I do on native IB (RDMA_CM).

I don't see this in your numbers. What am I missing?

Runs in 9 seconds:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog

Runs in 24 seconds or more:
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog

Note:
- adding sm to the fastest openib run results in a 13-second penalty
- subsequent runs with openib usually add at least 10 seconds per run, or stall

> > > My understanding is that IPoIB is limited to about 1 GB/s, so I am at a loss to know why it is faster.
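(Archive note: the "close to theoretical limit" remark checks out arithmetically. A quick integer sketch using the standard InfiniBand 4x SDR parameters - 4 lanes at 2.5 Gb/s signalling with 8b/10b encoding; these figures are general IB facts, not from the thread.)

```shell
# 4x SDR link: 4 lanes x 2.5 Gb/s signalling rate, 8b/10b line encoding.
lanes=4
gbps_per_lane_x10=25                  # 2.5 Gb/s, scaled by 10 to keep integers
data_gbps=$(( lanes * gbps_per_lane_x10 * 8 / 10 / 10 ))   # 8 Gb/s of payload
data_mbs=$(( data_gbps * 1000 / 8 ))                        # 1000 MB/s ceiling
echo "theoretical data rate: ${data_mbs} MB/s"
```

So the measured 958 MB/s (and the 937 MB/s from ib_write_bw) is within a few percent of the 1000 MB/s ceiling once encoding overhead is accounted for.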
Again, I see IPoIB performance under 1 GB/s.

> > > And this one produces similar run times but seems to degrade with repeated cycles:
> > > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
>
> You're running 9 ranks on two machines, but you're using IB for intra-node communication.
> Is that intentional? If not, you can add the "sm" btl and see performance improve.

Also, don't forget to include the "sm" btl if you have more than 1 MPI rank per node.

See above: adding sm to the fastest openib run results in a 13-second penalty.

-- YK
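(Archive note: for readers landing here from a search, the BTL combination Yevgeny is describing looks like the following - host names and the `prog` binary are taken from the thread; whether sm actually helps clearly depends on the system, as the 13-second penalty reported above shows.)

```shell
# sm handles ranks on the same node via shared memory, openib handles
# inter-node traffic over InfiniBand, and self handles a rank's messages
# to itself. Open MPI picks the best available BTL per peer pair.
mpirun --mca btl openib,sm,self -H vh2,vh1 -np 9 --bycore prog
```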