
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Infiniband performance Problem and stalling
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2012-09-10 23:47:03


That's very interesting, Yevgeny.

Yes:
tcp,self ran in 12 seconds
tcp,self,sm ran in 27 seconds

Does anyone have any idea how this can be?
About half the data would go to local processes, so SM should pay dividends.

________________________________
From: Yevgeny Kliteynik <kliteyn_at_[hidden]>
To: Randolph Pullen <randolph_pullen_at_[hidden]>
Cc: OpenMPI Users <users_at_[hidden]>
Sent: Monday, 10 September 2012 9:11 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling

Randolph,

So what you're saying in short, leaving all the numbers aside, is the following:
In your particular application on your particular setup with this particular OMPI version,
1. openib BTL performs faster than shared memory BTL
2. TCP BTL performs faster than shared memory

IMHO, this indicates that you have some problem on your machines, and this problem is unrelated to the interconnect. Shared memory should be much faster than IB, not to mention IPoIB.

Could you run these two commands?

mpirun --mca btl tcp,self    -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl tcp,self,sm -H vh2,vh1 -np 9 --bycore prog

You will probably see better numbers w/o sm. Why? Don't know.
Perhaps someone who has better knowledge of the sm BTL can elaborate?

-- YK

On 9/10/2012 6:32 AM, Randolph Pullen wrote:
> See my comments in line...
>
> ------------------------------------------------------------------------
> *From:* Yevgeny Kliteynik <kliteyn_at_[hidden]>
> *To:* Randolph Pullen <randolph_pullen_at_[hidden]>
> *Cc:* OpenMPI Users <users_at_[hidden]>
> *Sent:* Sunday, 9 September 2012 6:18 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>
> Randolph,
>
> On 9/7/2012 7:43 AM, Randolph Pullen wrote:
> > Yevgeny,
> > The ibstat results:
> > CA 'mthca0'
> > CA type: MT25208 (MT23108 compat mode)
>
> What you have is an InfiniHost III HCA, which is a 4x SDR card.
> This card has a theoretical peak of 10 Gb/s, which is 1 GB/s in IB bit coding.
>
> > And more interestingly, ib_write_bw:
> > Conflicting CPU frequency values detected: 1600.000000 != 3301.000000
> >
> > What does Conflicting CPU frequency values mean?
> >
> > Examining the /proc/cpuinfo file however shows:
> > processor : 0
> > cpu MHz : 3301.000
> > processor : 1
> > cpu MHz : 3301.000
> > processor : 2
> > cpu MHz : 1600.000
> > processor : 3
> > cpu MHz : 1600.000
> >
> > Which seems oddly weird to me...
>
> You need to have all the cores running at the highest clock to get better numbers.
> Maybe the power governor is not set to optimal performance on these machines.
> Google for "Linux CPU scaling governor" to get more info on this subject, or
> contact your system admin and ask him to take care of the CPU frequencies.
>
> Once this is done, check all the pairs of your machines - ensure that you get
> a good number with ib_write_bw.
> Note that if you have a slower machine in the cluster, general application
> performance will suffer from this.
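The clock mismatch described above can be checked mechanically on each node. What follows is a minimal Python sketch and not part of the original thread: the function names `cpu_mhz_values` and `clocks_conflict` and the 1 MHz tolerance are assumptions chosen for illustration. It parses `cpu MHz` lines in the format quoted from /proc/cpuinfo and reports whether the cores disagree. The sample text is the output quoted in the message.

```python
import re

# Sample taken from the /proc/cpuinfo excerpt quoted in the message above.
SAMPLE_CPUINFO = """\
processor : 0
cpu MHz : 3301.000
processor : 1
cpu MHz : 3301.000
processor : 2
cpu MHz : 1600.000
processor : 3
cpu MHz : 1600.000
"""


def cpu_mhz_values(cpuinfo_text):
    """Return the 'cpu MHz' value of every processor entry, in file order."""
    matches = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text,
                         flags=re.MULTILINE)
    return [float(m) for m in matches]


def clocks_conflict(values, tolerance_mhz=1.0):
    """True when the cores do not all report (nearly) the same frequency.

    tolerance_mhz is a hypothetical threshold, not a value from the thread.
    """
    return bool(values) and (max(values) - min(values)) > tolerance_mhz


values = cpu_mhz_values(SAMPLE_CPUINFO)
print("reported MHz:", values)
print("conflict:", clocks_conflict(values))
```

On a live Linux x86 node the same functions could be fed the contents of /proc/cpuinfo instead of the sample string. This only reflects what /proc/cpuinfo reports, which may differ from the values under /sys/devices/system/cpu/*/cpufreq/.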
>
> I have anchored the clock speeds to:
> [root_at_vh1 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
> 3600000
> 3600000
> 3600000
> 3600000
> 3600000
> 3600000
> 3600000
> 3600000
>
> [root_at_vh2 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
> 3200000
> 3200000
> 3200000
> 3200000
>
> However /proc/cpuinfo still reports them incorrectly
> [deepcloud_at_vh2 c]$ grep MHz /proc/cpuinfo
> cpu MHz : 3300.000
> cpu MHz : 1600.000
> cpu MHz : 1600.000
> cpu MHz : 1600.000
>
> I don't think this is the problem, so I used the -F option in ib_write_bw to push ahead, i.e.:
> [deepcloud_at_vh2 c]$ ib_write_bw -F vh1
> ------------------------------------------------------------------
> RDMA_Write BW Test
> Number of qps : 1
> Connection type : RC
> TX depth : 300
> CQ Moderation : 50
> Link type : IB
> Mtu : 2048
> Inline data is used up to 0 bytes message
> local address: LID 0x04 QPN 0xaa0408 PSN 0xf9c072 RKey 0x59260052 VAddr 0x002b03a8af3000
> remote address: LID 0x03 QPN 0x8b0408 PSN 0xe4890d RKey 0x4a62003c VAddr 0x002b8e44297000
> ------------------------------------------------------------------
> #bytes #iterations BW peak[MB/sec] BW average[MB/sec]
> Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
> Test integrity may be harmed !
> Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
> Test integrity may be harmed !
> Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
> Test integrity may be harmed !
> Warning: measured timestamp frequency 3092.95 differs from nominal 3300 MHz
> 65536 5000 937.61 937.60
> ------------------------------------------------------------------
>
> *> > On 8/31/2012 10:53 AM, Randolph Pullen wrote:*
> *> > > (reposted with consolidated information)*
> *> > > I have a test rig comprising 2 i7 systems 8GB RAM with Mellanox III HCA 10G cards*
> *> > > running Centos 5.7 Kernel 2.6.18-274*
> *> > > Open MPI 1.4.3*
> *> > > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):*
> *> > > On a Cisco 24 pt switch*
> *> > > Normal performance is:*
> *> > > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong*
> *> > > results in:*
> *> > > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec*
> *> > > and:*
> *> > > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong*
> *> > > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec*
>
> *These numbers look fine - 958 MB/s on IB is close to the theoretical limit.*
> *654 MB/s for IPoIB looks fine too.*
>
> *> > > My problem is I see better performance under IPoIB than I do on native IB (RDMA_CM).*
>
> *I don't see this in your numbers. What do I miss?*
>
> Runs in 9 seconds:
> mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
> mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
>
> Runs in 24 seconds or more:
> mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
> mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog
> mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog
>
> Note:
> - adding sm to the fastest openib run results in a 13 second penalty
> - Subsequent runs with openib usually add at least 10 seconds per run or stall
>
> *> > > My understanding is that IPoIB is limited to about 1G/s so I am at a loss to know why it is faster.*
>
> *Again, I see IPoIB performance under 1 GB/s.*
>
> *> > > And this one produces similar run times but seems to degrade with repeated cycles:*
> *> > > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog*
> *>*
> *> You're running 9 ranks on two machines, but you're using IB for intra-node communication.*
> *> Is it intentional? If not, you can add "sm" btl and have performance improved.*
>
> *Also, don't forget to include "sm" btl if you have more than 1 MPI rank per node.*
> See above: adding sm to the fastest openib run results in a 13 second penalty
>
> *-- YK*
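The with-sm versus without-sm comparison discussed in this thread can be scripted so that every BTL selection is timed the same way and more than once, which matters here because repeated runs were reported to slow down or stall. The following is a hedged Python sketch, not something posted by either author: the helper names (`mpirun_command`, `time_command`, `compare`), the repeat count, and the 300 second timeout are all assumptions. The mpirun arguments themselves are copied from the commands quoted above.

```python
import subprocess
import time


def mpirun_command(btls, hosts=("vh2", "vh1"), nprocs=9, program="prog"):
    """Build the mpirun argument list for one BTL selection.

    The defaults mirror the commands quoted in the thread.
    """
    return ["mpirun", "--mca", "btl", ",".join(btls),
            "-H", ",".join(hosts), "-np", str(nprocs), "--bycore", program]


def time_command(argv, timeout=300, runner=subprocess.run):
    """Run argv once and return the elapsed wall-clock seconds.

    The timeout (an assumed value) turns a stalled run into an exception
    instead of hanging forever.
    """
    start = time.perf_counter()
    runner(argv, check=True, timeout=timeout)
    return time.perf_counter() - start


def compare(btl_sets, repeats=3, runner=subprocess.run):
    """Time each BTL selection `repeats` times; return {label: [seconds]}."""
    results = {}
    for btls in btl_sets:
        argv = mpirun_command(btls)
        results[",".join(btls)] = [
            time_command(argv, runner=runner) for _ in range(repeats)
        ]
    return results


# Example use on a machine where mpirun and prog are available:
#   for label, times in compare([("tcp", "self"), ("tcp", "self", "sm")]).items():
#       print(label, ["%.1f" % t for t in times])
```

Keeping the per-run times rather than an average makes any run-to-run degradation visible. The `runner` parameter exists only so the helpers can be exercised without launching MPI.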