
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Infiniband performance Problem and stalling
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2012-09-06 04:03:04


On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - it's still not clear what a "Melanox III HCA 10G card" is.
Could you run "ibstat" and post the results?
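
For instance, the output should look roughly like this (values here are just
placeholders - yours will differ):

  $ ibstat
  CA 'mlx4_0'
      ...
      Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 40
          Link layer: InfiniBand

The "Rate" and (on a reasonably recent OFED) "Link layer" lines are the
interesting ones - they show the actual link speed and whether the port is
running InfiniBand or Ethernet.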

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?
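
Something along these lines should do, assuming the perftest package from
your OFED installation is available on both nodes (the host name below is
just a placeholder):

  # on the first node ("server" side):
  $ ib_write_bw

  # on the second node ("client" side), pointing at the first node:
  $ ib_write_bw <server-hostname>

That reports the raw RDMA write bandwidth between the two HCAs, independent
of MPI, which gives us a baseline to compare the Open MPI numbers against.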

Also, please see below.

> No, I haven't used 1.6; I was trying to stick with the standards on the Mellanox disk.
> Is there a known problem with 1.4.3?
>
> ------------------------------------------------------------------------
> *From:* Yevgeny Kliteynik <kliteyn_at_[hidden]>
> *To:* Randolph Pullen <randolph_pullen_at_[hidden]>; Open MPI Users <users_at_[hidden]>
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>
> Randolph,
>
> Some clarification on the setup:
>
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
>
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
>
>
> -- YK
>
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> > (reposted with consolidated information)
> > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G cards
> > running Centos 5.7 Kernel 2.6.18-274
> > Open MPI 1.4.3
> > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
> > On a Cisco 24 pt switch
> > Normal performance is:
> > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> > results in:
> > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
> > and:
> > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
> > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems fine.
> > The log_num_mtt and log_mtts_per_seg params are set to 20 and 2, respectively.
> > My application exchanges about a gig of data between the processes, with 2 sender and 2 consumer processes on each node plus 1 additional controller process on the starting node.
> > The program splits the data into 64K blocks and uses non-blocking sends and receives with busy/sleep loops to monitor progress until completion.
> > Each process owns a single buffer for these 64K blocks.
> > My problem is that I see better performance under IPoIB than I do on native IB (RDMA_CM).
> > My understanding is that IPoIB is limited to about 1G/s so I am at a loss to know why it is faster.
> > These 2 configurations are equivalent (about 8-10 seconds per cycle):
> > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
> > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog

When you say "--mca btl tcp,self", it means that openib btl is not enabled.
Hence "--mca btl_openib_flags" is irrelevant.
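
If you want those flags to take effect, openib has to be in the BTL list.
For example, a sketch based on your own command line (just with openib in
place of tcp):

  $ mpirun --mca btl openib,self --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 -H vh2,vh1 -np 9 --bycore prog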

> > And this one produces similar run times but seems to degrade with repeated cycles:
> > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog

You're running 9 ranks on two machines, but you're using IB for intra-node communication.
Is that intentional? If not, you can add the "sm" BTL and get better intra-node performance.
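
For example, something like this (again your own command line, just with
"sm" added to the BTL list):

  $ mpirun --mca btl openib,sm,self --mca mpi_leave_pinned 1 -H vh2,vh1 -np 9 --bycore prog

With "sm" in the list, ranks that share a node communicate through shared
memory instead of looping back through the HCA.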

-- YK

> > Other btl_openib_flags settings result in much lower performance.
> > Changing the first of the above configs to use openIB results in a 21 second run time at best. Sometimes it takes up to 5 minutes.
> > In all cases, OpenIB runs in twice the time it takes TCP, except if I push the small message max to 64K and force short messages. Then the openib times are the same as TCP and no faster.
> > With openib:
> > - Repeated cycles during a single run seem to slow down with each cycle
> > (usually by about 10 seconds).
> > - On occasions it seems to stall indefinitely, waiting on a single receive.
> > I'm still at a loss as to why. I can’t find any errors logged during the runs.
> > Any ideas appreciated.
> > Thanks in advance,
> > Randolph
> >
> >