
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: OB1 optimizations
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-01-08 11:57:07


Sorry, should have said that this is a different cluster. These results
were on Sandy Bridge and the others were on Haswell. Don't have mvapich
on the Haswell cluster. Will check the current patch on Haswell later
today.

-Nathan

On Wed, Jan 08, 2014 at 05:50:34PM +0100, George Bosilca wrote:
> These results are much worse than the ones you sent in your previous email. What is the reason?
>
> George.
>
> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>
> > Ah, good catch. A new version is attached that should eliminate the race
> > window in the multi-threaded case. Performance numbers are still
> > looking really good. We beat mvapich2 in the small-message ping-pong by
> > a good margin. See the results below. The latency difference for large
> > messages is probably due to a difference in the max send size for vader
> > vs. mvapich.
> >
> > To answer Pasha's question: I don't see a noticeable difference in
> > performance for BTLs with no sendi function (this includes
> > ugni). OpenIB should get a boost. I will test that once I get an
> > allocation.
> >
> > CPU: Xeon E5-2670 @ 2.60 GHz
> >
> > Open MPI (-mca btl vader,self):
> > # OSU MPI Latency Test v4.1
> > # Size          Latency (us)
> > 0                       0.17
> > 1                       0.19
> > 2                       0.19
> > 4                       0.19
> > 8                       0.19
> > 16                      0.19
> > 32                      0.19
> > 64                      0.40
> > 128                     0.40
> > 256                     0.43
> > 512                     0.52
> > 1024                    0.67
> > 2048                    0.94
> > 4096                    1.44
> > 8192                    2.04
> > 16384                   3.47
> > 32768                   6.10
> > 65536                   9.38
> > 131072                 16.47
> > 262144                 29.63
> > 524288                 54.81
> > 1048576               106.63
> > 2097152               206.84
> > 4194304               421.26
> >
> >
> > mvapich2 1.9:
> > # OSU MPI Latency Test
> > # Size          Latency (us)
> > 0                       0.23
> > 1                       0.23
> > 2                       0.23
> > 4                       0.23
> > 8                       0.23
> > 16                      0.28
> > 32                      0.28
> > 64                      0.39
> > 128                     0.40
> > 256                     0.40
> > 512                     0.42
> > 1024                    0.51
> > 2048                    0.71
> > 4096                    1.02
> > 8192                    1.60
> > 16384                   3.47
> > 32768                   5.05
> > 65536                   8.06
> > 131072                 14.82
> > 262144                 28.15
> > 524288                 53.69
> > 1048576               127.47
> > 2097152               235.58
> > 4194304               683.90
> >
> >
> > -Nathan
> >
> > On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
> >> The local request is not correctly released, leading to assert in debug
> >> mode. This is because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI,
> >> fact that leaves the request in an ACTIVE state, condition carefully
> >> checked during the call to destructor.
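> >>
> >> For illustration, the teardown this implies for a stack-allocated
> >> receive request (a minimal sketch, not the actual patch; the OBJ_*
> >> and FINI macros are the existing OMPI ones, everything else here is
> >> hypothetical):
> >>
> >>    mca_pml_ob1_recv_request_t req;
> >>    OBJ_CONSTRUCT(&req, mca_pml_ob1_recv_request_t);
> >>    /* ... post the request and wait for it to complete ... */
> >>    MCA_PML_BASE_RECV_REQUEST_FINI(&req.req_recv); /* clears ACTIVE */
> >>    OBJ_DESTRUCT(&req); /* the debug-build destructor asserts on
> >>                           still-ACTIVE requests */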
> >>
> >> I attached a second patch that fixes the issue above and implements a
> >> similar optimization for the blocking send.
> >>
> >> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
> >> optimization is horribly wrong in the multithreaded case, as it alters
> >> the send_sequence without storing it. If you create a gap in the
> >> send_sequence, a deadlock will __definitely__ occur. I strongly suggest
> >> you turn off the mca_pml_ob1_send_inline optimization in the
> >> multithreaded case. All the other optimizations should be safe in all
> >> cases.
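> >>
> >> To make the hazard concrete (a sketch, not the actual ob1 code; the
> >> function names here are hypothetical): the receiver delivers matched
> >> fragments strictly in sequence order, so a claimed-but-never-sent
> >> sequence number stalls everything behind it.
> >>
> >>    /* sender: the sequence number is claimed atomically */
> >>    uint16_t seq = atomic_fetch_add(&send_sequence, 1);
> >>    if (OMPI_SUCCESS != try_inline_send(buf, len, seq)) {
> >>        /* bailing out here without ever sending seq (inline or
> >>         * via the slow path) leaves a permanent gap */
> >>    }
> >>
> >>    /* receiver: strictly in-order delivery */
> >>    if (frag->seq == next_expected) {
> >>        deliver(frag);
> >>        next_expected++;           /* then drain the reordering queue */
> >>    } else {
> >>        queue_out_of_order(frag);  /* waits for the gap -> deadlock */
> >>    }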
> >>
> >> George.
> >>
> >> On Jan 8, 2014, at 01:15 , Shamis, Pavel <shamisp_at_[hidden]> wrote:
> >>
> >>> Overall it looks good. It would be helpful to validate performance
> >>> numbers for other interconnects as well.
> >>> -Pasha
> >>>
> >>>> -----Original Message-----
> >>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Nathan
> >>>> Hjelm
> >>>> Sent: Tuesday, January 07, 2014 6:45 PM
> >>>> To: Open MPI Developers List
> >>>> Subject: [OMPI devel] RFC: OB1 optimizations
> >>>>
> >>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
> >>>>
> >>>> Why: This patch contains two optimizations:
> >>>>
> >>>> - Introduce a fast send path for blocking send calls. This path uses
> >>>> the btl sendi function to put the data on the wire without the need
> >>>> to set up a send request (see the sketch after this list). In the
> >>>> case of btl/vader this can also avoid allocating/initializing a new
> >>>> fragment. With btl/vader this optimization improves small-message
> >>>> latency by 50-200 ns in ping-pong benchmarks. Larger messages may
> >>>> take a small hit in the range of 10-20 ns.
> >>>>
> >>>> - Use a stack-allocated receive request for blocking receives. This
> >>>> optimization saves the extra instructions associated with accessing
> >>>> the receive request free list. I was able to get another 50-200ns
> >>>> improvement in the small-message ping-pong with this optimization. I
> >>>> see no hit for larger messages.
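> >>>>
> >>>> The fast path has roughly this shape (an illustrative sketch only;
> >>>> the helper names and the simplified sendi signature are
> >>>> hypothetical, not the actual patch):
> >>>>
> >>>>    int blocking_send_fast_path(endpoint_t *ep, const void *buf,
> >>>>                                size_t len)
> >>>>    {
> >>>>        /* Try the BTL's send-immediate function first: no send
> >>>>         * request, and for btl/vader no fragment allocation. */
> >>>>        if (NULL != ep->btl->btl_sendi &&
> >>>>            OMPI_SUCCESS == ep->btl->btl_sendi(ep, buf, len)) {
> >>>>            return OMPI_SUCCESS;  /* data is already on the wire */
> >>>>        }
> >>>>        /* Otherwise fall back to the usual path and set up a
> >>>>         * full send request. */
> >>>>        return send_with_request(ep, buf, len);
> >>>>    }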
> >>>>
> >>>> When: These changes touch the critical path in ob1 and are targeted for
> >>>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
> >>>> next Friday (Jan 17).
> >>>>
> >>>> Some results from osu_latency on Haswell:
> >>>>
> >>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
> >>>> ./osu_latency
> >>>> # OSU MPI Latency Test v4.0.1
> >>>> # Size          Latency (us)
> >>>> 0                       0.11
> >>>> 1                       0.14
> >>>> 2                       0.14
> >>>> 4                       0.14
> >>>> 8                       0.14
> >>>> 16                      0.14
> >>>> 32                      0.15
> >>>> 64                      0.18
> >>>> 128                     0.36
> >>>> 256                     0.37
> >>>> 512                     0.46
> >>>> 1024                    0.56
> >>>> 2048                    0.80
> >>>> 4096                    1.12
> >>>> 8192                    1.68
> >>>> 16384                   2.98
> >>>> 32768                   5.10
> >>>> 65536                   8.12
> >>>> 131072                 14.07
> >>>> 262144                 25.30
> >>>> 524288                 47.40
> >>>> 1048576                91.71
> >>>> 2097152               195.56
> >>>> 4194304               487.05
> >>>>
> >>>>
> >>>> Patch Attached.
> >>>>
> >>>> -Nathan
> >
> >
> > <ob1_optimization_take3.patch>
>


