Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: OB1 optimizations
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-01-08 11:50:34


These results are much worse than the ones you sent in your previous email. What is the reason?

  George.

On Jan 8, 2014, at 17:33, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> Ah, good catch. A new version is attached that should eliminate the race
> window for the multi-threaded case. Performance numbers are still
> looking really good. We beat mvapich2 in the small message ping-pong by
> a good margin. See the results below. The latency difference for large
> messages is probably due to a difference in the max send size for vader
> vs mvapich2.
>
> To answer Pasha's question: I don't see a noticeable difference in
> performance for BTLs with no sendi function (this includes
> ugni). OpenIB should get a boost. I will test that once I get an
> allocation.
>
> CPU: Xeon E5-2670 @ 2.60 GHz
>
> Open MPI (-mca btl vader,self):
> # OSU MPI Latency Test v4.1
> # Size Latency (us)
> 0 0.17
> 1 0.19
> 2 0.19
> 4 0.19
> 8 0.19
> 16 0.19
> 32 0.19
> 64 0.40
> 128 0.40
> 256 0.43
> 512 0.52
> 1024 0.67
> 2048 0.94
> 4096 1.44
> 8192 2.04
> 16384 3.47
> 32768 6.10
> 65536 9.38
> 131072 16.47
> 262144 29.63
> 524288 54.81
> 1048576 106.63
> 2097152 206.84
> 4194304 421.26
>
>
> mvapich2 1.9:
> # OSU MPI Latency Test
> # Size Latency (us)
> 0 0.23
> 1 0.23
> 2 0.23
> 4 0.23
> 8 0.23
> 16 0.28
> 32 0.28
> 64 0.39
> 128 0.40
> 256 0.40
> 512 0.42
> 1024 0.51
> 2048 0.71
> 4096 1.02
> 8192 1.60
> 16384 3.47
> 32768 5.05
> 65536 8.06
> 131072 14.82
> 262144 28.15
> 524288 53.69
> 1048576 127.47
> 2097152 235.58
> 4194304 683.90
>
>
> -Nathan
>
> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>> The local request is not correctly released, leading to an assert in debug
>> mode. This is because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI,
>> which leaves the request in an ACTIVE state, a condition carefully
>> checked during the call to the destructor.
>>
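>> A minimal sketch of the lifecycle issue, using hypothetical names rather
>> than the real ob1 types: the destructor asserts that the request is no
>> longer ACTIVE, so a FINI step has to run before the stack-allocated
>> request goes out of scope.
>>
>>   #include <assert.h>
>>
>>   enum req_state { REQ_INVALID, REQ_ACTIVE, REQ_INACTIVE };
>>
>>   struct recv_req { enum req_state state; };    /* hypothetical request */
>>
>>   static void req_fini(struct recv_req *r)     { r->state = REQ_INVALID; }
>>   static void req_destruct(struct recv_req *r) { assert(r->state != REQ_ACTIVE); }
>>
>>   static void blocking_recv(void)
>>   {
>>       struct recv_req req = { .state = REQ_ACTIVE };  /* lives on the stack */
>>       /* ... wait for completion ... */
>>       req_fini(&req);      /* skipping this leaves the state ACTIVE ... */
>>       req_destruct(&req);  /* ... and this assert fires in debug builds */
>>   }
>>
>>   int main(void) { blocking_recv(); return 0; }
>>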
>> I attached a second patch that fixes the issue above and implements a
>> similar optimization for the blocking send.
>>
>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>> optimization is horribly wrong in the multithreaded case, as it alters the
>> send_sequence without storing it. If you create a gap in the send_sequence,
>> a deadlock will __definitely__ occur. I strongly suggest you turn off
>> the mca_pml_ob1_send_inline optimization in the multithreaded case. All
>> the other optimizations should be safe in all cases.
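>>
>> To make the hazard concrete, here is a rough sketch with hypothetical
>> names (not the actual ob1 code): every value taken from the sequence
>> counter has to be transmitted exactly once, because the receiver only
>> matches sequence N after it has seen N-1.
>>
>>   #include <stdatomic.h>
>>   #include <stdbool.h>
>>   #include <stdint.h>
>>
>>   static _Atomic uint16_t send_sequence;        /* per-peer in reality */
>>
>>   /* stubs standing in for the two transmit paths */
>>   static bool try_inline_send(uint16_t seq) { (void)seq; return false; }
>>   static void queued_send(uint16_t seq)     { (void)seq; }
>>
>>   static void send_one(void)
>>   {
>>       uint16_t seq = (uint16_t)atomic_fetch_add(&send_sequence, 1);
>>       if (!try_inline_send(seq)) {
>>           /* The same seq must be reused on the fallback path; taking a
>>            * fresh one would leave a hole the receiver waits on forever. */
>>           queued_send(seq);
>>       }
>>   }
>>
>>   int main(void) { send_one(); return 0; }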
>>
>> George.
>>
>> On Jan 8, 2014, at 01:15, Shamis, Pavel <shamisp_at_[hidden]> wrote:
>>
>>> Overall it looks good. It would be helpful to validate performance
>>> numbers for other interconnects as well.
>>> -Pasha
>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Nathan
>>>> Hjelm
>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>> To: Open MPI Developers List
>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>
>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>
>>>> Why: This patch contains two optimizations:
>>>>
>>>> - Introduce a fast send path for blocking send calls. This path uses
>>>> the btl sendi function to put the data on the wire without the need
>>>> for setting up a send request (a rough sketch of this path follows
>>>> below). In the case of btl/vader this can also avoid
>>>> allocating/initializing a new fragment. With btl/vader this
>>>> optimization improves small message latency by 50-200ns in
>>>> ping-pong type benchmarks. Larger messages may take a small hit in
>>>> the range of 10-20ns.
>>>>
>>>> - Use a stack-allocated receive request for blocking receives. This
>>>> optimization saves the extra instructions associated with accessing
>>>> the receive request free list. I was able to get another 50-200ns
>>>> improvement in the small-message ping-pong with this optimization. I
>>>> see no hit for larger messages.
>>>>
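>>>> A rough sketch of the fast-path shape described in the first bullet,
>>>> with hypothetical names rather than the actual PML/BTL interfaces:
>>>> small blocking sends try the btl's sendi-style function first and only
>>>> fall back to building a full send request when that is not possible.
>>>>
>>>>   #include <stdbool.h>
>>>>   #include <stddef.h>
>>>>
>>>>   /* stubs standing in for the real machinery */
>>>>   static bool btl_has_sendi(void)                    { return true; }
>>>>   static bool btl_sendi(const void *buf, size_t len) { (void)buf; (void)len; return true; }
>>>>   static void send_via_request(const void *buf, size_t len) { (void)buf; (void)len; }
>>>>
>>>>   #define EAGER_LIMIT 4096   /* assumed cutoff, not the real value */
>>>>
>>>>   static void blocking_send(const void *buf, size_t len)
>>>>   {
>>>>       /* Fast path: hand small messages straight to the btl without
>>>>        * allocating or initializing a send request. */
>>>>       if (len <= EAGER_LIMIT && btl_has_sendi() && btl_sendi(buf, len)) {
>>>>           return;
>>>>       }
>>>>       /* Fall back to the regular request-based path. */
>>>>       send_via_request(buf, len);
>>>>   }
>>>>
>>>>   int main(void) { blocking_send("x", 1); return 0; }
>>>>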
>>>> When: These changes touch the critical path in ob1 and are targeted for
>>>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
>>>> next Friday (Jan 17).
>>>>
>>>> Some results from osu_latency on haswell:
>>>>
>>>> [hjelmn_at_cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>>>> ./osu_latency
>>>> # OSU MPI Latency Test v4.0.1
>>>> # Size Latency (us)
>>>> 0 0.11
>>>> 1 0.14
>>>> 2 0.14
>>>> 4 0.14
>>>> 8 0.14
>>>> 16 0.14
>>>> 32 0.15
>>>> 64 0.18
>>>> 128 0.36
>>>> 256 0.37
>>>> 512 0.46
>>>> 1024 0.56
>>>> 2048 0.80
>>>> 4096 1.12
>>>> 8192 1.68
>>>> 16384 2.98
>>>> 32768 5.10
>>>> 65536 8.12
>>>> 131072 14.07
>>>> 262144 25.30
>>>> 524288 47.40
>>>> 1048576 91.71
>>>> 2097152 195.56
>>>> 4194304 487.05
>>>>
>>>>
>>>> Patch Attached.
>>>>
>>>> -Nathan
>
>
> <ob1_optimization_take3.patch>