Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: OB1 optimizations
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-01-10 08:24:19


Nathan,

When you get access to the machine, it might be interesting to show not only the after-patch performance but also what the trunk gets on the same architecture.

  George.

On Jan 8, 2014, at 18:09, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> Yeah. It's hard to say what the results will look like on Haswell. I
> expect they should show some improvement from George's change, but we
> won't know until I can get to a Haswell node. Hopefully one becomes
> available today.
>
> -Nathan
>
> On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
>> Never mind, since Nathan just clarified that the results are not
>> comparable.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargrove_at_[hidden]> wrote:
>>
>> Interestingly enough, the 4MB latency actually improved significantly
>> relative to the initial numbers.
>>
>> -Paul [Sent from my phone]
>>
>> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>>
>> These results are much worse than the ones you sent in your previous
>> email. What is the reason?
>>
>> George.
>>
>> On Jan 8, 2014, at 17:33, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>>
>>> Ah, good catch. A new version is attached that should eliminate the
>>> race window for the multi-threaded case. Performance numbers are
>>> still looking really good. We beat mvapich2 in the small message
>>> ping-pong by a good margin. See the results below. The latency
>>> difference for large messages is probably due to a difference in the
>>> max send size for vader vs. mvapich.
>>>
>>> To answer Pasha's question: I don't see a noticeable difference in
>>> performance for btls with no sendi function (this includes
>>> ugni). OpenIB should get a boost. I will test that once I get an
>>> allocation.
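>>>
>>> (A minimal sketch of why sendi-less btls are unaffected; the names
>>> and exact call shape here are simplified assumptions, not the patch:)
>>>
>>>     /* the fast path can only fire when the btl provides a
>>>      * send-immediate (sendi) function; without one we fall straight
>>>      * through to the normal request-based send, so nothing changes */
>>>     if (NULL != btl->btl_sendi) {
>>>         rc = btl->btl_sendi (btl, endpoint, &convertor, &match_hdr,
>>>                              sizeof (match_hdr), payload_size, order,
>>>                              flags, MCA_PML_OB1_HDR_TYPE_MATCH, &des);
>>>         if (OMPI_SUCCESS == rc) {
>>>             return OMPI_SUCCESS;  /* message is already on the wire */
>>>         }
>>>     }
>>>     /* otherwise (or if sendi failed) fall back to a full send request */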
>>>
>>> CPU: Xeon E5-2670 @ 2.60 GHz
>>>
>>> Open MPI (-mca btl vader,self):
>>> # OSU MPI Latency Test v4.1
>>> # Size Latency (us)
>>> 0 0.17
>>> 1 0.19
>>> 2 0.19
>>> 4 0.19
>>> 8 0.19
>>> 16 0.19
>>> 32 0.19
>>> 64 0.40
>>> 128 0.40
>>> 256 0.43
>>> 512 0.52
>>> 1024 0.67
>>> 2048 0.94
>>> 4096 1.44
>>> 8192 2.04
>>> 16384 3.47
>>> 32768 6.10
>>> 65536 9.38
>>> 131072 16.47
>>> 262144 29.63
>>> 524288 54.81
>>> 1048576 106.63
>>> 2097152 206.84
>>> 4194304 421.26
>>>
>>>
>>> mvapich2 1.9:
>>> # OSU MPI Latency Test
>>> # Size Latency (us)
>>> 0 0.23
>>> 1 0.23
>>> 2 0.23
>>> 4 0.23
>>> 8 0.23
>>> 16 0.28
>>> 32 0.28
>>> 64 0.39
>>> 128 0.40
>>> 256 0.40
>>> 512 0.42
>>> 1024 0.51
>>> 2048 0.71
>>> 4096 1.02
>>> 8192 1.60
>>> 16384 3.47
>>> 32768 5.05
>>> 65536 8.06
>>> 131072 14.82
>>> 262144 28.15
>>> 524288 53.69
>>> 1048576 127.47
>>> 2097152 235.58
>>> 4194304 683.90
>>>
>>>
>>> -Nathan
>>>
>>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>>>> The local request is not correctly released, leading to an assert in
>>>> debug mode. This is because you avoid calling
>>>> MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>>>> state, a condition carefully checked during the call to the
>>>> destructor.
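>>>>
>>>> (A minimal sketch of what the fix has to do, assuming the usual Open
>>>> MPI object conventions; the variable name is illustrative:)
>>>>
>>>>     /* return the request to the INACTIVE state before destroying it;
>>>>      * the destructor asserts on this in debug builds */
>>>>     MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq->req_recv);
>>>>     OBJ_DESTRUCT(recvreq);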
>>>>
>>>> I attached a second patch that fixes the issue above and implements
>>>> a similar optimization for the blocking send.
>>>>
>>>> Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>>>> optimization is horribly wrong in the multithreaded case, as it
>>>> alters the send_sequence without storing it. If you create a gap in
>>>> the send_sequence, a deadlock will __definitely__ occur. I strongly
>>>> suggest you turn off the mca_pml_ob1_send_inline optimization in the
>>>> multithreaded case. All the other optimizations should be safe in
>>>> all cases.
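>>>>
>>>> (To spell out the failure mode, a hedged sketch with simplified
>>>> names:)
>>>>
>>>>     /* every outgoing message consumes the next per-peer sequence
>>>>      * number; the receiver delivers strictly in sequence order */
>>>>     seq = OPAL_THREAD_ADD32 (&proc->send_sequence, 1);
>>>>
>>>>     /* if the inline path fails here and the fallback draws a *new*
>>>>      * sequence number instead of reusing 'seq', number 'seq' is
>>>>      * never sent; the receiver waits for it forever and every later
>>>>      * message stalls behind the gap: deadlock */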
>>>>
>>>> George.
>>>>
>>>> On Jan 8, 2014, at 01:15, Shamis, Pavel <shamisp_at_[hidden]> wrote:
>>>>
>>>>> Overall it looks good. It would be helpful to validate performance
>>>>> numbers for other interconnects as well.
>>>>> -Pasha
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of
>>>>>> Nathan Hjelm
>>>>>> Sent: Tuesday, January 07, 2014 6:45 PM
>>>>>> To: Open MPI Developers List
>>>>>> Subject: [OMPI devel] RFC: OB1 optimizations
>>>>>>
>>>>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>>>>>>
>>>>>> Why: This patch contains two optimizations:
>>>>>>
>>>>>> - Introduce a fast send path for blocking send calls. This path
>>>>>> uses the btl sendi function to put the data on the wire without
>>>>>> the need for setting up a send request. In the case of btl/vader
>>>>>> this can also avoid allocating/initializing a new fragment. With
>>>>>> btl/vader this optimization improves small message latency by
>>>>>> 50-200ns in ping-pong type benchmarks. Larger messages may take a
>>>>>> small hit in the range of 10-20ns.
>>>>>>
>>>>>> - Use a stack-allocated receive request for blocking receives.
>>>>>> This optimization saves the extra instructions associated with
>>>>>> accessing the receive request free list. I was able to get another
>>>>>> 50-200ns improvement in the small-message ping-pong with this
>>>>>> optimization. I see no hit for larger messages. (A sketch of this
>>>>>> idea follows the list.)
>>>>>>
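>>>>>> A hedged sketch of the second optimization (simplified names, not
>>>>>> the actual patch): instead of popping a request off the global
>>>>>> free list, the blocking receive builds one on the stack.
>>>>>>
>>>>>>     /* before: MCA_PML_OB1_RECV_REQUEST_ALLOC(recvreq) pops a
>>>>>>      * request from the free list (atomic op on the critical path)
>>>>>>      * and pushes it back on completion */
>>>>>>
>>>>>>     /* after: the request lives on the stack for the duration of
>>>>>>      * the blocking call, so the free list is never touched */
>>>>>>     mca_pml_ob1_recv_request_t recvreq;
>>>>>>     OBJ_CONSTRUCT(&recvreq, mca_pml_ob1_recv_request_t);
>>>>>>     /* ... initialize, start, and wait on the request ... */
>>>>>>     OBJ_DESTRUCT(&recvreq);
>>>>>>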
>>>>>> When: These changes touch the critical path in ob1 and are
>>>>>> targeted for 1.7.5. As such I will set a moderately long timeout.
>>>>>> Timeout set for next Friday (Jan 17).
>>>>>>
>>>>>> Some results from osu_latency on Haswell:
>>>>>>
>>>>>> [hjelmn_at_cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl
>>>>>> vader,self ./osu_latency
>>>>>> # OSU MPI Latency Test v4.0.1
>>>>>> # Size Latency (us)
>>>>>> 0 0.11
>>>>>> 1 0.14
>>>>>> 2 0.14
>>>>>> 4 0.14
>>>>>> 8 0.14
>>>>>> 16 0.14
>>>>>> 32 0.15
>>>>>> 64 0.18
>>>>>> 128 0.36
>>>>>> 256 0.37
>>>>>> 512 0.46
>>>>>> 1024 0.56
>>>>>> 2048 0.80
>>>>>> 4096 1.12
>>>>>> 8192 1.68
>>>>>> 16384 2.98
>>>>>> 32768 5.10
>>>>>> 65536 8.12
>>>>>> 131072 14.07
>>>>>> 262144 25.30
>>>>>> 524288 47.40
>>>>>> 1048576 91.71
>>>>>> 2097152 195.56
>>>>>> 4194304 487.05
>>>>>>
>>>>>>
>>>>>> Patch Attached.
>>>>>>
>>>>>> -Nathan
>>>
>>> <ob1_optimization_take3.patch>