The local request is not released correctly, which triggers an assert in debug mode. This happens because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE state, a condition that is explicitly checked in the destructor.
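To illustrate the lifecycle problem, here is a minimal sketch in plain C. It is not the ob1 code: every name in it (toy_request_t, toy_request_fini, and so on) is hypothetical, and toy_request_fini only stands in for the role that MCA_PML_BASE_RECV_REQUEST_FINI plays. The point is just that a stack-allocated request must be moved out of the ACTIVE state before it is destructed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for request states; not the real Open MPI types. */
typedef enum { REQ_INVALID, REQ_INACTIVE, REQ_ACTIVE } req_state_t;

typedef struct {
    req_state_t state;
} toy_request_t;

void toy_request_construct(toy_request_t *req) { req->state = REQ_INACTIVE; }

/* Starting the request marks it ACTIVE until it is finalized. */
void toy_request_start(toy_request_t *req) { req->state = REQ_ACTIVE; }

/* Assumed to play the role of the FINI step: the request becomes INACTIVE. */
void toy_request_fini(toy_request_t *req) { req->state = REQ_INACTIVE; }

/* The condition a debug-mode destructor would check. */
bool toy_request_can_destruct(const toy_request_t *req)
{
    return req->state != REQ_ACTIVE;
}

void toy_request_destruct(toy_request_t *req)
{
    assert(toy_request_can_destruct(req));
    req->state = REQ_INVALID;
}

/* Blocking receive using a stack-allocated request.
 * Returns true if the request could be destructed cleanly. */
bool toy_blocking_recv(bool call_fini)
{
    toy_request_t req;

    toy_request_construct(&req);
    toy_request_start(&req);

    /* ... wait for the message to complete ... */

    if (call_fini) {
        toy_request_fini(&req);
    }

    bool ok = toy_request_can_destruct(&req);
    if (ok) {
        toy_request_destruct(&req);
    }
    return ok;
}
```

In this sketch, toy_blocking_recv(false) models the path that skips the FINI step: the request is still ACTIVE when the function is about to return, so the destructor check cannot pass.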
I attached a second patch that fixes the issue above and implements a similar optimization for the blocking send.
Unfortunately, this is not enough. The mca_pml_ob1_send_inline optimization is horribly wrong in the multithreaded case, as it alters the send_sequence without storing it. If you create a gap in the send_sequence, a deadlock will definitely occur. I strongly suggest you turn off the mca_pml_ob1_send_inline optimization in the multithreaded case. All the other optimizations should be safe in all cases.
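A toy model of why a gap is fatal, assuming the receiver matches messages strictly in sequence order and parks anything that arrives early. This is a sketch under that assumption, not the ob1 matching code; toy_receiver_t, toy_receiver_arrive, and toy_run are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_WINDOW 16

typedef struct {
    uint16_t expected;        /* next sequence number the receiver will match */
    bool parked[TOY_WINDOW];  /* early arrivals waiting for their turn */
    int delivered;            /* number of messages matched so far */
} toy_receiver_t;

void toy_receiver_init(toy_receiver_t *r)
{
    memset(r, 0, sizeof(*r));
}

/* A message with sequence number seq arrives at the receiver. */
void toy_receiver_arrive(toy_receiver_t *r, uint16_t seq)
{
    assert(seq < TOY_WINDOW && seq >= r->expected);
    r->parked[seq] = true;

    /* Deliver every message that is now in order. */
    while (r->expected < TOY_WINDOW && r->parked[r->expected]) {
        r->parked[r->expected] = false;
        r->expected++;
        r->delivered++;
    }
}

/* The sender consumes sequence numbers 0, 1, 2. If lose_seq_1 is set,
 * number 1 is consumed but no message carrying it is ever sent.
 * Returns how many messages the receiver managed to match. */
int toy_run(bool lose_seq_1)
{
    toy_receiver_t r;
    uint16_t send_sequence = 0;

    toy_receiver_init(&r);

    for (int i = 0; i < 3; i++) {
        uint16_t seq = send_sequence++;
        if (lose_seq_1 && seq == 1) {
            continue;  /* sequence number consumed, nothing on the wire */
        }
        toy_receiver_arrive(&r, seq);
    }
    return r.delivered;
}
```

With the gap, the message carrying sequence number 2 stays parked forever because number 1 never arrives, so a blocking receive waiting on it would never complete.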
On Jan 8, 2014, at 01:15 , Shamis, Pavel <shamisp_at_[hidden]> wrote:
> Overall it looks good. It would be helpful to validate performance numbers for other interconnects as well.
>> -----Original Message-----
>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Nathan
>> Sent: Tuesday, January 07, 2014 6:45 PM
>> To: Open MPI Developers List
>> Subject: [OMPI devel] RFC: OB1 optimizations
>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>> What: This patch contains two optimizations:
>> - Introduce a fast send path for blocking send calls. This path uses
>> the btl sendi function to put the data on the wire without the need
>> for setting up a send request. In the case of btl/vader this can
>> also avoid allocating/initializing a new fragment. With btl/vader
>> this optimization improves small message latency by 50-200ns in
>> ping-pong type benchmarks. Larger messages may take a small hit in
>> the range of 10-20ns.
>> - Use a stack-allocated receive request for blocking receives. This
>> optimization saves the extra instructions associated with accessing
>> the receive request free list. I was able to get another 50-200ns
>> improvement in the small-message ping-pong with this optimization. I
>> see no hit for larger messages.
>> When: These changes touch the critical path in ob1 and are targeted for
>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
>> next Friday (Jan 17).
>> Some results from osu_latency on haswell:
>> hjelmn_at_cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>> # OSU MPI Latency Test v4.0.1
>> # Size Latency (us)
>> 0 0.11
>> 1 0.14
>> 2 0.14
>> 4 0.14
>> 8 0.14
>> 16 0.14
>> 32 0.15
>> 64 0.18
>> 128 0.36
>> 256 0.37
>> 512 0.46
>> 1024 0.56
>> 2048 0.80
>> 4096 1.12
>> 8192 1.68
>> 16384 2.98
>> 32768 5.10
>> 65536 8.12
>> 131072 14.07
>> 262144 25.30
>> 524288 47.40
>> 1048576 91.71
>> 2097152 195.56
>> 4194304 487.05
>> Patch Attached.