
Open MPI Development Mailing List Archives


From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-08-13 11:12:33


I think we need to take a step back from micro-optimizations such as
header caching.

Rich, George, Brian and I are currently looking into latency
improvements. We came up with several areas of performance
enhancements that can be done with minimal disruption. The progress
issue that Christian and others have pointed out does appear to be a
problem, but it will take a bit more work. I would like to see progress
in these areas first as I really don't like the idea of caching more
endpoint state in OMPI for micro-benchmark latency improvements until
we are certain we have done the ground work for improving latency in
the general case.

Here are the items we have identified:

----------------------------------------------------------------------------
1) Remove the 0-byte optimization of not initializing the convertor.
      This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
"if" in mca_pml_ob1_send_request_start_copy.
+++
Measure the cost of convertor initialization before taking any other action.
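A minimal sketch of the two code paths to compare (hypothetical toy types, not the real opal_convertor_t); a microbenchmark over each variant would settle whether the branch is worth keeping:

```c
#include <stddef.h>

/* Stand-in for opal_convertor_t; a hypothetical toy, not the real type. */
typedef struct {
    size_t count;
    int initialized;
} toy_convertor_t;

static void convertor_init(toy_convertor_t *c, size_t count) {
    c->count = count;
    c->initialized = 1;
}

/* Current scheme: an "if" skips initialization for 0-byte sends. */
static void init_with_branch(toy_convertor_t *c, size_t count) {
    if (count > 0) {
        convertor_init(c, count);
    } else {
        c->count = 0;
        c->initialized = 0;
    }
}

/* Proposed scheme: always initialize, no branch. */
static void init_unconditional(toy_convertor_t *c, size_t count) {
    convertor_init(c, count);
}
```

Timing both variants over a few million iterations (e.g. with clock_gettime) would show whether the removed branch outweighs the cost of initializing the convertor for zero-byte messages.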
----------------------------------------------------------------------------
----------------------------------------------------------------------------
2) Get rid of mca_pml_ob1_send_request_start_prepare and
mca_pml_ob1_send_request_start_copy by removing the
MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
return OMPI_SUCCESS if the fragment can be marked as completed and
OMPI_NOT_ON_WIRE if it cannot. This solves another problem: with IB,
if there are a bunch of isends outstanding, we end up buffering them
all in the BTL and marking completion, but they never get on the wire
because the BTL runs out of credits; we never get credits back until
finalize because we never call progress, since the requests are
already complete. There is one issue here: start_prepare calls
prepare_src and start_copy calls alloc. I think we can work around
this by always using prepare_src; the OpenIB BTL will give a fragment
off the free list anyway because the fragment is smaller than the
eager limit.
+++
Make the BTL return different return codes for the send. If the
fragment is gone, then the PML is responsible for marking the MPI
request as completed, and so on. Only updated BTLs will get any
benefit from this feature. Add a flag to the descriptor that allows
or forbids the BTL to free the fragment.
Add a 3-level flag:
- BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after
the send, and the BTL then reports a special return code back to the PML
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released
by the BTL once the completion callback has been triggered
- PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment
at all (the PML is responsible for this)
Return codes:
- done, and there will be no callbacks
- not done, wait for a callback later
- error state
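A sketch in C of the proposed contract, with hypothetical names (the real flag and return-code values would live in the BTL interface headers; OMPI_SUCCESS_DONE and the toy handler below are illustrative only):

```c
/* Hypothetical ownership flags for the send descriptor. */
typedef enum {
    BTL_HAVE_OWNERSHIP,                /* BTL may free after send           */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees after completion cb     */
    PML_HAVE_OWNERSHIP                 /* BTL must never free the fragment  */
} frag_ownership_t;

/* Hypothetical btl_send return codes under the proposal. */
typedef enum {
    OMPI_SUCCESS_DONE = 0,  /* fragment sent; no callback will follow  */
    OMPI_NOT_ON_WIRE  = 1,  /* buffered; wait for completion callback  */
    OMPI_ERR_SEND     = -1  /* error state                             */
} btl_send_status_t;

/* PML side: complete the MPI request immediately only when the BTL
 * reports the fragment is already gone; otherwise wait for the
 * completion callback to do it. */
static int pml_handle_send_status(btl_send_status_t st, int *request_complete) {
    switch (st) {
    case OMPI_SUCCESS_DONE:
        *request_complete = 1;          /* mark MPI request complete now */
        return 0;
    case OMPI_NOT_ON_WIRE:
        *request_complete = 0;          /* callback will complete it     */
        return 0;
    default:
        *request_complete = 0;
        return -1;                      /* propagate the error           */
    }
}
```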
----------------------------------------------------------------------------
----------------------------------------------------------------------------
3) Change the remote callback function (and tag value based on what  
data we are sending), don't use mca_pml_ob1_recv_frag_callback for  
everything!
	I think we need:
	mca_pml_ob1_recv_frag_match
	mca_pml_ob1_recv_frag_rndv
	mca_pml_ob1_recv_frag_rget
	
	mca_pml_ob1_recv_match_ack_copy
	mca_pml_ob1_recv_match_ack_pipeline
	
	mca_pml_ob1_recv_copy_frag
	mca_pml_ob1_recv_put_request
	mca_pml_ob1_recv_put_fin
+++
Passing the callback as a parameter to the match function will save
us 2 switches. Add more registrations in the BTL in order to jump
directly into the correct function (the first 3 require a match while
the others don't). Split the tag 4 & 4: the upper 4 bits identify the
protocol and the lower 4 bits are up to the protocol, and the
registration table will still be local to each component.
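The 4 & 4 tag split could look like the following sketch (macro, table, and function names are all hypothetical):

```c
#include <stdint.h>

#define TAG_PROTO_BITS 4
#define TAG_SUB_MASK   0x0F

/* Compose a tag: upper 4 bits pick the protocol, lower 4 are private
 * to that protocol. */
#define TAG_MAKE(proto, sub) \
    ((uint8_t)(((proto) << TAG_PROTO_BITS) | ((sub) & TAG_SUB_MASK)))
#define TAG_PROTO(tag) ((uint8_t)((uint8_t)(tag) >> TAG_PROTO_BITS))
#define TAG_SUB(tag)   ((uint8_t)((tag) & TAG_SUB_MASK))

/* Per-component registration table: 16 sub-tags per protocol, so the
 * receive path can jump straight to the right callback instead of
 * running one big switch in mca_pml_ob1_recv_frag_callback. */
typedef void (*recv_cb_t)(void *frag);
static recv_cb_t cb_table[1 << TAG_PROTO_BITS][1 << TAG_PROTO_BITS];

static void dispatch(uint8_t tag, void *frag) {
    cb_table[TAG_PROTO(tag)][TAG_SUB(tag)](frag);
}
```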
----------------------------------------------------------------------------
----------------------------------------------------------------------------
4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
	I think what we can do here is modify mca_pml_ob1_recv_frag_match to
take a function pointer for what it should call on a successful match.
	So based on the receive callback we can pass the correct scheduling
function to invoke into the generic mca_pml_ob1_recv_frag_match.
Recv_request progress is called in a generic way from multiple places,
and we do a big switch inside. In the match function we might want to
pass a function pointer to the successful-match progress function.
This way we will be able to specialize what happens after the match
in a more optimized way. Or recv_request_match can return the
match and then the caller will have to specialize its action.
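A sketch of the function-pointer variant, with toy types standing in for the real request and header structures:

```c
/* Toy request standing in for mca_pml_ob1_recv_request_t. */
typedef struct {
    int progressed_as;  /* records which specialization ran */
} toy_recv_request_t;

/* The post-match progress function is chosen by the caller. */
typedef void (*match_progress_fn_t)(toy_recv_request_t *req);

static void progress_rndv(toy_recv_request_t *req) { req->progressed_as = 1; }
static void progress_rget(toy_recv_request_t *req) { req->progressed_as = 2; }

/* Generic match: on success, jump straight into the caller-supplied
 * specialization instead of re-switching on hdr_type afterwards. */
static int recv_frag_match(toy_recv_request_t *req, int matched,
                           match_progress_fn_t progress)
{
    if (!matched) {
        return 0;   /* no match; caller queues the fragment */
    }
    progress(req);  /* specialize without a second switch   */
    return 1;
}
```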
----------------------------------------------------------------------------
----------------------------------------------------------------------------
5) Don't initialize the entire request. We can use item 2 above: if
we get back OMPI_SUCCESS from btl_send, then we don't need to fully
initialize the request. We need the convertor set up, but the rest we
can pass down the stack in order to set up the match header and the
request if we get OMPI_NOT_ON_WIRE back from btl_send.
I think we need something like:
MCA_PML_BASE_SEND_REQUEST_INIT_CONV
and
MCA_PML_BASE_SEND_REQUEST_INIT_FULL
so the first macro just sets up the convertor, and the second
populates all the rest of the request state in case we need it
later because the fragment doesn't hit the wire.
+++
We all agreed.
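One way the two macros and the lazy full-init path might fit together (toy names throughout; btl_send is modeled only by its return code):

```c
/* Toy request standing in for mca_pml_base_send_request_t. */
typedef struct {
    int conv_ready;   /* convertor set up             */
    int fully_init;   /* full request state populated */
} toy_send_request_t;

/* Cheap path: only the convertor. */
#define SEND_REQUEST_INIT_CONV(req)  do { (req)->conv_ready = 1; } while (0)
/* Expensive path: match header and the rest of the request state. */
#define SEND_REQUEST_INIT_FULL(req)  do { (req)->fully_init = 1; } while (0)

enum { TOY_SUCCESS = 0, TOY_NOT_ON_WIRE = 1 };

/* Fast path: pay the full init only when the fragment did not make it
 * onto the wire and the request must therefore stick around. */
static void start_send(toy_send_request_t *req, int btl_send_rc) {
    SEND_REQUEST_INIT_CONV(req);            /* always needed           */
    if (btl_send_rc == TOY_NOT_ON_WIRE) {
        SEND_REQUEST_INIT_FULL(req);        /* only on the slow path   */
    }
}
```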
----------------------------------------------------------------------------
On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
> On Sun, 12 Aug 2007, Gleb Natapov wrote:
>
>>> Any objections?  We can discuss what approaches we want to take
>>> (there's going to be some complications because of the PML driver,
>>> etc.); perhaps in the Tuesday Mellanox teleconf...?
>>>
>> My main objection is that the only reason you propose to do this  
>> is some
>> bogus benchmark? Is there any other reason to implement header  
>> caching?
>> I also hope you don't propose to break layering and somehow cache  
>> PML headers
>> in BTL.
>
> Gleb is hitting the main points I wanted to bring up.  We had
> examined this header caching in the context of PSM a little while
> ago.  0.5us is much more than we had observed -- at 3GHz, 0.5us would
> be about 1500 cycles of code with few branches.
> For us, with a much bigger header and more fields to fetch from
> different structures, it was more like 350 cycles which is on the
> order of 0.1us and not worth the effort (in code complexity,
> readability and frankly motivation for performance).  Maybe there's
> more to it than just "code caching" -- like sending from pre-pinned
> headers or using the RDMA with immediate, etc.  But I'd be surprised
> to find out that openib btl doesn't do the best thing here.
>
> I have pretty good evidence that for CM, the latency difference comes
> from the receive-side (in particular opal_progress).  Doesn't the
> openib btl receive-side do something similar with opal_progress,
> i.e. register a callback function?  It probably does something
> different like check a few RDMA mailboxes (or per-peer landing pads)
> but anything that gets called before or after it as part of
> opal_progress is cause for slowdown.
>
>     . . christian
>
> -- 
> christian.bell_at_[hidden]
> (QLogic Host Solutions Group, formerly Pathscale)
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel