
Open MPI Development Mailing List Archives


From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-08-13 15:52:16


On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:
>
All of those things sound very promising. Is there a tmp branch where
you are going to work on this?

>
> ----------------------------------------------------------------------------
>
> 1) Remove the 0-byte optimization of not initializing the convertor.
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy.
> +++
> Measure the cost of the convertor initialization before taking any
> other action.
> ----------------------------------------------------------------------------
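
To make 1) concrete, here is a rough sketch of the branch in question
and the unconditional variant. The structures are hypothetical,
simplified stand-ins, not the actual convertor or
MCA_PML_BASE_SEND_REQUEST_INIT code:

#include <stddef.h>
#include <stdio.h>

struct convertor    { const void *buf; size_t count; int initialized; };
struct send_request { struct convertor conv; size_t count; };

/* current form: one extra "if" on every send request init */
static void request_init_with_branch(struct send_request *req,
                                     const void *buf, size_t count)
{
    req->count = count;
    if (count != 0) {                  /* the branch item 1 wants to drop */
        req->conv.buf = buf;
        req->conv.count = count;
        req->conv.initialized = 1;
    }
}

/* proposed form: always initialize the convertor, no branch */
static void request_init_unconditional(struct send_request *req,
                                       const void *buf, size_t count)
{
    req->count = count;
    req->conv.buf = buf;
    req->conv.count = count;           /* cheap even when count == 0 */
    req->conv.initialized = 1;
}

int main(void)
{
    struct send_request a = {{NULL, 0, 0}, 0}, b = {{NULL, 0, 0}, 0};
    char buf[4] = {0};

    request_init_with_branch(&a, buf, 0);    /* convertor left untouched */
    request_init_unconditional(&b, buf, 0);  /* convertor always set up  */
    printf("%d %d\n", a.conv.initialized, b.conv.initialized);
    return 0;
}

Whether the unconditional init is a win is exactly what the "measure
first" note above asks for.
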
>
> ----------------------------------------------------------------------------
>
> 2) Get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed and
> OMPI_NOT_ON_WIRE if it cannot. This also solves another problem:
> with IB, if there are a bunch of isends outstanding, we end up
> buffering them all in the BTL and marking completion, but they never
> get on the wire because the BTL runs out of credits, and we never get
> the credits back until finalize because we never call progress, since
> the requests are already complete. There is one issue here:
> start_prepare calls prepare_src and start_copy calls alloc. I think
> we can work around this by always using prepare_src; the OpenIB BTL
> will give us a fragment off the free list anyway because the fragment
> is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag to the descriptor to allow (or
> not) the BTL to free the fragment.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> the send, and the BTL then reports a special return code back to the
> PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> by the BTL once the completion callback has been triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
> at all (the PML is responsible for releasing it).
>
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
> ----------------------------------------------------------------------------
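
A rough sketch of the return-code contract 2) describes, using the
ownership flags from the proposal but local stand-in types and return
codes rather than the real Open MPI descriptor and BTL interfaces:

#include <stdio.h>

/* local stand-ins for the proposed return codes, not real ompi constants */
enum btl_send_rc {
    BTL_SEND_DONE,         /* fragment is gone, there will be no callback   */
    BTL_SEND_NOT_ON_WIRE,  /* not done yet, wait for the completion callback */
    BTL_SEND_ERROR         /* error state                                   */
};

/* the proposed 3-level ownership flag */
enum frag_ownership {
    BTL_HAVE_OWNERSHIP,                /* BTL may free right after the send  */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees once the callback fired  */
    PML_HAVE_OWNERSHIP                 /* only the PML may free the fragment */
};

struct descriptor { enum frag_ownership ownership; };

/* stand-in for the BTL send entry point */
static enum btl_send_rc btl_send(struct descriptor *des)
{
    (void)des;
    return BTL_SEND_DONE;   /* e.g. a small inline send that completed */
}

/* PML side: completion is decided by the return code instead of the
 * old MCA_BTL_FLAGS_SEND_INPLACE / start_copy vs. start_prepare split */
static void pml_start_send(struct descriptor *des)
{
    switch (btl_send(des)) {
    case BTL_SEND_DONE:
        printf("mark the MPI request complete now, no callback expected\n");
        break;
    case BTL_SEND_NOT_ON_WIRE:
        printf("leave the request pending; the callback completes it\n");
        break;
    case BTL_SEND_ERROR:
        printf("send failed\n");
        break;
    }
}

int main(void)
{
    struct descriptor des = { PML_HAVE_OWNERSHIP };
    pml_start_send(&des);
    return 0;
}

The IB credit problem above goes away because a buffered fragment
returns the "not on wire" code, so the request stays pending and the
PML keeps calling progress.
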
>
> ----------------------------------------------------------------------------
>
> 3) Change the remote callback function (and the tag value, based on
> what data we are sending); don't use mca_pml_ob1_recv_frag_callback
> for everything!
> I think we need:
>
> mca_pml_ob1_recv_frag_match
> mca_pml_ob1_recv_frag_rndv
> mca_pml_ob1_recv_frag_rget
>
> mca_pml_ob1_recv_match_ack_copy
> mca_pml_ob1_recv_match_ack_pipeline
>
> mca_pml_ob1_recv_copy_frag
> mca_pml_ob1_recv_put_request
> mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save
> us 2 switches. Add more registrations in the BTL in order to jump
> directly to the correct function (the first 3 require a match while
> the others don't). Split the tag 4 & 4 bits so each layer has 4 bits
> of tag space [i.e. the first 4 bits are the protocol tag and the
> lower 4 bits are up to the protocol], and the registration table will
> still be local to each component.
> ----------------------------------------------------------------------------
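
A rough sketch of the 4 & 4 bit tag split and the per-component
callback table. The names, the protocol id, and the plain array are
hypothetical stand-ins for the real registration machinery:

#include <stdint.h>
#include <stdio.h>

/* upper 4 bits select the protocol, lower 4 bits are protocol-private */
#define TAG_PROTO(tag) (((tag) >> 4) & 0x0f)
#define TAG_SUB(tag)   ((tag) & 0x0f)

#define PROTO_PML 0x1   /* hypothetical protocol id for the PML */

typedef void (*recv_cb_fn)(const void *hdr, size_t len);

static void recv_frag_match(const void *hdr, size_t len)
{ (void)hdr; printf("match callback, %zu bytes\n", len); }

static void recv_frag_rndv(const void *hdr, size_t len)
{ (void)hdr; printf("rndv callback, %zu bytes\n", len); }

static void recv_frag_rget(const void *hdr, size_t len)
{ (void)hdr; printf("rget callback, %zu bytes\n", len); }

/* per-component registration table, indexed by the lower 4 bits */
static recv_cb_fn pml_callbacks[16] = {
    [0] = recv_frag_match,
    [1] = recv_frag_rndv,
    [2] = recv_frag_rget,
};

/* BTL side: jump straight to the registered callback, no big switch */
static void btl_dispatch(uint8_t tag, const void *hdr, size_t len)
{
    if (TAG_PROTO(tag) == PROTO_PML && pml_callbacks[TAG_SUB(tag)] != NULL)
        pml_callbacks[TAG_SUB(tag)](hdr, len);
}

int main(void)
{
    uint8_t hdr[8] = {0};
    btl_dispatch((PROTO_PML << 4) | 1, hdr, sizeof hdr);  /* -> rndv */
    return 0;
}
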
>
> ----------------------------------------------------------------------------
>
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> take a function pointer for what it should call on a successful match.
> So based on the receive callback we can pass the correct scheduling
> function to invoke into the generic mca_pml_ob1_recv_frag_match.
>
> Recv_request_progress is called in a generic way from multiple
> places, and we do a big switch inside. In the match function we might
> want to pass a function pointer to the successful-match progress
> function. This way we will be able to specialize what happens after
> the match, in a more optimized way. Alternatively, recv_request_match
> can return the match and then the caller will have to specialize its
> action.
> ----------------------------------------------------------------------------
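
A rough sketch of passing the progress routine into the match function
instead of re-switching on the header type. The types and function
names are hypothetical, simplified stand-ins, not the real ob1 code:

#include <stdio.h>

struct recv_request { int matched; };

typedef void (*progress_fn)(struct recv_request *req);

/* specialized post-match progress routines */
static void progress_rndv(struct recv_request *req)
{ (void)req; printf("schedule rndv protocol\n"); }

static void progress_rget(struct recv_request *req)
{ (void)req; printf("schedule rget protocol\n"); }

/* generic match: no switch on hdr_type, it just calls what it was given */
static void recv_frag_match(struct recv_request *req, progress_fn progress)
{
    req->matched = 1;          /* actual matching logic elided */
    if (req->matched)
        progress(req);         /* specialized per receive callback */
}

int main(void)
{
    struct recv_request r1 = {0}, r2 = {0};
    recv_frag_match(&r1, progress_rndv);  /* called from the rndv callback */
    recv_frag_match(&r2, progress_rget);  /* called from the rget callback */
    return 0;
}
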
>
> ----------------------------------------------------------------------------
>
> 5) Don't initialize the entire request. We can use item 2 above: if
> we get back OMPI_SUCCESS from btl_send, then we don't need to fully
> initialize the request; we only need the convertor set up, and the
> rest we can pass down the stack in order to set up the match header
> and the request if we get OMPI_NOT_ON_WIRE back from btl_send.
>
> I think we need something like:
> MCA_PML_BASE_SEND_REQUEST_INIT_CONV
>
> and
> MCA_PML_BASE_SEND_REQUEST_INIT_FULL
>
> so the first macro just sets up the convertor, and the second
> populates the rest of the request state in case we need it later
> because the fragment doesn't hit the wire.
> +++
> We all agreed.
> ----------------------------------------------------------------------------
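
A rough sketch of the two-step initialization in 5), using the macro
names from the proposal but a toy request structure, local stand-in
return codes, and a stubbed btl_send:

#include <stddef.h>
#include <stdio.h>

struct send_request {
    const void *buf;            /* convertor part, always needed          */
    size_t      count;
    int dst, tag, comm;         /* "full" part, only needed if deferred   */
};

#define MCA_PML_BASE_SEND_REQUEST_INIT_CONV(req, b, c) \
    do { (req)->buf = (b); (req)->count = (c); } while (0)

#define MCA_PML_BASE_SEND_REQUEST_INIT_FULL(req, d, t, cm) \
    do { (req)->dst = (d); (req)->tag = (t); (req)->comm = (cm); } while (0)

/* local stand-ins for the return codes, not the real ompi constants */
enum { SEND_ON_WIRE = 0, SEND_NOT_ON_WIRE = 1 };

static int btl_send_stub(struct send_request *req)
{
    (void)req;
    return SEND_NOT_ON_WIRE;    /* pretend the BTL had to buffer it */
}

int main(void)
{
    struct send_request req;
    char data[16] = {0};

    /* always pay only for the convertor setup */
    MCA_PML_BASE_SEND_REQUEST_INIT_CONV(&req, data, sizeof data);

    if (btl_send_stub(&req) == SEND_NOT_ON_WIRE) {
        /* only now pay for the full request initialization */
        MCA_PML_BASE_SEND_REQUEST_INIT_FULL(&req, 1 /*dst*/, 42 /*tag*/, 0 /*comm*/);
        printf("deferred send fully initialized\n");
    }
    return 0;
}
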
>
>
>
> On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
>
> > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> >
> >>> Any objections? We can discuss what approaches we want to take
> >>> (there's going to be some complications because of the PML driver,
> >>> etc.); perhaps in the Tuesday Mellanox teleconf...?
> >>>
> >> My main objection is that the only reason you propose to do this
> >> is some bogus benchmark. Is there any other reason to implement
> >> header caching?
> >> I also hope you don't propose to break layering and somehow cache
> >> PML headers
> >> in BTL.
> >
> > Gleb is hitting the main points I wanted to bring up. We had
> > examined this header caching in the context of PSM a little while
> > ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us would
> > be about 1500 cycles of code with very few branches.
> > For us, with a much bigger header and more fields to fetch from
> > different structures, it was more like 350 cycles, which is on the
> > order of 0.1us and not worth the effort (in code complexity,
> > readability and, frankly, motivation for performance). Maybe there's
> > more to it than just "code caching" -- like sending from pre-pinned
> > headers or using RDMA with immediate, etc. But I'd be surprised
> > to find out that the openib btl doesn't do the best thing here.
> >
> > I have pretty good evidence that for CM, the latency difference comes
> > from the receive-side (in particular opal_progress). Doesn't the
> > openib btl receive-side do something similar with opal_progress,
> > i.e. register a callback function? It probably does something
> > different like check a few RDMA mailboxes (or per-peer landing pads)
> > but anything that gets called before or after it as part of
> > opal_progress is cause for slowdown.
> >
> > . . christian
> >
> > --
> > christian.bell_at_[hidden]
> > (QLogic Host Solutions Group, formerly Pathscale)
>
>

--
			Gleb.