
Open MPI Development Mailing List Archives


From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-08-13 16:10:47


On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
>
> On 8/13/07 3:52 PM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > > Here are the items we have identified:
> >
> > All those things sound very promising. Is there a tmp branch where you
> > are going to work on this?
>
> tmp/latency
>
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.
> Do you have cycles to help with this?

I am very interested, not sure about cycles though. I'll get back from
my vacation next week and look over this list one more time to see where
I can help.

>
> Rich
>
> > ------------------------------------------------------------------------
> >
> > 1) remove 0 byte optimization of not initializing the convertor
> >
> > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in
> > mca_pml_ob1_send_request_start_copy
> > +++
> > Measure the convertor initialization before taking any other action.
> >
> > ------------------------------------------------------------------------
> >
> >
> > ------------------------------------------------------------------------
> >
> > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > solves another problem with IB: if there are a bunch of isends
> > outstanding, we end up buffering them all in the btl, marking
> > completion, and never get them on the wire because the BTL runs out of
> > credits; we never get credits back until finalize because we never
> > call progress, since the requests are complete. There is one issue
> > here: start_prepare calls prepare_src and start_copy calls alloc. I
> > think we can work around this by just always using prepare_src; the
> > OpenIB BTL will give a fragment off the free list anyway because the
> > fragment is less than the eager limit.
> > +++
> > Make the BTL return different return codes for the send. If the
> > fragment is gone, then the PML is responsible for marking the MPI
> > request as completed and so on. Only the updated BTLs will get any
> > benefit from this feature. Add a flag into the descriptor to allow
> > the BTL to free the fragment or not.
> >
> > Add a 3-level flag:
> > - BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after
> >   the send, and then it reports back a special return code to the PML
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released
> >   by the BTL once the completion callback has been triggered
> > - PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment
> >   at all (the PML is responsible for this)
> >
> > Return codes:
> > - done, and there will be no callbacks
> > - not done, wait for a callback later
> > - error state
> >
> > ------------------------------------------------------------------------
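The ownership levels and return codes described above could be sketched roughly as follows. All names here are illustrative, not the actual OMPI constants or signatures:

```c
#include <assert.h>

/* Three-level ownership flag on the descriptor (hypothetical names). */
typedef enum {
    BTL_HAVE_OWNERSHIP,                /* BTL may free the fragment right after the send */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees it when the completion callback fires */
    PML_HAVE_OWNERSHIP                 /* only the PML may free the fragment */
} frag_ownership_t;

/* The proposed btl_send return codes (hypothetical names). */
typedef enum {
    SEND_DONE,        /* on the wire, no callback will follow */
    SEND_NOT_ON_WIRE, /* buffered (e.g. out of IB credits); completion
                         arrives later via the callback, so the PML keeps
                         the request alive and keeps driving progress */
    SEND_ERROR
} send_status_t;

/* PML-side decision after btl_send returns: may the MPI request be marked
 * complete now, and is the PML responsible for freeing the fragment? */
static int pml_complete_now(send_status_t st, frag_ownership_t own,
                            int *pml_frees) {
    *pml_frees = (own == PML_HAVE_OWNERSHIP);
    return st == SEND_DONE; /* otherwise wait for the completion callback */
}
```

The point of the scheme is that only the return code, not a per-path flag like MCA_BTL_FLAGS_SEND_INPLACE, decides whether the PML completes the request immediately.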
> >
> >
> > ------------------------------------------------------------------------
> >
> > 3) Change the remote callback function (and tag value based on what
> > data we are sending); don't use mca_pml_ob1_recv_frag_callback for
> > everything! I think we need:
> >
> > mca_pml_ob1_recv_frag_match
> > mca_pml_ob1_recv_frag_rndv
> > mca_pml_ob1_recv_frag_rget
> >
> > mca_pml_ob1_recv_match_ack_copy
> > mca_pml_ob1_recv_match_ack_pipeline
> >
> > mca_pml_ob1_recv_copy_frag
> > mca_pml_ob1_recv_put_request
> > mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save us
> > 2 switches. Add more registrations in the BTL in order to jump directly
> > to the correct function (the first 3 require a match while the others
> > don't). Use 4 & 4 bits on the tag so each layer will have 4 bits of
> > tags [i.e. the first 4 bits for the protocol tag, and the lower 4 bits
> > are up to the protocol], and the registration table will still be
> > local to each component.
> >
> > ------------------------------------------------------------------------
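The 4 & 4 bit tag split could look like the following sketch, where the high nibble selects the protocol (so each component's registration table stays local) and the low nibble is free for that protocol's own sub-tags. Macro and function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define TAG_PROTO_SHIFT 4    /* high 4 bits: protocol tag */
#define TAG_SUB_MASK    0x0F /* low 4 bits: up to the protocol */

/* Pack a protocol id and a protocol-private sub-tag into one byte. */
static inline uint8_t make_tag(uint8_t proto, uint8_t sub) {
    return (uint8_t)((proto << TAG_PROTO_SHIFT) | (sub & TAG_SUB_MASK));
}

/* Unpack the two halves again. */
static inline uint8_t tag_proto(uint8_t tag) { return tag >> TAG_PROTO_SHIFT; }
static inline uint8_t tag_sub(uint8_t tag)   { return tag & TAG_SUB_MASK; }
```

With this split, the BTL can register one receive callback per protocol nibble and jump directly to the right handler instead of switching on a header type.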
> >
> >
> > ------------------------------------------------------------------------
> >
> > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> > I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> > take a function pointer for what it should call on a successful match.
> > So based on the receive callback we can pass the correct scheduling
> > function to invoke into the generic mca_pml_ob1_recv_frag_match.
> >
> > Recv_request progress is called in a generic way from multiple places,
> > and we do a big switch inside. In the match function we might want to
> > pass a function pointer to the successful-match progress function.
> > This way we will be able to specialize what happens after the match,
> > in a more optimized way. Or recv_request_match can return the match
> > and then the caller will have to specialize its action.
> >
> > ------------------------------------------------------------------------
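The function-pointer variant described above might be sketched like this. The types and names are hypothetical stand-ins, not the real ob1 signatures:

```c
#include <assert.h>

/* Minimal stand-in for a receive request. */
typedef struct recv_request {
    int matched;
    int progressed_by; /* records which progress path ran */
} recv_request_t;

/* The successful-match progress function, chosen by the caller. */
typedef void (*match_progress_fn)(recv_request_t *req);

static void progress_rndv(recv_request_t *req) { req->progressed_by = 1; }
static void progress_rget(recv_request_t *req) { req->progressed_by = 2; }

/* Generic match: no switch on hdr_type here. The specialization is the
 * callback the caller picked based on which BTL registration fired. */
static int recv_frag_match(recv_request_t *req, match_progress_fn progress) {
    req->matched = 1; /* stand-in for the real matching logic */
    progress(req);    /* jump straight to the right progress path */
    return 0;
}
```

Each specialized receive callback (rndv, rget, ...) passes its own progress function, so the big post-match switch disappears.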
> >
> >
> > ------------------------------------------------------------------------
> >
> > 5) Don't initialize the entire request. We can use item 2 above (if
> > we get back OMPI_SUCCESS from btl_send), then we don't need to fully
> > initialize the request; we need the convertor setup, but the rest we
> > can pass down the stack in order to set up the match header, and set
> > up the request only if we get OMPI_NOT_ON_WIRE back from btl_send.
> >
> > I think we need something like:
> > MCA_PML_BASE_SEND_REQUEST_INIT_CONV
> > and
> > MCA_PML_BASE_SEND_REQUEST_INIT_FULL
> >
> > so the first macro just sets up the convertor, and the second populates
> > all the rest of the request state in the case that we will need it
> > later because the fragment doesn't hit the wire.
> > +++
> > We all agreed.
> >
> > ------------------------------------------------------------------------
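The two-macro split could be sketched as below: a cheap convertor-only setup on the fast path, with the full setup paid only when the fragment does not hit the wire. The struct and macro names are illustrative, not the real MCA_PML_BASE_SEND_REQUEST_* definitions:

```c
#include <assert.h>

/* Minimal stand-in for a send request. */
typedef struct send_request {
    int conv_ready; /* convertor set up (always needed) */
    int full_ready; /* match header and the rest of the request state */
} send_request_t;

#define SEND_REQUEST_INIT_CONV(req) do { (req)->conv_ready = 1; } while (0)
#define SEND_REQUEST_INIT_FULL(req) do { (req)->full_ready = 1; } while (0)

/* Fast path: init the convertor, try the send; do the full init only if
 * the send was buffered and the request must survive for the callback. */
static void start_send(send_request_t *req, int not_on_wire) {
    SEND_REQUEST_INIT_CONV(req);
    if (not_on_wire) {
        SEND_REQUEST_INIT_FULL(req);
    }
}
```

On the common short-message path the full initialization is skipped entirely, which is exactly the cost item 5 wants off the isend/send path.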
> >
> >
> >
> > On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
> >
> > > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> > >
> > >>> Any objections? We can discuss what approaches we want to take
> > >>> (there's going to be some complications because of the PML driver,
> > >>> etc.); perhaps in the Tuesday Mellanox teleconf...?
> > >>>
> > >> My main objection is that the only reason you propose to do this
> > >> is some bogus benchmark? Is there any other reason to implement
> > >> header caching? I also hope you don't propose to break layering
> > >> and somehow cache PML headers in BTL.
> > >
> > > Gleb is hitting the main points I wanted to bring up. We had
> > > examined this header caching in the context of PSM a little while
> > > ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us would
> > > be about 1500 cycles of code that has little amounts of branches.
> > > For us, with a much bigger header and more fields to fetch from
> > > different structures, it was more like 350 cycles, which is on the
> > > order of 0.1us and not worth the effort (in code complexity,
> > > readability and, frankly, motivation for performance). Maybe there's
> > > more to it than just "code caching" -- like sending from pre-pinned
> > > headers or using RDMA with immediate, etc. But I'd be surprised
> > > to find out that the openib btl doesn't do the best thing here.
> > >
> > > I have pretty good evidence that for CM, the latency difference
> > > comes from the receive-side (in particular opal_progress). Doesn't
> > > the openib btl receive-side do something similar with opal_progress,
> > > i.e. register a callback function? It probably does something
> > > different, like checking a few RDMA mailboxes (or per-peer landing
> > > pads), but anything that gets called before or after it as part of
> > > opal_progress is cause for slowdown.
> > >
> > > . . christian
> > >
> > > --
> > > christian.bell_at_[hidden]
> > > (QLogic Host Solutions Group, formerly Pathscale)
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >

--
			Gleb.