Open MPI Development Mailing List Archives

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-05-25 23:31:33


On May 24, 2007, at 2:48 PM, George Bosilca wrote:

> I see the problem this patch tries to solve, but I fail to correctly
> understand the implementation. The patch affects all PMLs and BTLs in
> the code base by adding one more argument to some of the most often
> called functions. And there is only one BTL (openib) that seems to
> use it, while all the others completely ignore it. Moreover, there
> seems to already be a very similar mechanism based on the
> MCA_BTL_DES_FLAGS_PRIORITY flag, which can be set at the PML level
> in the btl_descriptor.
>
> So what's the difference between the additional argument and a
> correct usage of the MCA_BTL_DES_FLAGS_PRIORITY flag?

The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate
that a fragment is higher priority, but this fragment isn't higher
priority. It simply needs to be ordered w.r.t. a previous fragment,
an RDMA in this case.
That said, we could have just added an RDMA FIN flag, but in my
opinion this would mix protocol a bit too much between the BTL and
the PML.
With this fix, the BTL can assign an order tag to any descriptor if it
wishes. The order tag is only valid after a call to btl_send or
btl_put/get, and can then be used to request another descriptor later
that will be ordered w.r.t. the first one. The semantics here are
clear, and a BTL that doesn't care doesn't have to do anything (it
simply never assigns a valid order tag). These were the clearest
semantics I could come up with that still allow for numerous
implementations at the BTL level. For example, even specifying an
RDMA FIN flag directly to the BTL would restrict the BTL further than
these semantics do: all RDMAs would have to be sent on the same
endpoint/QP, because all the PML could indicate is that a FIN is being
sent. The BTL wouldn't have the context to know which RDMA the FIN
belongs to, and hence couldn't easily enforce ordering.
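
As a rough illustration of these semantics, here is a toy sketch in C.
It is not the actual BTL interface: every toy_* name, TOY_NO_ORDER
(standing in for MCA_BTL_NO_ORDER), and the idea that an order tag
simply names one of two queues are assumptions made up for this
example. It only shows the flow described above: post the RDMA, read
the order tag the BTL filled in, then request the FIN with that tag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Open MPI declarations. */
#define TOY_NO_ORDER 0xff   /* plays the role of MCA_BTL_NO_ORDER */
#define TOY_NUM_QPS  2      /* assumed: the toy BTL spreads work over 2 queues */

typedef struct {
    uint8_t order;  /* order tag; meaningful only after the descriptor is posted */
    int     qp;     /* queue the toy BTL posted it on, -1 while unposted */
} toy_descriptor_t;

typedef struct {
    int next_qp;    /* round-robin cursor for unordered descriptors */
} toy_btl_t;

/* Request a descriptor, optionally ordered behind an earlier one. */
static toy_descriptor_t toy_btl_alloc(uint8_t order)
{
    toy_descriptor_t des;
    des.order = order;
    des.qp = -1;
    return des;
}

/* Post a descriptor (stands in for btl_send / btl_put / btl_get). */
static void toy_btl_post(toy_btl_t *btl, toy_descriptor_t *des)
{
    if (des->order != TOY_NO_ORDER) {
        assert(des->order < TOY_NUM_QPS);
        des->qp = des->order;               /* same queue as the earlier op */
    } else {
        des->qp = btl->next_qp;             /* BTL is free to pick any queue */
        btl->next_qp = (btl->next_qp + 1) % TOY_NUM_QPS;
        des->order = (uint8_t)des->qp;      /* tag becomes valid on post */
    }
}

/* PML side: RDMA put, then a FIN ordered behind it. Returns 1 if both
 * ended up on the same queue. */
static int toy_put_then_fin(toy_btl_t *btl)
{
    toy_descriptor_t rdma = toy_btl_alloc(TOY_NO_ORDER);
    toy_btl_post(btl, &rdma);
    toy_descriptor_t fin = toy_btl_alloc(rdma.order);
    toy_btl_post(btl, &fin);
    return fin.qp == rdma.qp;
}

/* Contrast: a FIN requested with no order tag may land on another queue. */
static int toy_put_then_unordered_fin(toy_btl_t *btl)
{
    toy_descriptor_t rdma = toy_btl_alloc(TOY_NO_ORDER);
    toy_btl_post(btl, &rdma);
    toy_descriptor_t fin = toy_btl_alloc(TOY_NO_ORDER);
    toy_btl_post(btl, &fin);
    return fin.qp == rdma.qp;
}
```

A BTL that never needs ordering would, in this toy model, leave every
tag at TOY_NO_ORDER and ignore the argument entirely.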

OpenIB is the only BTL using this new functionality simply because I
haven't had a chance to fix up udapl, which I plan to do next week.
Note that GM semantics expose a similar problem (ordering is only
guaranteed for messages of the same priority), but Myrinet doesn't
buffer the way some of the IB/iWARP stuff can, so we won't see it
there.

These semantics also allow a number of optimizations. For example,
the BTL no longer has to give a local completion callback on an RDMA,
as the FIN message can be used for local completion of both.
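
A minimal sketch of that optimization, again with made-up names (the
toy_* types and functions are hypothetical and not part of the BTL
interface). It assumes the FIN was ordered behind the RDMA, so the
FIN's local completion implies the RDMA's:

```c
#include <stddef.h>

typedef void (*toy_cb_t)(void *ctx);

/* Hypothetical posted operation: cb == NULL means "no local completion
 * callback requested". */
typedef struct {
    toy_cb_t cb;
    void    *ctx;
} toy_op_t;

typedef struct {
    int rdma_done;
    int fin_done;
    int callbacks_run;
} toy_request_t;

/* Single callback, attached to the FIN only. Because the FIN is assumed
 * to be ordered behind the RDMA, it marks both as locally complete. */
static void toy_fin_complete(void *ctx)
{
    toy_request_t *req = ctx;
    req->callbacks_run++;
    req->fin_done = 1;
    req->rdma_done = 1;
}

/* Toy completion handling: invoke the callback only if one was set. */
static void toy_complete(toy_op_t *op)
{
    if (op->cb != NULL) {
        op->cb(op->ctx);
    }
}

/* Completes the RDMA (no callback) and then the FIN (one callback).
 * Returns the number of callbacks that ran. */
static int toy_run(toy_request_t *req)
{
    toy_op_t rdma = { NULL, NULL };
    toy_op_t fin  = { toy_fin_complete, req };
    toy_complete(&rdma);
    toy_complete(&fin);
    return req->callbacks_run;
}
```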

I am also looking at adding a BTL_PUT_IMMEDIATE, which provides remote
completion via an active message tag callback along with 64 bits of
data. This would allow us to bypass the FIN entirely if the network
supports it; MX is one example where this would be useful. OpenIB
supports a similar mechanism, but there are problems that would need
to be addressed, as OpenIB only delivers 32 bits with the remote
completion.

- Galen

>
> george.
>
> On May 24, 2007, at 3:51 PM, gshipman_at_[hidden] wrote:
>
>> Author: gshipman
>> Date: 2007-05-24 15:51:26 EDT (Thu, 24 May 2007)
>> New Revision: 14768
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/14768
>>
>> Log:
>> Add optional ordering to the BTL interface.
>> This is required to tighten up the BTL semantics. Ordering is not
>> guaranteed, but if the BTL returns an order tag in a descriptor
>> (other than MCA_BTL_NO_ORDER), then we may request another
>> descriptor that will obey ordering w.r.t. the other descriptor.
>>
>>
>> This will allow sane behavior for RDMA networks, where local
>> completion of an RDMA operation on the active side does not imply
>> remote completion on the passive side. If we send a FIN message
>> after local completion and the FIN is not ordered w.r.t. the RDMA
>> operation then badness may occur as the passive side may now try
>> to deregister the memory and the RDMA operation may still be
>> pending on the passive side.
>>
>> Note that this has no impact on networks that don't suffer from this
>> limitation as the ORDER tag can simply always be specified as
>> MCA_BTL_NO_ORDER.
>>
>> Text files modified:
>> trunk/ompi/mca/bml/bml.h | 29 +++++++++++++++--------
>> trunk/ompi/mca/btl/btl.h | 10 ++++++++
>> trunk/ompi/mca/btl/gm/btl_gm.c | 8 ++++++
>> trunk/ompi/mca/btl/gm/btl_gm.h | 3 ++
>> trunk/ompi/mca/btl/mx/btl_mx.c | 8 ++++++
>> trunk/ompi/mca/btl/mx/btl_mx.h | 3 ++
>> trunk/ompi/mca/btl/openib/btl_openib.c | 49 ++++++++++++++++++++++++++++++++++++++-
>> trunk/ompi/mca/btl/openib/btl_openib.h | 3 ++
>> trunk/ompi/mca/btl/openib/btl_openib_endpoint.c | 7 +++--
>> trunk/ompi/mca/btl/openib/btl_openib_frag.c | 7 +++++
>> trunk/ompi/mca/btl/portals/btl_portals.c | 8 ++++-
>> trunk/ompi/mca/btl/portals/btl_portals.h | 3 ++
>> trunk/ompi/mca/btl/self/btl_self.c | 3 ++
>> trunk/ompi/mca/btl/self/btl_self.h | 3 ++
>> trunk/ompi/mca/btl/sm/btl_sm.c | 2 +
>> trunk/ompi/mca/btl/sm/btl_sm.h | 2 +
>> trunk/ompi/mca/btl/tcp/btl_tcp.c | 6 ++++
>> trunk/ompi/mca/btl/tcp/btl_tcp.h | 3 ++
>> trunk/ompi/mca/btl/template/btl_template.c | 8 ++++-
>> trunk/ompi/mca/btl/template/btl_template.h | 3 ++
>> trunk/ompi/mca/btl/template/btl_template_component.c | 10 ++++---
>> trunk/ompi/mca/btl/udapl/btl_udapl.c | 11 ++++++--
>> trunk/ompi/mca/btl/udapl/btl_udapl.h | 3 ++
>> trunk/ompi/mca/btl/udapl/btl_udapl_component.c | 17 ++++++++----
>> trunk/ompi/mca/osc/rdma/osc_rdma_data_move.c | 3 ++
>> trunk/ompi/mca/pml/dr/pml_dr.h | 6 ++--
>> trunk/ompi/mca/pml/dr/pml_dr_sendreq.c | 12 ++++++++-
>> trunk/ompi/mca/pml/dr/pml_dr_sendreq.h | 3 +
>> trunk/ompi/mca/pml/ob1/pml_ob1.c | 17 ++++++++-----
>> trunk/ompi/mca/pml/ob1/pml_ob1.h | 44 +++++------------------------------
>> trunk/ompi/mca/pml/ob1/pml_ob1_recvreq.c | 14 +++++++----
>> trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c | 28 ++++++++++++++++------
>> 32 files changed, 241 insertions(+), 95 deletions(-)
>>
>>
>> Diff not shown due to size (53504 bytes).
>> To see the diff, run the following command:
>>
>> svn diff -r 14767:14768 --no-diff-deleted