Gleb Natapov wrote:
> On Sun, Mar 09, 2008 at 02:48:09PM -0500, Jon Mason wrote:
>> Issue (as described by Steve Wise):
>> Currently OMPI uses QP0 for all credit updates (by design). This breaks
>> when running over the Chelsio RNIC due to a race condition between
>> advertising the availability of a buffer on QP0 and the posting of that
>> buffer on one of the other QPs. It is possible (and easily reproducible)
>> that the peer gets the advertisement and sends data into the QP in question
>> _before_ the RNIC has processed the recv buffer and made it available for
>> placement. This results in a connection termination. BTW, other HCAs
>> have this issue too; ehca, for example, reportedly has the same race
>> condition. I think the timing hole is much smaller, though, for devices
>> that have two separate work queues for the SQ and RQ of a QP. Chelsio has
>> a single work queue implementing both the SQ and RQ, so processing of RQ
>> work requests gets queued up behind pending SQ entries, which can make
>> this race condition more prevalent.
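The single-work-queue behavior described above can be sketched with a toy model (illustrative only; the class and method names here are invented and are not the real cxgb3 firmware or driver interfaces):

```python
# Toy model: a single FIFO work queue holds both send (SQ) and recv (RQ)
# work requests, so a RECV posted behind pending SENDs is not ready for
# placement until the hardware has consumed those SENDs.

class SingleWorkQueue:
    def __init__(self):
        self.wrs = []       # posted work requests, in order
        self.head = 0       # index of the next WR the HW will process

    def post(self, kind):
        """Post a WR ('send' or 'recv'); returns its queue index."""
        self.wrs.append(kind)
        return len(self.wrs) - 1

    def hw_tick(self):
        """HW processes exactly one WR per tick, strictly in order."""
        if self.head < len(self.wrs):
            self.head += 1

    def recv_ready(self, idx):
        """A recv buffer is available for placement only once consumed."""
        return self.head > idx

q = SingleWorkQueue()
q.post("send")                  # pending send work requests...
q.post("send")
r = q.post("recv")              # ...queue the recv behind them

# A credit update sent on QP0 can race ahead: the peer may transmit
# now, while the recv is still queued behind the two sends.
assert not q.recv_ready(r)
q.hw_tick(); q.hw_tick()        # both sends processed
assert not q.recv_ready(r)
q.hw_tick()                     # recv finally consumed
assert q.recv_ready(r)
```

With two separate queues, the recv would not wait behind the sends, which is why the hole is smaller on those devices.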
> There was a discussion about this on the OpenFabrics mailing list and the
> conclusion was that what Open MPI does is correct according to the IB/iWarp spec.
Hey Gleb. Yes, the conclusion was that the RDMA device and driver should
ensure this. But note that the ehca IB device also has this same race
condition, so I wonder whether the other IB devices really have it as
well. I think it is worse for the cxgb3 device due to its architecture
(a single queue for both send and recv work requests).
>> I don't know of any way to avoid this issue other than to ensure that all
>> credit updates for QP X are posted only on QP X. If we do this, then the
>> Chelsio HW/FW ensures that the RECV is posted before the subsequent send
>> operation that advertises the buffer is processed.
> Is it possible to fix your FW to follow the iWarp spec? Perhaps it is
> possible to implement ibv_post_recv() so that it does not return before
> the posted receive has been processed?
I've been trying to come up with a solution in the lib/driver/FW to
enforce this behavior. The only way I can see to do it is to follow the
recv work requests with a 0B write work request, then spin or block
until the 0B write completes (note: a 0B write emits nothing on the wire
for the cxgb3 device). This would guarantee that the recvs are ready
before returning from the libcxgb3 post_recv function. However, this is
problematic because there can be real OMPI work completions in the CQ
that need processing, so I don't see how to do this down in the
library/driver. Also note that any such solution entirely drains the SQ
each time a recv is posted. This will kill performance.
(Just thinking out loud here.) The OMPI code could be designed to _not_
assume recvs are posted until the CPC indicates they are ready, i.e.,
sort of asynchronous behavior. When the recvs are ready, the CPC could
up-call the BTL and then the credits could be updated. This sounds
painful though :)
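That asynchronous idea can be sketched roughly as follows (all names here are invented for illustration; this is not the real OMPI/CPC API):

```python
# Sketch: the BTL posts recvs but withholds the credit update until the
# CPC up-calls to confirm the recvs are actually ready for placement.

class QPFlowControl:
    def __init__(self):
        self.posted = 0      # recvs handed to the provider
        self.ready = 0       # recvs the CPC has confirmed placed
        self.advertised = 0  # credits already sent to the peer

    def btl_post_recvs(self, n):
        """BTL posts n recvs; no credits are advertised yet."""
        self.posted += n

    def cpc_recvs_ready(self, n):
        """CPC up-call: n more recvs confirmed ready. Returns how many
        new credits are now safe to advertise to the peer."""
        self.ready = min(self.ready + n, self.posted)
        grant = self.ready - self.advertised
        self.advertised = self.ready
        return grant

qp = QPFlowControl()
qp.btl_post_recvs(4)
assert qp.cpc_recvs_ready(4) == 4   # advertise only after confirmation
assert qp.cpc_recvs_ready(0) == 0   # nothing new confirmed, no credits
```

The peer can never be told about a buffer the hardware has not yet placed, at the cost of the extra up-call plumbing lamented above.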
>> To address this Jeff Squyres recommends:
>> 1. make an mca parameter that governs this behavior (i.e., whether to send
>> all flow control messages on QP0 or on their respective QPs)
>> 2. extend the ini file parsing code to accept this parameter as well (need
>> to add a strcmp or two)
>> 3. extend the ini file to fill in this value for all the NICs listed
>> (including yours).
>> 4. extend the logic in the rest of the btl to send the flow control
>> messages either across qp0 or the respective qp, depending on the value of
>> the mca param / ini value.
>> I am happy to do the work to enable this, but I would like to get
>> everyone's feedback before I start down this path. Jeff said Gleb did
>> the work to change openib to behave this way, so any insight would be
>> appreciated.
> I personally don't like the idea of adding another layer of complexity to
> the openib BTL code just to work around HW that doesn't follow the spec.
> If the workaround is simple, that is OK, but in this case it is not so
> simple and will add a code path that is rarely tested. A simpler
> workaround for the problem may be to not configure multiple QPs when the
> HW has this bug (and we can extend the ini file to contain this info).
It doesn't sound too complex to implement the above design. In fact,
that's the way this BTL used to work, no? There are large customers
requesting OMPI over cxgb3, and we're ready to put in the effort to get
this done. So I'd like us to come to an agreement on how to support this
device efficiently (and in time for ompi-1.3).
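For concreteness, steps 2 and 4 of Jeff's proposal might look something like this (parameter name and function names are invented for illustration, not the actual openib BTL code):

```python
# Sketch: an ini/mca flag decides whether flow-control messages ride on
# QP0 (today's behavior) or on the QP whose buffers they advertise.

def parse_flow_ctl_flag(key, value):
    """Step 2: accept the new ini parameter (a strcmp or two)."""
    if key == "use_per_qp_flow_control":
        return value.strip() == "1"
    return False                     # default: legacy QP0 behavior

def credit_qp_index(data_qp, per_qp_flow_ctl):
    """Step 4: route the credit update. QP0 is the QP OMPI uses for all
    credit updates today; per-QP mode keeps each update on the QP it
    advertises buffers for, which is what the Chelsio RNIC needs."""
    return data_qp if per_qp_flow_ctl else 0

# Chelsio-like ini entry: credits follow their QP.
assert credit_qp_index(3, parse_flow_ctl_flag("use_per_qp_flow_control", "1")) == 3
# Default entry: everything still goes over QP0.
assert credit_qp_index(3, parse_flow_ctl_flag("use_per_qp_flow_control", "0")) == 0
```

Devices without the race keep the existing QP0 path untouched, which addresses Gleb's rarely-tested-code-path concern by making the new path opt-in per device.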
On the single-QP angle: can I just run OMPI specifying only one QP, or
will that require code changes?