Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW
From: Gleb Natapov (glebn_at_[hidden])
Date: 2008-03-09 16:39:51

On Sun, Mar 09, 2008 at 02:48:09PM -0500, Jon Mason wrote:
> Issue (as described by Steve Wise):
> Currently OMPI uses qp 0 for all credit updates (by design). This breaks
> when running over the Chelsio RNIC due to a race condition between
> advertising the availability of a buffer using qp0 when the buffer was
> posted on one of the other qps. It is possible (and easily reproducible)
> that the peer gets the advertisement and sends data into the qp in question
> _before_ the rnic has processed the recv buffer and made it available for
> placement. This results in a connection termination. BTW, other HCAs
> have this issue too. ehca, for example, is reported to have the same race
> condition. I think the timing hole is much smaller though for devices that
> have 2 separate work queues for the SQ and RQ of a QP. Chelsio has a
> single work queue to implement both SQ and RQ, so processing of RQ work
> requests gets queued up behind pending SQ entries which can make this race
> condition more prevalent.
There was a discussion about this on the OpenFabrics mailing list, and the
conclusion was that what Open MPI does is correct according to the IB/iWARP
specs.

> I don't know of any way to avoid this issue other than to ensure that all
> credit updates for qp X are posted only on qp X. If we do this, then the
> Chelsio HW/FW ensures that the RECV is posted before the subsequent send
> operation that advertises the buffer is processed.
Is it possible to fix your FW to follow the iWARP spec? Perhaps
ibv_post_recv() could be implemented so that it does not return before
the posted receive has been processed?
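
For illustration only, the ordering argument above can be sketched with a toy
model in C. It is not Open MPI or libibverbs code: the names toy_qp, toy_post
and toy_credit_races_recv are hypothetical, and the model assumes each QP
drains a single FIFO work queue one entry per tick, which is roughly how the
Chelsio design is described in the quoted text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one FIFO work queue per QP, one entry processed per tick. */
enum wr_kind { WR_SEND, WR_RECV, WR_CREDIT };

#define MAX_WR 16

struct toy_qp {
    enum wr_kind wr[MAX_WR];
    size_t len;
};

/* Append a work request to the QP's queue; fails if the queue is full. */
static bool toy_post(struct toy_qp *qp, enum wr_kind kind)
{
    if (qp->len == MAX_WR)
        return false;
    qp->wr[qp->len++] = kind;
    return true;
}

/* Tick at which the first request of the given kind is processed, or -1. */
static int toy_tick_of(const struct toy_qp *qp, enum wr_kind kind)
{
    for (size_t i = 0; i < qp->len; i++)
        if (qp->wr[i] == kind)
            return (int)i;
    return -1;
}

/*
 * qp1 already has pending sends, then a receive buffer is posted on it.
 * The credit update advertising that buffer goes out either on qp0 (empty
 * queue) or on qp1 itself. Returns true if the credit update can be
 * processed no later than the receive, i.e. the race window is open.
 */
static bool toy_credit_races_recv(int pending_sends, bool credit_on_same_qp)
{
    struct toy_qp qp0 = {0}, qp1 = {0};
    struct toy_qp *credit_qp = credit_on_same_qp ? &qp1 : &qp0;

    /* Clamp so that the receive and the credit update always fit. */
    if (pending_sends < 0)
        pending_sends = 0;
    if (pending_sends > MAX_WR - 2)
        pending_sends = MAX_WR - 2;

    for (int i = 0; i < pending_sends; i++)
        toy_post(&qp1, WR_SEND);
    toy_post(&qp1, WR_RECV);
    toy_post(credit_qp, WR_CREDIT);

    return toy_tick_of(credit_qp, WR_CREDIT) <= toy_tick_of(&qp1, WR_RECV);
}
```

In this model toy_credit_races_recv(3, false) reports an open race window,
while toy_credit_races_recv(3, true) does not, because the credit update is
queued behind the receive on the same QP. Real hardware scheduling is of
course more involved than a per-tick FIFO.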

> To address this Jeff Squyres recommends:
> 1. make an mca parameter that governs this behavior (i.e., whether to send
> all flow control messages on QP0 or on their respective QPs)
> 2. extend the ini file parsing code to accept this parameter as well (need
> to add a strcmp or two)
> 3. extend the ini file to fill in this value for all the NICs listed (to
> include yours).
> 4. extend the logic in the rest of the btl to send the flow control
> messages either across qp0 or the respective qp, depending on the value of
> the mca param / ini value.
> I am happy to do the work to enable this, but I would like to get
> everyone's feedback before I start down this path. Jeff said Gleb did
> the work to change openib to behave this way, so any insight would be
> helpful.
I personally don't like the idea of adding another layer of complexity to
the openib BTL code just to work around HW that doesn't follow the spec. If
the workaround were simple, that would be OK, but in this case it is not so
simple, and it will add a code path that is rarely tested. A simpler
workaround may be to not configure multiple QPs if the HW has this bug (and
we can extend the ini file to contain this info).
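
A minimal sketch of what such an ini-driven workaround might look like, in C.
Everything here is an assumption made for illustration: the key name
"single_qp_only", the struct toy_device_quirks and both functions are
hypothetical and do not correspond to existing openib BTL code or to any
existing ini file field.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-device quirks filled in from the ini file. */
struct toy_device_quirks {
    bool single_qp_only;   /* device cannot safely use multiple QPs */
};

/*
 * Handle one key/value pair from a device section of the ini file.
 * Returns true if the key was recognized. The key name is made up.
 */
static bool toy_parse_ini_field(struct toy_device_quirks *quirks,
                                const char *key, const char *value)
{
    if (strcmp(key, "single_qp_only") == 0) {
        quirks->single_qp_only = (strcmp(value, "1") == 0);
        return true;
    }
    return false;
}

/*
 * Number of QPs to actually configure per peer: collapse to one QP when
 * the device is flagged, so credit updates and data share the same QP.
 */
static int toy_effective_num_qps(const struct toy_device_quirks *quirks,
                                 int requested)
{
    if (requested < 1)
        return 1;
    return quirks->single_qp_only ? 1 : requested;
}
```

With a device flagged via "single_qp_only = 1", a request for 4 QPs would be
reduced to 1, and the existing credit-update path would stay unchanged.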