Open MPI Development Mailing List Archives

Subject: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW
From: Jon Mason (jon_at_[hidden])
Date: 2008-03-09 15:48:09

After discussing this issue with Jeff via private e-mails, I would like
to open the issue to the group for further discussion.

Issue (as described by Steve Wise):

Currently OMPI uses QP 0 for all credit updates (by design). This breaks
when running over the Chelsio RNIC because advertising a buffer on QP 0
races with the posting of that buffer on one of the other QPs. It is
possible (and easily reproducible) that the peer gets the advertisement
and sends data into the QP in question _before_ the RNIC has processed the
recv buffer and made it available for placement. This results in a
connection termination. BTW, other HCAs have this issue too. ehca, for
example, claims to have the same race condition. I think the timing hole
is much smaller, though, for devices that have 2 separate work queues for
the SQ and RQ of a QP. Chelsio has a single work queue to implement both
SQ and RQ, so processing of RQ work requests gets queued up behind pending
SQ entries, which can make this race condition more prevalent.

I don't know of any way to avoid this issue other than to ensure that all
credit updates for QP X are posted only on QP X. If we do this, then the
Chelsio HW/FW ensures that the RECV is posted before the subsequent send
operation that advertises the buffer is processed.
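
To make the ordering argument concrete, here is a toy model in C. It is
only a sketch under stated assumptions: each QP is modelled as a single
FIFO shared by RECV and SEND work requests (as described above for
Chelsio), and every name in it (toy_qp, toy_post, toy_process_one) is
hypothetical. It uses no verbs calls and is not OMPI or driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_QPS     4
#define TOY_QUEUE_DEPTH 8

enum toy_wr_kind { TOY_RECV, TOY_SEND_CREDIT };

struct toy_wr {
    enum toy_wr_kind kind;
    int credit_for_qp;  /* TOY_SEND_CREDIT only: QP whose buffer is advertised */
};

/* Assumption: one FIFO per QP holds both RECV and SEND work requests. */
struct toy_qp {
    struct toy_wr wq[TOY_QUEUE_DEPTH];
    size_t head;        /* next entry the device will process */
    size_t tail;        /* next free slot */
    int recvs_ready;    /* RECV buffers the device has made available */
};

/* Append a work request to the QP's single work queue. */
static void toy_post(struct toy_qp *qp, enum toy_wr_kind kind, int credit_for_qp)
{
    assert(qp->tail < TOY_QUEUE_DEPTH);
    qp->wq[qp->tail].kind = kind;
    qp->wq[qp->tail].credit_for_qp = credit_for_qp;
    qp->tail++;
}

/*
 * Process the oldest entry on qps[idx]. Returns false only when a credit
 * is sent for a QP that has no RECV buffer ready yet, i.e. the race.
 */
static bool toy_process_one(struct toy_qp qps[TOY_NUM_QPS], int idx)
{
    struct toy_qp *qp = &qps[idx];

    if (qp->head == qp->tail)
        return true;                        /* nothing queued */

    struct toy_wr wr = qp->wq[qp->head++];
    if (wr.kind == TOY_RECV) {
        qp->recvs_ready++;                  /* buffer now available */
        return true;
    }
    if (qps[wr.credit_for_qp].recvs_ready == 0)
        return false;                       /* advertised too early */
    qps[wr.credit_for_qp].recvs_ready--;    /* buffer handed to the peer */
    return true;
}
```

In this model, posting a RECV on QP 2 and its credit on QP 0 lets the
device process QP 0 first, and toy_process_one() reports the race. Posting
both on QP 2 cannot fail, because the FIFO processes the RECV first.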

To address this, Jeff Squyres recommends:

1. make an mca parameter that governs this behavior (i.e., whether to send
all flow control messages on QP0 or on their respective QPs)

2. extend the ini file parsing code to accept this parameter as well (need
to add a strcmp or two)

3. extend the ini file to fill in this value for all the NICs listed (to
include yours).

4. extend the logic in the rest of the btl to send the flow control
messages either across qp0 or the respective qp, depending on the value of
the mca param / ini value.
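
As a rough illustration of steps 2 and 4, the selection logic might look
something like the sketch below. The struct, the function names and the
ini key "credits_on_owning_qp" are all made up for this example; they are
not existing openib BTL identifiers, and the real change would go through
the MCA parameter and ini parsing code already in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-device setting; not an actual Open MPI field name. */
struct fc_config {
    bool credits_on_owning_qp;  /* false: all credits on QP 0 (current behavior) */
};

/*
 * Step 2 (sketch): handle one "key = value" pair from the ini file.
 * The key name is an assumption. Returns true if the key was recognized.
 */
static bool fc_parse_ini_pair(struct fc_config *cfg,
                              const char *key, const char *value)
{
    if (strcmp(key, "credits_on_owning_qp") != 0)
        return false;                       /* some other ini key */
    cfg->credits_on_owning_qp = (strcmp(value, "1") == 0);
    return true;
}

/*
 * Step 4 (sketch): pick the QP on which to send the credit update for a
 * buffer that was posted on owning_qp.
 */
static int fc_credit_qp(const struct fc_config *cfg, int owning_qp, int num_qps)
{
    assert(owning_qp >= 0 && owning_qp < num_qps);
    return cfg->credits_on_owning_qp ? owning_qp : 0;
}
```

With the flag left at false, fc_credit_qp() always returns 0, so devices
that are not listed in the ini file would keep today's behavior.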

I am happy to do the work to enable this, but I would like to get
everyone's feedback before I start down this path. Jeff said Gleb did
the work to change openib to behave this way, so any insight would be