On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:
> 1. There was a discussion about this on openfabrics mailing list and
> conclusion was that what Open MPI does is correct according to IB/
> 2. Is it possible to fix your FW to follow iWarp spec? Perhaps it is
> possible to implement ibv_post_recv() so that it will not return until
> the post receive is processed?
> 3. I personally don't like the idea to add another layer of
> complexity to openib
> BTL code just to work around HW that doesn't follow spec. If the
> workaround is simple that is OK, but in this case it is not so simple
> and will add a code path that is rarely tested. A simple workaround for
> the problem would be to not configure multiple QPs if HW has a bug (and
> we can extend the INI file to contain this info).
These are all valid points.
In thinking about Gleb's proposal a bit more (extend the INI file
syntax to accept per-HCA receive_queues values), it might be only
somewhat less efficient (and a lot less code) than sending all flow
control messages on the respective qp's anyway. So let's explore the
The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)
was invented to get better registered memory utilization. Pushing all
the FC messages down to the QP with the smallest buffer size was a
desirable side-effect that made registered memory utilization even
better (because short FC messages were naturally on the QP with the
smallest buffer size). Specifically, today in openib/IB (SVN trunk),
here's the default queue layout:
pp: 256 buffers of size 128
srq: 256 buffers of size 4k
srq: 256 buffers of size 12k (eager limit)
srq: 256 buffers of size 64k (max send size)
And then we add 4 more buffers on the pp qp for flow control messages
(since we only currently send FC messages for pp qp's). Total
registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +
256*12k + 256*64k = ~20M. This is somewhat deceiving, because the
total registered memory scales slowly with the number of procs in the
job (e.g., with 2 remote peers, it only increases by 33k because we're
With Gleb's proposal, you'd have only one pp qp, presumably 64k (or
whatever the max send size is):
pp: 256 buffers of size 64k (max send size)
And then add 4 more for flow control messages. So total registered
memory for a job with 1 remote peer: (256+4)*64k = ~17M. But that
figure is approximately a per-peer cost -- so a job with 2 remote
peers would use ~34M of registered memory, etc. This will [obviously]
scale extremely poorly (and is one of the reasons that BSRQ exists).
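
To make the scaling difference concrete, here is a rough Python sketch of the arithmetic above. It assumes k = 1024 bytes, that the 4 extra flow control buffers are the same size as the pp buffers, and that srq buffers are posted once and shared across all peers (which is what the "only increases by 33k" figure implies). The function and variable names are made up for illustration; none of this is Open MPI code.

```python
KIB = 1024  # assumption: "k" in the figures above means 1024 bytes


def registered_bytes(queues, fc_buffers, peers):
    """Estimate registered receive memory for a queue layout.

    queues is a list of (kind, count, size_in_bytes) with kind "pp" or "srq".
    """
    total = 0
    for kind, count, size in queues:
        if kind == "pp":
            # Per-peer queues, plus their flow control buffers, are
            # replicated once for every remote peer.
            total += peers * (count + fc_buffers) * size
        else:
            # Shared receive queues are posted once, whatever the peer count.
            total += count * size
    return total


# Default BSRQ layout from the message.
BSRQ = [
    ("pp", 256, 128),
    ("srq", 256, 4 * KIB),
    ("srq", 256, 12 * KIB),
    ("srq", 256, 64 * KIB),
]

# Single per-peer queue sized to the max send size.
SINGLE_PP = [("pp", 256, 64 * KIB)]

if __name__ == "__main__":
    for peers in (1, 2, 64):
        bsrq = registered_bytes(BSRQ, 4, peers)
        single = registered_bytes(SINGLE_PP, 4, peers)
        print(f"{peers:3d} peer(s): BSRQ {bsrq:>13,} bytes, single pp {single:>13,} bytes")
```

With 1 remote peer the two layouts come out at 21,004,800 and 17,039,360 bytes; going to 2 peers adds 33,280 bytes to the BSRQ total but doubles the single-pp total.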
However, I wonder if there's a compromise (assuming you can't fix
ibv_post_recv() to not return until the buffers are actually
available, which, I agree with Gleb, seems like the best fix). Since
we only use FC messages on pp qp's, why not make a "you can only have
1 pp qp and it must be qp 0" restriction for the Chelsio RNIC? This
fits nicely into our default receive_queues value, anyway. That way,
all FC messages will naturally go over qp 0 anyway (since that will be
the only pp qp). Then, the only problem you have to solve is sending
the *initial* credits message at wireup time (to know when the receive
buffers have actually been posted to the srq's). Perhaps something
like this:
1. you can export an attribute from the RNIC that advertises that
ibv_post_recv() works this way (so that OMPI can detect it at run time)
2. hide the extra wireup / initial credit coordination in the rdma cm cpc
when this attribute is detected (or make an mca param / ini file param
that explicitly enables or disables this extra rdma cm cpc behavior).
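
As a rough illustration of the "only 1 pp qp and it must be qp 0" restriction, the check could look something like the Python sketch below. It assumes a receive_queues string made of colon-separated queue specs whose first comma-separated field is "P" (per-peer) or "S" (shared); the numeric fields in the examples are placeholders, and the function name is hypothetical. The real check would live in the openib BTL's C code.

```python
def has_single_pp_at_qp0(receive_queues):
    """Return True if the layout has exactly one per-peer queue and it is qp 0.

    Assumed format: colon-separated specs, each starting with "P" or "S",
    e.g. "P,128,256:S,4096,256".
    """
    specs = [s for s in receive_queues.split(":") if s.strip()]
    kinds = [s.split(",")[0].strip().upper() for s in specs]
    pp_positions = [i for i, kind in enumerate(kinds) if kind == "P"]
    # Exactly one pp queue, and it has to be the first one listed.
    return pp_positions == [0]


if __name__ == "__main__":
    for value in (
        "P,128,256:S,4096,256:S,12288,256:S,65536,256",  # default-like layout
        "S,4096,256:P,128,256",                           # pp queue is not qp 0
        "P,128,256:P,4096,256",                           # more than one pp queue
    ):
        print(value, "->", has_single_pp_at_qp0(value))
```

A layout shaped like the default (one pp qp followed by srq's) passes, which is why the restriction costs nothing in the common case.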
What would make this proposal moot is if the Chelsio RNIC can't do
SRQs (I don't remember offhand). If it can't (and you can't fix
ibv_post_recv()), then you might as well do Gleb's "just use one qp"
proposal. You'll get lousy registered memory utilization, but the
bigger problem will be scalability for large-peer-count jobs (e.g.,
using the values above, 17M of registered memory per peer; I assume
you'll have to tune that down via .ini file params).
What about that?