Jeff Squyres wrote:
> On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:
>> 1. There was a discussion about this on the openfabrics mailing list and
>> the conclusion was that what Open MPI does is correct according to the
>> IB/iWarp spec.
>> 2. Is it possible to fix your FW to follow the iWarp spec? Perhaps it is
>> possible to implement ibv_post_recv() so that it will not return until
>> the post receive is processed?
>> 3. I personally don't like the idea of adding another layer of complexity
>> to the openib BTL code just to work around HW that doesn't follow the
>> spec. If the workaround is simple that is OK, but in this case it is not
>> so simple and will add a code path that is rarely tested. A simple
>> workaround for the problem would be to not configure multiple QPs if the
>> HW has this bug (and we can extend the INI file to contain this info).
> These are all valid points.
> In thinking about Gleb's proposal a bit more (extend the INI file
> syntax to accept per-HCA receive_queues values), it might be only
> somewhat less efficient (and a lot less code) than sending all flow
> control messages on the respective qp's anyway. So let's explore the
> trade-offs.
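> For illustration, a per-HCA stanza might look something like the
> following sketch (the part ID and the receive_queues syntax here are
> my assumptions for illustration, not actual shipped values):
>
>     [Chelsio T3]
>     vendor_id = 0x1425            # Chelsio's PCI vendor ID
>     vendor_part_id = 0x0000       # placeholder for the affected parts
>     receive_queues = P,65536,256  # one pp qp: 256 buffers of 64k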
> The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)
> was invented to get better registered memory utilization. Pushing all
> the FC messages down to the QP with the smallest buffer size was a
> desirable side-effect that made registered memory utilization even
> better (because short FC messages were naturally on the QP with the
> smallest buffer size). Specifically, today in openib/IB (SVN trunk),
> here's the default queue layout:
> pp: 256 buffers of size 128
> srq: 256 buffers of size 4k
> srq: 256 buffers of size 12k (eager limit)
> srq: 256 buffers of size 64k (max send size)
> And then we add 4 more buffers on the pp qp for flow control messages
> (since we only currently send FC messages for pp qp's). Total
> registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +
> 256*12k + 256*64k = ~20M. This is somewhat deceiving, because the
> total registered memory scales slowly with the number of procs in the
> job (e.g., with 2 remote peers, it only increases by ~33k because we're
> using srq's).
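> (Spelling out that arithmetic:
>
>     (256+4) * 128   =     33,280   pp qp, incl. the 4 FC buffers
>     256 * 4096      =  1,048,576   srq
>     256 * 12288     =  3,145,728   srq
>     256 * 65536     = 16,777,216   srq
>     -----------------------------
>     total           = 21,004,800   ~= 20M
>
> Only the pp line is per-peer, hence the ~33k increment per peer.)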
> With Gleb's proposal, you'd only have one pp qp, presumably with 64k
> buffers (or whatever the max send size is):
> pp: 256 buffers of size 64k (max send size)
> And then add 4 more for flow control messages. So total registered
> memory for a job with 1 remote peer: (256+4)*64k = ~17M. But that
> figure is approximately a per-peer cost -- so a job with 2 remote
> peers would use ~34M of registered memory, etc. This will [obviously]
> scale extremely poorly (and is one of the reasons that BSRQ exists).
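> (Again, the arithmetic:
>
>     (256+4) * 65536 = 17,039,360  ~= 17M per remote peer
>
> so a job with N remote peers needs roughly N * 17M of registered
> memory for receive buffers alone.)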
> However, I wonder if there's a compromise (assuming you can't fix
> ibv_post_recv() to not return until the buffers are actually
> available, which, I agree with Gleb, seems like the best fix). Since
> we only use FC messages on pp qp's, why not make a "you can only have
> 1 pp qp and it must be qp 0" restriction for the Chelsio RNIC? This
> fits nicely into our default receive_queues value, anyway. That way,
> all FC messages will naturally go over qp 0 (since that will be the
> only pp qp). Then, the only problem you have to solve is sending
> the *initial* credits message at wireup time (to know when the receive
> buffers have actually been posted to the srq's). Perhaps something
> like this:
> 1. you can export an attribute from the RNIC that advertises that
> ibv_post_recv() works this way (so that OMPI can detect it at run
> time; see the sketch below)
> 2. hide the extra wireup / initial credit coordination in the rdma cm
> cpc when this attribute is detected (or make an mca param / ini file
> param that specifically requests this extra rdma cm cpc behavior).
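> To make point 1 concrete, here's a minimal sketch of what the run-time
> check could look like. Verbs has no standard attribute for this
> behavior, so the sketch keys off the device identity instead; the
> helper name and the part-ID matching are hypothetical:
>
>     /* Sketch: guess whether ibv_post_recv() may return before the
>      * receive buffers are actually visible to the HW. */
>     #include <infiniband/verbs.h>
>     #include <stdbool.h>
>
>     static bool post_recv_is_deferred(struct ibv_context *ctx)
>     {
>         struct ibv_device_attr attr;
>
>         if (ibv_query_device(ctx, &attr) != 0) {
>             return false;  /* on error, assume spec-compliant HW */
>         }
>         /* 0x1425 is Chelsio's PCI vendor ID; a real check would
>          * also match the affected T3 part IDs (placeholder). */
>         return attr.vendor_id == 0x1425;
>     }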
> What would make this proposal moot is if the Chelsio RNIC can't do
> SRQs (I don't remember offhand). If it can't (and you can't fix
> ibv_post_recv()), then you might as well do Gleb's "just use one qp"
> proposal. You'll get lousy registered memory utilization, but the
> bigger problem you'll have is the scalability issues for large-peer-
> count jobs (e.g., using the values above, 17M of registered memory per
> peer; I assume you'll have to tune that down via .ini file params).
> What about that?
This gen of the chelsio rnic doesn't support SRQs.
I don't think we can fix post_recv to behave like we want.
A single PP QP might be fine for now, and chelsio's next-gen part will
support SRQs and not have this funky issue.
But why use such a large buffer size for a single PP QP? Why not use
something around 16KB?
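
For what it's worth, rough numbers for that (same 260-buffer count as
above, my arithmetic):

    (256+4) * 16384 = 4,259,840  ~= 4M per remote peer

versus ~17M per peer at 64k, at the cost of fragmenting anything larger
than 16KB across multiple sends.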