Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW
From: Steve Wise (swise_at_[hidden])
Date: 2008-03-10 10:57:55

Jeff Squyres wrote:
> On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:
>> 1. There was a discussion about this on openfabrics mailing list and the
>> conclusion was that what Open MPI does is correct according to IB/iWarp
>> spec.
>> 2. Is it possible to fix your FW to follow iWarp spec? Perhaps it is
>> possible to implement ibv_post_recv() so that it will not return before
>> post receive is processed?
>> 3. I personally don't like the idea to add another layer of complexity to
>> openib BTL code just to work around HW that doesn't follow spec. If work
>> around is simple that is OK, but in this case it is not so simple and will
>> add code path that is rarely tested. A simple workaround for the problem
>> may be to not configure multiple QPs if HW has a bug (and we can extend
>> ini file to contain this info).
> These are all valid points.
> In thinking about Gleb's proposal a bit more (extend the INI file
> syntax to accept per-HCA receive_queues values), it might be only
> somewhat less efficient (and a lot less code) than sending all flow
> control messages on the respective qp's anyway. So let's explore the
> math...
> The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)
> was invented to get better registered memory utilization. Pushing all
> the FC messages down to the QP with the smallest buffer size was a
> desirable side-effect that made registered memory utilization even
> better (because short FC messages were naturally on the QP with the
> smallest buffer size). Specifically, today in openib/IB (SVN trunk),
> here's the default queue layout:
> pp: 256 buffers of size 128
> srq: 256 buffers of size 4k
> srq: 256 buffers of size 12k (eager limit)
> srq: 256 buffers of size 64k (max send size)
> And then we add 4 more buffers on the pp qp for flow control messages
> (since we only currently send FC messages for pp qp's). Total
> registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +
> 256*12k + 256*64k = ~20M. This is somewhat deceiving, because the
> total registered memory scales slowly with the number of procs in the
> job (e.g., with 2 remote peers, it only increases by 33k because we're
> using srq's).
> With Gleb's proposal, you'd have only one pp qp, presumably 64k (or
> whatever the max send size is):
> pp: 256 buffers of size 64k (max send size)
> And then add 4 more for flow control messages. So total registered
> memory for a job with 1 remote peer: (256+4)*64k = ~17M. But that
> figure is approximately a per-peer cost -- so a job with 2 remote
> peers would use ~34M of registered memory, etc. This will [obviously]
> scale extremely poorly (and is one of the reasons that BSRQ exists).
> However, I wonder if there's a compromise (assuming you can't fix
> ibv_post_recv() to not return until the buffers are actually
> available, which, I agree with Gleb, seems like the best fix). Since
> we only use FC messages on pp qp's, why not make a "you can only have
> 1 pp qp and it must be qp 0" restriction for the Chelsio RNIC? This
> fits nicely into our default receive_queues value, anyway. That way,
> all FC messages will naturally go over qp 0 anyway (since that will be
> the only pp qp). Then, the only problem you have to solve is sending
> the *initial* credits message at wireup time (to know when the receive
> buffers have actually been posted to the srq's). Perhaps something
> like this:
> 1. you can export an attribute from the RNIC that advertises that
> ibv_post_recv() works this way (so that OMPI can detect it at run time)
> 2. hide the extra wireup / initial credit coordination in the rdma cm cpc
> when this attribute is detected (or make an mca param / ini file param
> that specifically requests this extra rdma cm cpc behavior, or not).
> What would make this proposal moot is if the Chelsio RNIC can't do
> SRQs (I don't remember offhand). If it can't (and you can't fix
> ibv_post_recv()), then you might as well do Gleb's "just use one qp"
> proposal. You'll get lousy registered memory utilization, but the
> bigger problem you'll have is the scalability issues for large-peer-
> count jobs (e.g., using the values above, 17M of registered memory per
> peer; I assume you'll have to tune that down via .ini file params).
> What about that?
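
The registered-memory arithmetic in the quoted message can be reproduced with a short sketch. It assumes that "k" means 1024 bytes, that per-peer (pp) buffers are allocated once per remote peer while srq buffers are shared across all peers, and that 4 extra flow-control buffers are added to each pp qp. The names (`registered_bytes`, `BSRQ_LAYOUT`, `SINGLE_PP_LAYOUT`) are illustrative only and are not part of Open MPI.

```python
KIB = 1024
FC_BUFFERS = 4  # extra pp buffers reserved for flow control messages

# (kind, buffer count, buffer size in bytes); "pp" = per-peer, "srq" = shared
BSRQ_LAYOUT = [
    ("pp", 256, 128),
    ("srq", 256, 4 * KIB),
    ("srq", 256, 12 * KIB),
    ("srq", 256, 64 * KIB),
]
SINGLE_PP_LAYOUT = [("pp", 256, 64 * KIB)]


def registered_bytes(layout, remote_peers):
    """Total registered receive memory for a layout and a number of remote peers."""
    total = 0
    for kind, count, size in layout:
        if kind == "pp":
            # Per-peer queues (plus flow-control buffers) are replicated per peer.
            total += remote_peers * (count + FC_BUFFERS) * size
        else:
            # Shared receive queues are allocated once, regardless of peer count.
            total += count * size
    return total


for peers in (1, 2, 64):
    bsrq = registered_bytes(BSRQ_LAYOUT, peers)
    single = registered_bytes(SINGLE_PP_LAYOUT, peers)
    print(f"{peers:3d} peers: BSRQ {bsrq:>13,d} bytes, single pp {single:>13,d} bytes")
```

Under these assumptions, one remote peer costs 21,004,800 bytes with the BSRQ layout and 17,039,360 bytes with the single pp qp. Each additional peer adds 33,280 bytes in the first case and another 17,039,360 bytes in the second.
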
This generation of the Chelsio RNIC doesn't support SRQs.

I don't think we can fix ibv_post_recv() to behave the way we want.

A single PP QP might be fine for now, and Chelsio's next-gen part will
support SRQs and not have this funky issue.
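
The "single PP QP, no SRQs" restriction for this generation of the RNIC could be expressed as a small sanity check on a queue layout. This is only a sketch: the tuple-based layout representation and the function name `check_single_pp_layout` are hypothetical and do not reflect how Open MPI actually parses its receive_queues setting.

```python
def check_single_pp_layout(layout):
    """Return a list of problems for a device that has no SRQ support.

    `layout` is a list of (kind, buffer_count, buffer_size) tuples, where
    kind is "pp" (per-peer) or "srq" (shared receive queue).
    """
    if not layout:
        return ["layout is empty"]
    problems = []
    if layout[0][0] != "pp":
        problems.append("qp 0 must be a per-peer (pp) qp")
    pp_count = sum(1 for kind, _, _ in layout if kind == "pp")
    if pp_count != 1:
        problems.append(f"expected exactly 1 pp qp, found {pp_count}")
    if any(kind == "srq" for kind, _, _ in layout):
        problems.append("device does not support SRQs")
    return problems


# A single 64k pp qp passes; a layout that mixes in an srq does not.
print(check_single_pp_layout([("pp", 256, 64 * 1024)]))
print(check_single_pp_layout([("pp", 256, 128), ("srq", 256, 4 * 1024)]))
```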

But why use such a large buffer size for a single PP QP? Why not use
something around 16KB?
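
To put rough numbers on that question, here is a minimal sketch. It assumes the same 256 buffers plus 4 flow-control buffers per peer as in the quoted message, and 1 KB = 1024 bytes. The helper name `per_peer_bytes` is illustrative.

```python
KIB = 1024


def per_peer_bytes(buffer_size, buffers=256, fc_buffers=4):
    """Registered receive memory per remote peer for one PP QP."""
    return (buffers + fc_buffers) * buffer_size


for size_kib in (16, 64):
    total = per_peer_bytes(size_kib * KIB)
    print(f"{size_kib} KB buffers: {total:,d} bytes per peer")
```

With these assumptions, 16 KB buffers need 4,259,840 bytes per peer, one quarter of the 17,039,360 bytes needed with 64 KB buffers.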