Please see below.
> When using XRC queues, Open MPI is indeed creating only one XRC queue
> per node (instead of per-process). The problem is that the number of send
> elements in this queue is multiplied by the number of processes on the
> remote host.
> So, what are we getting from this ? Not much, except that we can
> reduce the sd_max parameter to 1 element, and still have 8 elements in
> the send queue (on 8 cores machines), which may still be ok on the
> performance side.
Don't forget that the QP object itself consumes some memory. But I agree
that we need to provide more flexibility, and it would be nice if the
default multiplier were smaller. I also think we need to make it a
user-tunable parameter (yep, one more parameter).
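To make the arithmetic concrete, here is a minimal sketch of how the send queue depth would be computed, both with the current behavior (multiply by the number of processes on the remote host) and with a tunable factor. The function name and the `factor` argument are hypothetical, they are not existing Open MPI parameters, and the value 32 for sd_max is only an illustrative number.

```python
from typing import Optional


def send_queue_depth(sd_max: int,
                     procs_on_remote_host: int,
                     factor: Optional[int] = None) -> int:
    """Number of send elements in the single XRC send queue to one host.

    factor is a HYPOTHETICAL user-tunable multiplier. When it is None,
    the depth is sd_max times the number of processes on the remote
    host, which is the behavior described in this thread.
    """
    if sd_max < 1 or procs_on_remote_host < 1:
        raise ValueError("sd_max and procs_on_remote_host must be >= 1")
    if factor is not None and factor < 1:
        raise ValueError("factor must be >= 1")
    multiplier = procs_on_remote_host if factor is None else factor
    return sd_max * multiplier


# Current behavior on an 8-core remote host: sd_max=1 still gives 8 elements.
print(send_queue_depth(1, 8))             # 8
# Illustrative sd_max of 32: 256 elements without a tunable factor.
print(send_queue_depth(32, 8))            # 256
# The same sd_max with a hypothetical factor of 2: 64 elements.
print(send_queue_depth(32, 8, factor=2))  # 64
```

With a factor like this, the user could keep a reasonable sd_max and still bound the per-host send queue size independently of the core count.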
> Send queues are created lazily, so having a lot of memory for send
> queues is not necessarily blocking. What's blocking is the receive
> queues, because they are created during MPI_Init, so in a way, they
> are the "basic fare" of MPI.
BTW, SRQ resources are also allocated on demand. We start with a very
small SRQ and increase it each time an SRQ limit event fires.
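The on-demand growth can be pictured with a toy model like the one below. This is only a sketch under assumptions: the class name, the doubling policy, and the sizes are made up for illustration and are not taken from the openib BTL code. In libibverbs the corresponding asynchronous event is IBV_EVENT_SRQ_LIMIT_REACHED.

```python
from dataclasses import dataclass


@dataclass
class ToySrq:
    """Toy model of an SRQ that starts small and grows on limit events."""
    size: int      # receive buffers currently allocated
    max_size: int  # upper bound the queue is allowed to reach

    def on_limit_event(self) -> int:
        """Grow the queue when the limit event fires.

        Doubling is an ASSUMED policy, chosen only for illustration.
        Returns the number of buffers added by this event.
        """
        new_size = min(self.size * 2, self.max_size)
        added = new_size - self.size
        self.size = new_size
        return added


srq = ToySrq(size=4, max_size=32)
for _ in range(4):
    added = srq.on_limit_event()
    print(added, srq.size)
```

The point is that the memory cost at MPI_Init is the small initial size, and the queue only reaches its maximum if the traffic actually triggers the limit events.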
> The XRC protocol seems to create shared receive queues, which is a
> good thing. However, comparing memory used by an "X" queue versus an
> "S" queue, we can see a large difference. Digging a bit into the code,
> we found some
So, do you see that X consumes more than S? This is really odd.
> strange things, like the completion queue size not being the same as
> "S" queues (the patch below would fix it, but the root of the problem
> may be elsewhere).
> Is anyone able to comment on this ?
The fix looks ok, please submit it to trunk.
BTW, do you want to prepare a patch for the send queue size factor? It
should be quite simple.