Galen and I were looking at the default value of
btl_openib_receive_queues today and noticed that the first P value
seems to be quite low (i.e., it's what Galen used to *force* flow
control while he was debugging the new protocol). We propose changing
it to:
With these numbers, if you have a fully-OFA-connected 512-process
MPI_COMM_WORLD, each process will use a little over 65MB of buffering
space.
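As a sanity check on figures like the one above, here's a rough Python sketch of how per-process buffering adds up: each per-peer queue is replicated for every connected peer, while a shared queue is allocated once. The queue values below are placeholders for illustration, not the proposed defaults.

```python
# Rough per-process buffer memory estimate for a set of receive queues.
# NOTE: the queue values used below are placeholders, not the proposed
# default values.

def per_process_buffering(num_peers, per_peer_queues, shared_queues):
    """per_peer_queues / shared_queues: lists of (size, num_buffers)."""
    # Each per-peer (P) queue is dedicated to one peer, so it is
    # replicated once per connected peer.
    per_peer = num_peers * sum(size * n for size, n in per_peer_queues)
    # A shared (S) receive queue is allocated once, regardless of the
    # number of peers.
    shared = sum(size * n for size, n in shared_queues)
    return per_peer + shared

# Placeholder example: 511 peers (a 512-process job), one small P queue
# and two S queues.
total = per_process_buffering(511,
                              per_peer_queues=[(128, 32)],
                              shared_queues=[(4096, 256), (65536, 256)])
print(total / 2**20, "MiB")  # roughly 19 MiB with these placeholder values
```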
The 128 byte buffers are so small that we can have a lot more of
them, but we want the repost value to be high enough (128) to allow
at least "1 wire full" of messages (plus some extra to account for
credit processing, etc.), and we want aggressive credit ACKing (16)
to possibly avoid stalling the sender on a lazy receiver (see the
full explanation of these parameters below).
For hardware that does not support SRQ (eHCA v1, iWARP, ...?), we
propose the following values:
For a fully-connected 512-process MPI_COMM_WORLD, this will consume
~385MB of buffering per MPI process. Of course, if you take off the
64k QP and simply make "long" messages shorter, you're down to 128MB
per process which is potentially a bit more manageable (note that we
currently do not have a way to automatically change PML values based
on the hardware found in the host). These values should probably be
discussed in detail by the vendors who do not support SRQ to decide
what they want.
It's still an open question as to what the mechanism should be to
determine which of these two strings should be used; it's likely to
be something like this:
1. if the user specifies a string (MCA param), use it
2. probe the HW at run time; if the hardware supports SRQ, use the
SRQ string (eHCA v1 and v2 support the attribute -- I don't know if
iWARP cards do...?)
3. use the non-SRQ string
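The three steps above could be sketched as follows (hypothetical function and parameter names; this is not actual Open MPI code):

```python
# Sketch of the proposed selection logic for the receive_queues string.
# Function and parameter names are hypothetical, for illustration only.

def choose_receive_queues(user_mca_param, hw_supports_srq,
                          srq_string, non_srq_string):
    # 1. An explicit MCA parameter from the user always wins.
    if user_mca_param is not None:
        return user_mca_param
    # 2. If run-time probing says the hardware supports SRQ, use the
    #    SRQ string.
    if hw_supports_srq:
        return srq_string
    # 3. Otherwise, fall back to the per-peer-only (non-SRQ) string.
    return non_srq_string
```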
I've attached a spreadsheet for those who are interested to help
explore different parameter value sets. The top block is the 1 per-
peer QP + 3 SRQ case; the bottom block is the 4 per-peer QP case.
Feel free to modify as you want; enjoy.
To explain all these numbers, here's a first cut at a writeup of what
they mean (think of this as preliminary FAQ fodder -- it's likely to
be modified a bit before it hits the FAQ). Remember that this is
OMPI trunk only (i.e., 1.3 series -- not 1.2 series).
btl_openib_receive_queues allows the specification of multiple receive
queues for OpenFabrics networks. Each queue is designated by its type
followed by a series of numeric parameters.
Queues can be one of two types:
- Per-peer (P), meaning that each queue is dedicated to receiving
messages from a single, specific peer MPI process. Buffers to
receive incoming messages from the peer are guaranteed through
explicit flow control by Open MPI (i.e., OpenFabrics network-level
retransmissions due to "receiver not ready" (RNR) errors will never
occur).
- Shared receive queue (S), meaning that a receive queue is shared
between all sending MPI processes. Buffers to receive incoming
messages from all peers are not necessarily guaranteed because no
flow control is possible when fewer than (num_peers*num_buffers_each)
buffers are available in the shared receive queue (which is
typically a goal of using SRQ: providing fewer than N*M buffers).
Shared receive queues can be faster than per-peer queues because of
the lack of explicit flow control traffic, but OpenFabrics
network-level retransmission errors can occur if multiple senders
combine to overflow the shared receive queue's available receive
buffers.
Per-peer queues are specified in the following form (bracketed
parameters are optional):

  P,<size>,<num_buffers>[,<low_watermark>[,<window_size>[,<reserved>]]]
- <size>: The size of receive buffers to be posted in this queue (in
bytes).
- <num_buffers>: The maximum number of buffers to post to this queue
for incoming MPI message fragments.
- <low_watermark>: An optional parameter specifying the number of
available buffers left on the queue before Open MPI will re-post
buffers up to <num_buffers>. Note that as a latency reduction
mechanism, Open MPI does not re-post a receive buffer as soon as it
becomes available (because it is expensive to do so). Instead, Open
MPI waits until several receive buffers become available again and
then posts them all at once. If not specified, <low_watermark>
defaults to <num_buffers>/2.
- <window_size>: An optional parameter specifying the number of ACKs
to accumulate before sending an explicit ACK control message back to
a peer. ACKs are typically piggybacked on outgoing messages to a
peer; they are grouped into explicit control messages only when
there are no other outgoing messages to a peer. If not specified,
<window_size> defaults to <low_watermark>/2.
- <reserved>: An optional parameter specifying the number of receive
buffers to post to the queue that are specifically used for incoming
ACK control messages (vs. incoming MPI messages). If unspecified,
<reserved> defaults to ((<num_buffers>*2)-1)/<window_size>. Note
that control messages use their own flow control (separate from the
flow control for MPI message fragments); explicit control messages
are always ACK'ed via piggyback data on other messages to ensure
that control messages will not trigger RNR errors.
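The default derivations described above can be sketched in Python (a sketch with integer truncation assumed; this is not Open MPI's actual implementation):

```python
# Defaults for the optional per-peer queue parameters, as described
# above.  Integer (truncating) division is assumed throughout.

def per_peer_defaults(num_buffers, low_watermark=None,
                      window_size=None, reserved=None):
    if low_watermark is None:
        low_watermark = num_buffers // 2
    if window_size is None:
        window_size = low_watermark // 2
    if reserved is None:
        reserved = ((num_buffers * 2) - 1) // window_size
    return low_watermark, window_size, reserved

# All defaults for a 128-buffer queue:
print(per_peer_defaults(128))        # (64, 32, 7)
# Explicit low watermark 4 and window 2, as in the example below:
print(per_peer_defaults(128, 4, 2))  # (4, 2, 127)
```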
For example, "P,128,128,4,2" specifies a per-peer receive queue that
initially posts 128 buffers, each of size 128 bytes. When there are 4
buffers left on the receive queue, Open MPI will re-post 124 buffers
to the queue, restoring it to having a total of 128 buffers available
for incoming messages. Explicit ACK control messages will be sent back
for every 2 incoming messages (if not already piggybacked on other
outgoing messages). 127 buffers are reserved for ACK control messages.
Shared queues are specified in the following form (bracketed
parameters are optional):

  S,<size>,<num_buffers>[,<low_watermark>[,<max_pending_sends>]]
- <size>: Same as for per-peer queues.
- <num_buffers>: Same as for per-peer queues.
- <low_watermark>: Same as for per-peer queues.
- <max_pending_sends>: An optional parameter that specifies the number
of outstanding sends that are allowed at a given time on the queue.
This provides a "good enough" mechanism of flow control for some
regular communication patterns. If not specified,
<max_pending_sends> defaults to <low_watermark>/4.
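The shared-queue defaults can be sketched the same way (again a sketch with integer truncation assumed, not Open MPI's actual code):

```python
# Defaults for the optional shared receive queue parameters, as
# described above.  Integer (truncating) division is assumed.

def shared_defaults(num_buffers, low_watermark=None,
                    max_pending_sends=None):
    if low_watermark is None:
        low_watermark = num_buffers // 2
    if max_pending_sends is None:
        max_pending_sends = low_watermark // 4
    return low_watermark, max_pending_sends

# All defaults for a 256-buffer shared queue:
print(shared_defaults(256))  # (128, 32)
```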
For example, "S,1024,256,128,32" specifies a shared receive queue
that posts 256 buffers, each of size 1024 bytes. When there are 128
buffers left on the receive queue, Open MPI will re-post 128 buffers
to the queue, restoring it to having a total of 256 buffers available
for incoming messages. A maximum of 32 non-locally-completed messages
are allowed to be pending to a peer at any given time.
Note that queues MUST be specified in ascending receive buffer size
order. This requirement may be removed prior to the 1.3 release.
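A quick sketch of that ordering check. Note that the separators used here (colons between queues, commas between fields) are an assumption for illustration; the writeup above does not spell out the delimiter syntax.

```python
# Check that the receive queues in a specification string appear in
# ascending buffer-size order.  ASSUMPTION: queues are separated by
# ":" and fields within a queue by ",".

def sizes_ascending(spec):
    # Field [1] of each queue is its <size> parameter.
    sizes = [int(queue.split(",")[1]) for queue in spec.split(":")]
    return all(a < b for a, b in zip(sizes, sizes[1:]))

print(sizes_ascending("P,128,128,4,2:S,1024,256,128,32"))  # True
print(sizes_ascending("S,1024,256:P,128,128"))             # False
```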