The osu_bibw micro-benchmark from Ohio State's OMB 3.1 suite hangs when
run over OpenMPI 1.2.5 from OFED 1.3 using the OpenIB BTL if there is
insufficient lockable memory. 128MB of lockable memory gives a hang
when the test gets to 4MB messages, while 512MB is sufficient for it
to pass. I observed this with InfiniPath and Mellanox adapter cards,
and saw the same behavior with 1.2.6. I know the general advice
is to use an unlimited or very large setting (per the FAQ), but there
are reasons for clusters to set finite user limits.

For each message size in the loop, osu_bibw posts 64 non-blocking
sends and then 64 non-blocking receives on both ranks, then waits
for them all to complete. 64 is the default value for the window
size (the number of concurrent messages). For 4MB messages that is
256MB queued for sending, which by itself exhausts the 128MB of
lockable memory on these systems. The OpenIB BTL does ibv_reg_mr
for as many of the sends as it can and queues the rest on a pending
list. The ibv_reg_mr calls for all the posted receives then fail as
well due to the ulimit check, so all of them land on the pending
list too. This means that neither rank actually gets to do an
ibv_post_recv, neither side can make progress, and the benchmark
hangs without completing a single 4MB message! This contrasts with
the uni-directional osu_bw, where one side does sends and the other
does receives, so progress can be made.

This is admittedly a hard problem to solve in the general case.
It is unfortunate that this leads to a hang, rather than a
message advising the user to check ulimits. Perhaps a warning
should be printed the first time the ulimit is exceeded, to
alert the user to the problem. One solution would be to divide
the ulimit up into separate limits for sending and receiving,
so that excessive sending does not block all receiving. This
would require OpenMPI to keep track of the ulimit usage
separately for send and receive.

In this particular synthetic benchmark there turns out to be
a straightforward workaround. The benchmark actually sends
from the same buffer 64 times over, and receives into another
buffer 64 times over (all posted concurrently). Thus there are
really only two 4MB buffers at play here, though the kernel IB
code charges the user separately for all 64 registrations of
each, even though the user already has those pages locked. In fact,
the Linux implementation of mlock (over)charges in the same way, so
I guess that choice is intentional and that the additional
complexity of spotting the duplicated locked pages wasn't considered
worth it.

This leads to the workaround of using --mca mpi_leave_pinned 1.
This turns on the code in the OpenIB BTL that caches registration
descriptors, so that there is only one ibv_reg_mr for the send buffer
and one ibv_reg_mr for the receive buffer; all the other registrations
hit the descriptor cache. This saves the day and the benchmark runs
without problem.

If this were the default option it might save users much
consternation. Note that this workaround does not actually need the
descriptors to stay pinned after the send/recv complete; all that is
needed is the caching while they are posted. So one could default to
having the descriptor caching mechanism enabled even when
mpi_leave_pinned is off. Also note that this is still a workaround
that happens to be sufficient for the osu_bibw case but isn't a
general panacea. osu_bibw and osu_bw are "broken" anyway in that
it is illegal to post multiple concurrent receives in the same
receive buffer. I believe this is done to minimize CPU cache
effects and maximize measured bandwidth. Anyway, having multiple
posted sends from the same send buffer is reasonable (e.g. a broadcast)
so caching those descriptors and reducing lockable memory usage
seems like a good idea to me. Although osu_bibw is very synthetic
it is conceivable that other real codes with large messages could
see these hangs (e.g. just an MPI_Sendrecv of a message larger than ulimit -l?).