I'm having a problem with the UD BTL and hoping someone might have some
input to help solve it.
What I'm seeing is hangs when running alltoall benchmarks with nbcbench
or an LLNL program called mpiBench -- both hang exactly the same way.
With the trunk code, running nbcbench on IU's odin on 32 nodes with a
command line like this:

    mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p 128-128

hangs consistently when testing 256-byte messages. There are two things
I can do to make the hang go away until I run at larger scale. The first
is to increase the 'btl_ofud_sd_num' MCA param from its default value of
128. This lets the job run with more procs/nodes before hitting the
hang, but AFAICT doesn't fix the actual problem. What this parameter
does is control the maximum number of outstanding send WQEs posted at
the IB level -- when the limit is reached, frags are queued on an
opal_list_t and later sent by progress as IB sends complete.
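
To make the queueing behavior described above concrete, here is a minimal
sketch in C. It is not the ofud BTL source: every name in it (SD_NUM,
send_state, post_send, progress) is hypothetical, a fixed-size ring stands
in for the opal_list_t, and SD_NUM is set to 4 rather than 128 only to keep
the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; this is NOT the real ofud BTL code. */
#define SD_NUM      4   /* plays the role of btl_ofud_sd_num */
#define MAX_PENDING 64  /* capacity of the illustrative pending queue */

struct send_state {
    int    wqe_free;             /* send WQE slots still available */
    int    pending[MAX_PENDING]; /* FIFO of queued frag ids */
    size_t head;                 /* index of the oldest queued frag */
    size_t count;                /* number of queued frags */
    int    posted;               /* total frags handed to the HCA so far */
};

void send_state_init(struct send_state *s)
{
    s->wqe_free = SD_NUM;
    s->head = 0;
    s->count = 0;
    s->posted = 0;
}

/* Returns 1 if the frag was posted at once, 0 if it was queued,
 * and -1 if the illustrative queue is full. */
int post_send(struct send_state *s, int frag_id)
{
    /* Post directly only if a slot is free and nothing is queued ahead. */
    if (s->wqe_free > 0 && s->count == 0) {
        s->wqe_free--;
        s->posted++;
        return 1;
    }
    if (s->count == MAX_PENDING)
        return -1;
    s->pending[(s->head + s->count) % MAX_PENDING] = frag_id;
    s->count++;
    return 0;
}

/* Models progress reaping `completions` send completions: each one frees
 * a WQE slot, and freed slots are used to drain the pending queue in
 * FIFO order. Returns the number of queued frags posted. */
int progress(struct send_state *s, int completions)
{
    int drained = 0;

    assert(completions >= 0 && s->wqe_free + completions <= SD_NUM);
    s->wqe_free += completions;
    while (s->wqe_free > 0 && s->count > 0) {
        s->head = (s->head + 1) % MAX_PENDING;
        s->count--;
        s->wqe_free--;
        s->posted++;
        drained++;
    }
    return drained;
}
```

In this model a queued frag can only leave the queue when progress reaps a
completion, which is why raising the limit delays the point at which frags
start queueing without changing what happens once they do.
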
The other way I've found is to play games with calling
mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send(). In
fact I replaced the CHECK_FRAG_QUEUES() macro used around
btl_ofud_endpoint.c:77 with a version that loops on progress until a
send WQE slot is available (as opposed to queueing). Same result -- I
can run at larger scale, but still hit the hang eventually.
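
The "loop on progress instead of queueing" variant can be sketched roughly
as follows. Again this is only an illustration under assumptions: fake_btl,
fake_progress and post_send_blocking are hypothetical names, not the real
macro or BTL functions, and the max_spins bound is something I added so the
sketch terminates; the variant described above loops until a slot is
available.

```c
#include <assert.h>

/* Hypothetical stand-ins; NOT the real ofud BTL types or functions. */
struct fake_btl {
    int wqe_free;   /* free send WQE slots */
    int in_flight;  /* sends posted but not yet completed */
};

/* Stand-in for component progress: reaps at most one send completion
 * per call and returns the number of completions reaped. */
int fake_progress(struct fake_btl *b)
{
    if (b->in_flight == 0)
        return 0;
    b->in_flight--;
    b->wqe_free++;
    return 1;
}

/* Spin on progress until a send WQE slot is free, then post.
 * Returns the number of progress calls that were needed, or -1 if
 * max_spins calls went by without a slot opening up (the bound is an
 * assumption added for this sketch). */
int post_send_blocking(struct fake_btl *b, int max_spins)
{
    int spins = 0;

    while (b->wqe_free == 0) {
        if (spins == max_spins)
            return -1;
        fake_progress(b);
        spins++;
    }
    b->wqe_free--;
    b->in_flight++;
    return spins;
}
```

The point of the sketch is that spinning only helps when a completion
eventually arrives. If nothing is left to complete, the loop has no way to
make forward progress on its own.
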
It appears that when the job hangs, progress is being polled very
quickly, and after spinning for a while there are no outstanding send
WQEs or queued sends in the BTL. I'm not sure where further up things
are spinning/blocking, as I can't reproduce the hang at fewer than 32
nodes / 128 procs and don't have a good way of debugging that
(suggestions welcome).
Furthermore, both ob1 and dr PMLs result in the same behavior, except
that DR eventually trips a watchdog timeout, fails the BTL, and
terminates the job.
Other collectives such as allreduce and allgather do not hang -- only
alltoall. I can also reproduce the hang on LLNL's Atlas machine.
Can anyone else reproduce this (Torsten might have to make a copy of
nbcbench available)? Anyone have any ideas as to what's wrong?