> How does the stack for the non-SM BTL run look? I assume it is probably the same. Also, can you dump the message queues for rank 1? What's interesting is that you have a bunch of pending receives. Did you expect that to be the case when the MPI_Gatherv occurred?
It turns out we had an unbalanced MPI_Bcast buried very deep in the application, that is, one that was not matched by a corresponding call on every rank. After fixing that bug, the application behaves correctly.
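For anyone who finds this thread later, here is a rough sketch of why a stray MPI_Bcast can surface as a hang in a later MPI_Gatherv. MPI requires every rank of a communicator to issue its collective calls in the same order, so one skipped or extra broadcast leaves the ranks paired up in different collectives. This is a toy model in plain Python, not MPI code and not our application: the `first_mismatch` helper and the `trace` data are hypothetical and only illustrate the ordering rule.

```python
def first_mismatch(calls_by_rank):
    """Find the first position at which ranks disagree on the collective call.

    calls_by_rank maps a rank number to the ordered list of collective
    calls that rank issues on one communicator (hypothetical trace data).
    Returns (position, {rank: call}) for the first disagreement, or None
    if every rank issues the same sequence.
    """
    depth = max(len(calls) for calls in calls_by_rank.values())
    for i in range(depth):
        # None marks a rank that has already run out of collective calls.
        at_i = {
            rank: (calls[i] if i < len(calls) else None)
            for rank, calls in calls_by_rank.items()
        }
        if len(set(at_i.values())) > 1:
            return i, at_i
    return None


# Hypothetical trace: rank 1 skips a broadcast that ranks 0 and 2 perform,
# so rank 1 reaches the gather while the others are still in the broadcast.
trace = {
    0: ["MPI_Bcast", "MPI_Bcast", "MPI_Gatherv"],
    1: ["MPI_Bcast", "MPI_Gatherv"],
    2: ["MPI_Bcast", "MPI_Bcast", "MPI_Gatherv"],
}

print(first_mismatch(trace))
```

In this model the ranks diverge at the second call: rank 1 is already in the gather while ranks 0 and 2 are in the broadcast. That kind of split makes the hang look like a problem in the gather rather than in the broadcast that actually caused it.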
Thank you all for the help, and sorry for the false alarm.