I found several reports on the Open MPI users mailing list from users
who need to raise btl_openib_ib_timeout above its default value.
We also have some applications on our cluster that fail
unless we raise this value from the default of 10 to 15:
[[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
wr_id 250450816 opcode 11048 qp_idx 3
This is seen with Open MPI 1.3 and OpenFabrics 1.4.
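
To put the two values in perspective: as far as I understand it, the
parameter is an exponent rather than a time, and the InfiniBand ACK
timeout works out to 4.096 microseconds * 2^btl_openib_ib_timeout.
Below is a minimal Python sketch of that arithmetic, assuming the formula
holds for our stack. The helper name ack_timeout_seconds is my own and is
not part of Open MPI.

```python
# Sketch: convert btl_openib_ib_timeout (an exponent) into a time.
# Assumption: the ACK timeout is 4.096 microseconds * 2^value.
# ack_timeout_seconds is a hypothetical helper, not an Open MPI API.

BASE_SECONDS = 4.096e-6  # 4.096 microseconds, expressed in seconds


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the per-attempt ACK timeout in seconds for a given exponent."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must not be negative")
    return BASE_SECONDS * (2 ** ib_timeout)


if __name__ == "__main__":
    # Compare the default mentioned above (10) with the value we use (15).
    for value in (10, 15):
        millis = ack_timeout_seconds(value) * 1e3
        print(f"btl_openib_ib_timeout={value}: {millis:.3f} ms")
```

If the formula is right, going from 10 to 15 raises the per-attempt
timeout from roughly 4.2 ms to roughly 134 ms, a factor of 32. We pass the
value at launch time with: mpirun --mca btl_openib_ib_timeout 15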
Is this normal or is it an indicator of other problems, maybe related to
Are there other parameters that need to be looked at too?
Thanks for any insight on this!