Hi all,
We've been trying to track down an IB issue here where a
user's code (Gromacs, run with OMPI 1.4.5) was dying with:
[[18115,1],2][btl_openib_component.c:3224:handle_wc] from bruce030 to: bruce130 error polling LP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 7406080 opcode 0 vendor error 129 qp_idx 2
The odd thing I've spotted, though, is that the error message says:
* btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum
value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10).
Those don't match the values compiled into OMPI 1.4.5:
ompi_info -a | egrep 'btl_openib_ib_min_rnr_timer|btl_openib_ib_timeout'
MCA btl: parameter "btl_openib_ib_min_rnr_timer" (current value: "25",
data source: default value)
MCA btl: parameter "btl_openib_ib_timeout" (current value: "20", data
source: default value)
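To give a feel for how much the two timeout values differ, here is a rough sketch. It assumes the usual InfiniBand formula for the local ACK timeout, 4.096 microseconds * 2^btl_openib_ib_timeout; the function name is just made up for illustration and is not part of OMPI.

```python
# Sketch only: assumes the usual InfiniBand local ACK timeout formula,
# 4.096 microseconds * 2^btl_openib_ib_timeout.
# ack_timeout_seconds is a hypothetical helper, not an Open MPI API.

def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the local ACK timeout in seconds for a given exponent."""
    return 4.096e-6 * (2 ** ib_timeout)


# Compare the value quoted in the help text (10) with the one
# reported by ompi_info (20).
for value in (10, 20):
    print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.6f} s")
```

Under that assumption, 10 works out to roughly 4.2 ms per attempt and 20 to roughly 4.3 s, so the number quoted in the help text matters quite a bit when judging how long a link was unresponsive.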
It looks like the file:
ompi/mca/btl/openib/help-mpi-btl-openib.txt
needs to be updated with the correct values.
We're stuck on 1.4 for the foreseeable future (too many apps to
recompile) so I don't know if 1.5+ has the same issue.
It's been there since at least 2009.. :-)
http://www.open-mpi.org/community/lists/users/2009/03/8600.php
cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/