I have recently seen some OpenIB timeout errors and see the
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:
I'd like to confirm that, when those messages say "defaulted to",
they are telling me what's happening on the node in question and
not just what the default is.
The reason for asking is that I believe I am setting the value of
btl_openib_ib_timeout to 20, globally, as suggested in parts of the
OpenMPI docs, but those messages, if they do report what's happening,
might be telling me otherwise.
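
For what it's worth, my understanding of the Open MPI help text is that
the timeout is calculated as 4.096 microseconds * (2^btl_openib_ib_timeout).
Below is a minimal Python sketch of that arithmetic, assuming that formula
is the one in use; the helper name ack_timeout_seconds is my own and is
not part of Open MPI.

```python
# Base unit of the InfiniBand local ACK timeout, in seconds (4.096 microseconds).
BASE_SECONDS = 4.096e-6


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Hypothetical helper: 4.096 microseconds * 2^ib_timeout, in seconds."""
    return BASE_SECONDS * (2 ** ib_timeout)


# Compare the reported default (10) with the value I believe I am setting (20).
for value in (10, 20):
    print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.6f} s")
```

If that formula holds, a setting of 20 would give roughly 4.3 seconds per
attempt, against roughly 4.2 milliseconds for 10, so the difference ought
to be noticeable if my setting were taking effect.
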
In case it is relevant, the OpenMPI in question is the bog standard
Kevin M. Buckley Room: CO327
School of Engineering and Phone: +64 4 463 5971
Victoria University of Wellington