On Sep 13, 2011, at 6:33 PM, Kevin.Buckley_at_[hidden] wrote:
> there have been two runs of jobs that invoked the mpirun using these
> OpenMPI parameter setting flags (basically, these mimic what I have
> in the global config file)
>
> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
>
> when both of the jobs failed, the error output was
>
> -----8<----------8<----------8<----------8<----------8<-----
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
> -----8<----------8<----------8<----------8<----------8<-----
>
> Note that the error output is still showing that mysterious "10"
> in there for the btl_openib_ib_timeout value.
The text of that message is hard-coded (and apparently out of date); it does not show the value currently in effect.
I agree that this is misleading. The error message needs to be improved.
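To see what the two values mean in practice, here is a rough sketch of the timeout arithmetic. It assumes the formula that Open MPI's openib help text gives after the point where the quoted output is cut off, namely 4.096 microseconds * (2^btl_openib_ib_timeout). That formula is not shown in this thread, and the function name is made up for illustration.

```python
# Sketch only: assumes the local ACK timeout is
#   4.096 microseconds * (2 ** btl_openib_ib_timeout)
# as described in Open MPI's openib help text (not quoted in this thread).

BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the assumed local ACK timeout, in seconds, for a given
    btl_openib_ib_timeout setting (hypothetical helper)."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be non-negative")
    return BASE_SECONDS * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the value printed in the hard-coded message,
    # 20 is the value passed on the mpirun command line.
    for value in (10, 20):
        print(f"btl_openib_ib_timeout={value}: "
              f"{ack_timeout_seconds(value):.6f} s")
```

Under that assumption, a setting of 10 corresponds to roughly 4.2 milliseconds per attempt and a setting of 20 to roughly 4.3 seconds, so the "10" in the message and the 20 actually requested differ by about a factor of 1000.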
> I have noticed that the same node is appearing in the error output
> each time, so I'll try taking that one out of the test PE that the
> jobs are executing in and seeing if I can tie it to hardware.
This might suggest a hardware issue; let us know what you find.
--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/