Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-09-14 05:59:07


On Sep 13, 2011, at 6:33 PM, Kevin.Buckley_at_[hidden] wrote:

> there have been two runs of jobs that invoked the mpirun using these
> OpenMPI parameter setting flags (basically, these mimic what I have
> in the global config file)
>
> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
>
> when both of the jobs failed, the error output was
>
> -----8<----------8<----------8<----------8<----------8<-----
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
> -----8<----------8<----------8<----------8<----------8<-----
>
> Note that the error output is still showing that mysterious "10"
> in there for the btl_openib_ib_timeout value.

The text of that message is hard-coded (and apparently out of date); the "10" is not the value in use for your run.

I agree that that is misleading. This error message needs to be improved.
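
To put the two values in perspective, here is a minimal sketch of how the exponent translates into an actual timeout. It assumes the usual InfiniBand local ACK timeout formula, 4.096 microseconds * 2^btl_openib_ib_timeout, which is not spelled out in the truncated output quoted above. The function name and the accepted range of 1 to 31 are choices made for this illustration, not part of Open MPI.

```python
# Assumption: local ACK timeout = 4.096 microseconds * 2**exponent,
# where the exponent is the value given to btl_openib_ib_timeout.
BASE_TIMEOUT_US = 4.096


def local_ack_timeout_us(exponent: int) -> float:
    """Return the timeout in microseconds for a given exponent.

    Hypothetical helper for illustration only. The range check (1..31)
    is an assumption of this sketch, based on the exponent being a
    small integer field.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent expected in the range 1..31")
    return BASE_TIMEOUT_US * (2 ** exponent)


if __name__ == "__main__":
    # 10 is the value shown in the hard-coded message; 20 is the value
    # passed on the mpirun command line in the original report.
    for exponent in (10, 20):
        microseconds = local_ack_timeout_us(exponent)
        print(f"btl_openib_ib_timeout={exponent}: "
              f"{microseconds / 1e6:.6f} s")
```

Under that assumption, 10 corresponds to roughly 4.2 milliseconds and 20 to roughly 4.3 seconds, so the run was using a much longer timeout than the "10" in the message suggests.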

> I have noticed that the same node is appearing in the error output
> each time, so I'll try taking that one out of the test PE that the
> jobs are executing in and seeing if I can tie it to hardware.

This might suggest a hardware issue; let us know what you find.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/