Sorry for jumping in late; the holiday and other travel prevented me
from getting to all my mail recently... :-\
Have you checked the counters on the subnet manager to see if any
other errors are occurring? It might be good to clear all the
counters, run the job, and see if the counters are increasing faster
than they should (i.e., any particular counter should advance very
very slowly -- perhaps 1 per day or so).
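
To make the "clear, run, compare" check concrete, here is a minimal sketch of the comparison step. It assumes you have already captured two snapshots of the counters as plain dictionaries, by whatever means your subnet manager provides. The counter names, the `fast_counters` helper, and the sample values are hypothetical; the threshold of 1 increment per day is the rough figure mentioned above, not a hard rule.

```python
SECONDS_PER_DAY = 86400.0


def fast_counters(before, after, elapsed_seconds, max_per_day=1.0):
    """Return {counter_name: increments_per_day} for every counter whose
    extrapolated daily rate exceeds max_per_day.

    before / after are hypothetical snapshots: dicts mapping a counter
    name to its integer value before and after the job.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    flagged = {}
    for name, end_value in after.items():
        # A counter missing from 'before' is treated as freshly cleared (0).
        delta = end_value - before.get(name, 0)
        rate_per_day = delta * SECONDS_PER_DAY / elapsed_seconds
        if rate_per_day > max_per_day:
            flagged[name] = rate_per_day
    return flagged


# Hypothetical example: counters cleared to zero, then a one-hour job.
before = {"symbol_errors": 0, "link_downed": 0}
after = {"symbol_errors": 42, "link_downed": 0}
print(fast_counters(before, after, elapsed_seconds=3600))
```

Because the rate is extrapolated from the job's run time, a single increment during a short run will also be flagged; treat the output as a hint about where to look, not as proof of a fault.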
I'll ask around among the kernel-level guys (i.e., Roland) to see what else
could cause this kind of error.
On Nov 27, 2007, at 3:35 PM, Brock Palen wrote:
> OK, I will open a case with Cisco.
> Brock Palen
> Center for Advanced Computing
> On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote:
>> Brock Palen wrote:
>>>>> What would be a place to look? Should this just be the default in
>>>>> OMPI, then? ompi_info shows the default as 10 seconds. Is 'seconds'
>>>>> right?
>>>> The other IB guys can probably answer better than I can -- I'm not an
>>>> expert in this part of IB (or really any part, I guess :). Not sure
>>>> a larger value isn't the default. No, it's not seconds -- check the
>>>> description of the MCA parameter:
>>>> 4.096 microseconds * (2^btl_openib_ib_timeout)
>>> You sure?
>>> ompi_info --param btl openib
>>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
>>> InfiniBand transmit timeout, in seconds
>>> (must be >= 1)
>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
>> InfiniBand transmit timeout, plugged into formula:
>> 4.096 microseconds * (2^btl_openib_ib_timeout)
>> (must be >= 0 and <= 31)
>> Reading earlier in the thread, you said OMPI v1.2.0; I got this from a
>> trunk checkout that's around 3 weeks old. A quick check shows this
>> description was changed between 1.2.0 and 1.2.1. However the use of
>> this parameter hasn't changed -- it's simply passed along to IB verbs
>> when creating a queue pair (aka a connection).
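
To put numbers on the formula quoted above, here is a small sketch that evaluates 4.096 microseconds * (2^btl_openib_ib_timeout) for a few exponents. The helper name `ib_timeout_seconds` is made up for illustration, and the 0..31 range check simply mirrors the parameter description from the trunk checkout; this is arithmetic only, not a statement about how Open MPI or the verbs layer handles any particular value.

```python
def ib_timeout_seconds(exponent: int) -> float:
    """Evaluate 4.096 microseconds * 2**exponent and return it in seconds.

    The allowed range (0..31) is taken from the quoted MCA parameter
    description; the function name itself is hypothetical.
    """
    if not 0 <= exponent <= 31:
        raise ValueError("exponent must be >= 0 and <= 31")
    return 4.096e-6 * (2 ** exponent)


# The default of 10 discussed in the thread, plus two larger exponents
# chosen only for comparison.
for exponent in (10, 14, 20):
    seconds = ib_timeout_seconds(exponent)
    print(f"btl_openib_ib_timeout={exponent}: {seconds:.6f} s")
```

With the default of 10 this works out to roughly 4.2 milliseconds, not 10 seconds, which is why the reworded parameter description matters. Each step up in the exponent doubles the timeout.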
>> users mailing list