Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RETRY EXCEEDED ERROR
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-03-05 16:58:17


> Thanks Pasha!
> ibdiagnet reports the following:
>
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Port localhost/P1 lid=0x00e2 guid=0x001e0bffff4ced75 dev=25218 can not join
> due to rate:2.5Gbps < group:10Gbps
>
> I guess this may indicate a bad adapter. Now, I just need to find what
> system this maps to.
>
I would guess it is a bad cable.
> I also ran ibcheckerrors and it reports a lot of problems with buffer
> overruns. Here's the tail end of the output, with only some of the last
> ports reported:
>
> #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
> #warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
> #warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
> #warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
> #warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
> #warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
> #warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
> #warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
> #warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
> #warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
> #warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
> #warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
> #warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED
>
> ## Summary: 209 nodes checked, 0 bad nodes found
> ## 716 ports checked, 103 ports have errors beyond threshold
>
>
It reports a lot of symbol errors. I recommend that you reset all these
counters (if I remember correctly, it is the -c flag in ibdiagnet) and
rerun the tests after the next MPI process failure.
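
To compare the counters from before the reset with those seen after the next failure, a small script can help rank the ports. The following is only a rough Python sketch: it assumes the "#warn:" line format shown in the quoted ibcheckerrors output above, and parse_warnings and worst_ports are names made up for illustration, not part of any InfiniBand tool.

```python
import re
from collections import defaultdict

# Matches "#warn:" lines in the format quoted above, e.g.
# "#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14".
# Using search() means a leading "> " quote prefix is tolerated.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+) \(threshold (\d+)\) lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port): {counter_name: value}} for every #warn line."""
    ports = defaultdict(dict)
    for line in text.splitlines():
        match = WARN_RE.search(line)
        if match:
            name, value, _threshold, lid, port = match.groups()
            ports[(int(lid), int(port))][name] = int(value)
    return dict(ports)


def worst_ports(ports, counter="SymbolErrors"):
    """List ((lid, port), value) pairs for one counter, highest value first."""
    hits = [(vals[counter], key) for key, vals in ports.items() if counter in vals]
    return [(key, value) for value, key in sorted(hits, reverse=True)]


# A few lines copied from the quoted output, used here as sample input.
SAMPLE = """\
#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
"""

if __name__ == "__main__":
    for (lid, port), value in worst_ports(parse_warnings(SAMPLE)):
        print(f"lid {lid} port {port}: SymbolErrors = {value}")
```

Running this on the saved output from both runs and comparing the two rankings should show which ports pick up new symbol errors after the reset, as opposed to ports that only carry old counts.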

Thanks,
Pasha