Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] RETRY EXCEEDED ERROR
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-03-05 16:58:17


> Thanks Pasha!
> ibdiagnet reports the following:
>
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Port localhost/P1 lid=0x00e2 guid=0x001e0bffff4ced75 dev=25218 can not join
> due to rate:2.5Gbps < group:10Gbps
>
> I guess this may indicate a bad adapter. Now, I just need to find what
> system this maps to.
>
I would guess it is a bad cable.
> I also ran ibcheckerrors and it reports a lot of problems with buffer
> overruns. Here's the tail end of the output, with only some of the last
> ports reported:
>
> #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
> #warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
> #warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
> #warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
> #warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
> #warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
> #warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
> #warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
> #warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
> #warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
> #warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
> #warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
> #warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED
>
> ## Summary: 209 nodes checked, 0 bad nodes found
> ## 716 ports checked, 103 ports have errors beyond threshold
>
>
It reports a lot of symbol errors. I recommend resetting all of these
counters (if I remember correctly, it is the -c flag in ibdiagnet) and
rerunning the checks after the MPI process fails.
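
To compare the report before and after a reset, it may help to reduce the "#warn" lines to a per-port summary. The following is only a rough sketch in Python, assuming the output keeps the line format quoted above; the names parse_warnings and worst_ports are made up for this example and are not part of any InfiniBand tool.

```python
import re

# Matches lines such as:
#   #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
# The pattern is searched for anywhere in the line, so "> " quoting is tolerated.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+) \(threshold (\d+)\) lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port): {counter_name: value}} for every #warn line."""
    ports = {}
    for line in text.splitlines():
        match = WARN_RE.search(line)
        if match:
            name, value, _threshold, lid, port = match.groups()
            ports.setdefault((int(lid), int(port)), {})[name] = int(value)
    return ports


def worst_ports(ports, counter):
    """Return [((lid, port), value), ...] for one counter, highest value first."""
    hits = [(key, counters[counter])
            for key, counters in ports.items() if counter in counters]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# A few lines copied from the report quoted earlier in this mail.
SAMPLE = """\
> #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
> #warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
> #warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
"""

if __name__ == "__main__":
    summary = parse_warnings(SAMPLE)
    for (lid, port), value in worst_ports(summary, "SymbolErrors"):
        print(f"lid {lid} port {port}: SymbolErrors = {value}")
```

Running it on a report saved right after the failure, with the counters cleared beforehand, should show only the ports that picked up errors during the job.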

Thanks,
Pasha