Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] GM + OpenMPI bug ...
From: José Ignacio Aliaga Estellés (aliaga_at_[hidden])
Date: 2010-05-21 06:54:24


Hi,

We have run lspci -vvxxx and obtained the following:

bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
bi00: Subsystem: Intel Corporation PRO/1000 XT Server Adapter
bi00: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
bi00: Latency: 64 (63750ns min), Cache Line Size: 64 bytes
bi00: Interrupt: pin A routed to IRQ 185
bi00: Region 0: Memory at fe9e0000 (64-bit, non-prefetchable) [size=128K]
bi00: Region 2: Memory at fe9d0000 (64-bit, non-prefetchable) [size=64K]
bi00: Region 4: I/O ports at dc80 [size=32]
bi00: Expansion ROM at fe9c0000 [disabled] [size=64K]
bi00: Capabilities: [dc] Power Management version 2
bi00: Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
bi00: Status: D0 PME-Enable- DSel=0 DScale=1 PME-
bi00: Capabilities: [e4] PCI-X non-bridge device
bi00: Command: DPERE- ERO+ RBC=512 OST=1
bi00: Status: Dev=04:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
bi00: Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
bi00: Address: 0000000000000000 Data: 0000
bi00: 00: 86 80 08 10 17 01 30 02 02 00 00 02 10 40 00 00
bi00: 10: 04 00 9e fe 00 00 00 00 04 00 9d fe 00 00 00 00
bi00: 20: 81 dc 00 00 00 00 00 00 00 00 00 00 86 80 07 11
bi00: 30: 00 00 9c fe dc 00 00 00 00 00 00 00 05 01 ff 00
bi00: 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bi00: d0: 00 00 00 00 00 00 00 00 00 00 00 00 01 e4 22 48
bi00: e0: 00 20 00 40 07 f0 02 00 08 04 43 04 00 00 00 00
bi00: f0: 05 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00

We don't know how to interpret this information. We assume that SERR
and PERR are not active, if we have understood the Status line
(" ... >SERR- <PERR-") correctly. Could you confirm that? If that is
the case, could you tell us how to enable them?
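
To make the question concrete, this is how we tried to decode the
Command register from the raw dump above; a minimal C sketch, assuming
we are reading bytes 0x04-0x05 of the dump correctly (little-endian,
"17 01" -> 0x0117) and that the standard PCI Command register bit
layout applies (bit 6 = Parity Error Response, bit 8 = SERR# Enable):

#include <stdio.h>

/* Sketch (our assumption): decode the PCI Command register of the
 * 82544EI NIC from the raw config-space dump above.  Offset 0x04,
 * little-endian bytes "17 01" give the value 0x0117.  Bit positions
 * follow the standard PCI Command register layout. */
int main(void)
{
    unsigned short command = 0x0117;  /* bytes 0x04-0x05 of the dump above */

    printf("I/O Space enable      : %d\n", (command >> 0) & 1);
    printf("Memory Space enable   : %d\n", (command >> 1) & 1);
    printf("Bus Master enable     : %d\n", (command >> 2) & 1);
    printf("Parity Error Response : %d\n", (command >> 6) & 1);  /* PERR reporting */
    printf("SERR# Enable          : %d\n", (command >> 8) & 1);  /* SERR reporting */
    return 0;
}

We are not sure whether these Command-register enable bits or the
">SERR- <PERR-" flags in the Status line are the ones we should be
looking at, so please correct us if we are decoding the wrong register.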

Additionally, if you know of any software tool or methodology to check
the hardware/software, could you please tell us how to use it?
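
So far, the only methodology we have on our side is to stress the
collectives and verify every element of the result; roughly the sketch
below (a simplified, hypothetical version of the kind of test we run,
not the exact code we sent earlier):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of a data-integrity check: repeat an MPI_Allreduce
 * on a buffer of known values and verify every element, so that a single
 * flipped bit in the result is reported immediately. */
#define N     4096
#define ITERS 10000

int main(int argc, char **argv)
{
    int rank, size, i, iter, errors = 0;
    int *sendbuf = malloc(N * sizeof(int));
    int *recvbuf = malloc(N * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < N; i++)
        sendbuf[i] = i;                      /* every rank contributes i */

    for (iter = 0; iter < ITERS; iter++) {
        MPI_Allreduce(sendbuf, recvbuf, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        for (i = 0; i < N; i++)
            if (recvbuf[i] != i * size) {    /* expected sum is i * nprocs */
                fprintf(stderr, "rank %d, iter %d: recvbuf[%d] = %d, expected %d\n",
                        rank, iter, i, recvbuf[i], i * size);
                errors++;
            }
    }

    if (errors)
        fprintf(stderr, "rank %d: %d corrupted elements\n", rank, errors);
    MPI_Finalize();
    return errors ? 1 : 0;
}

We launch it with something like "mpirun -np 8 ./allreduce_check" (the
binary name is just an example) over the Myrinet/GM interconnect.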

Thanks in advance.

Best regards,

   José I. Aliaga

On 20/05/2010, at 16:29, Patrick Geoffray wrote:

> Hi Jose,
>
> On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
>> I think that I have found a bug in the implementation of the GM
>> collective routines included in Open MPI. The version of the GM
>> software is 2.0.30 for the PCI64 cards.
>
>> I get the same problems with both the 1.4.1 and the 1.4.2 versions.
>> Could you help me? Thanks.
>
> We have been running the test you provided on 8 nodes for 4 hours
> and haven't seen any errors. The setup used GM 2.0.30 and Open MPI
> 1.4.2 on PCI-X cards (M3F-PCIXD-2, aka 'D' cards). We no longer have
> PCI64 NICs, nor any machines with a PCI 64/66 slot.
>
> One-bit errors are rarely a software problem; they are usually
> linked to hardware corruption. Old PCI has a simple parity check,
> but most machines/BIOSes of that era ignored reported errors. You may
> want to check the lspci output on your machines and see if SERR or
> PERR is set. You can also try to reseat each NIC in its PCI slot, or
> use a different slot if available.
>
> Hope it helps.
>
> Patrick
> --
> Patrick Geoffray
> Myricom, Inc.
> http://www.myri.com
>