Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] GM + OpenMPI bug ...
From: José Ignacio Aliaga Estellés (aliaga_at_[hidden])
Date: 2010-05-31 07:27:31


Hi,

We have run several tests to locate the problem. Some nodes do not
work correctly under gm_allsize -v, and we have isolated them.
On the good nodes, our broadcast test works correctly with MPICH-1,
but it still fails with Open MPI 1.4.2.

We would like to activate the parity error check to see whether this
option solves our problems, but we don't know how to do it.
Below we attach the output of the lspci command. We suppose that
this error check is not enabled.

Best regards,

   José i. Aliaga

==================
$ /sbin/lspci -vvxxx
...
02:03.0 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 03)
         Subsystem: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping+ SERR+ FastB2B-
         Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 64, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 217
         Region 0: Memory at fb000000 (32-bit, prefetchable) [size=16M]
         Expansion ROM at fce80000 [disabled] [size=512K]
00: c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00
10: 08 00 00 fb 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 c1 14 43 80
30: 00 00 e8 fc 00 00 00 00 00 00 00 00 0a 01 00 00
...
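
The Control and Status flags printed by lspci can be cross-checked against the raw bytes of the hex dump above. The following is a minimal sketch in Python, assuming the standard PCI configuration header layout (16-bit little-endian Command register at offset 0x04, Status register at offset 0x06); the helper names are illustrative only and are not part of any GM or Open MPI tool.

```python
# Decode the PCI Command (offset 0x04) and Status (offset 0x06) registers
# from row "00:" of the lspci hex dump. Assumes the standard PCI header layout.
ROW_00 = "c1 14 43 80 d6 01 20 04 03 00 80 02 10 40 00 00"

# Command register bits 0..9, named the way lspci prints them on the Control line.
COMMAND_BITS = ["I/O", "Mem", "BusMaster", "SpecCycle", "MemWINV",
                "VGASnoop", "ParErr", "Stepping", "SERR", "FastB2B"]


def read_le16(row, offset):
    """Read a 16-bit little-endian value at a byte offset of a hex row."""
    data = bytes.fromhex(row.replace(" ", ""))
    return int.from_bytes(data[offset:offset + 2], "little")


def decode_command(value):
    """Map each Command bit name to True (set) or False (clear)."""
    return {name: bool((value >> bit) & 1) for bit, name in enumerate(COMMAND_BITS)}


def decode_status_errors(value):
    """Extract the parity/system error bits of the Status register."""
    return {
        "master_data_parity_error": bool((value >> 8) & 1),   # "ParErr" on the Status line
        "signaled_system_error": bool((value >> 14) & 1),     # ">SERR"
        "detected_parity_error": bool((value >> 15) & 1),     # "<PERR"
    }


command = read_le16(ROW_00, 0x04)
status = read_le16(ROW_00, 0x06)
command_flags = decode_command(command)
status_flags = decode_status_errors(status)

print(f"Command = {command:#06x}, Status = {status:#06x}")
print("Parity error response enabled:", command_flags["ParErr"])
print("Status error bits:", status_flags)
```

Under that layout assumption, the bytes `d6 01` give Command = 0x01d6, in which bit 6 (parity error response) is set. This agrees with the ParErr+ flag on the Control line, so parity error response appears to be enabled on this device already, while none of the Status error bits are set.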

On 21/05/2010, at 19:57, Patrick Geoffray wrote:

> Hi Jose,
>
> On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
>> We have used the lspci -vvxxx and we have obtained:
>>
>> bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
>> Ethernet Controller (Copper) (rev 02)
>
> This is the output for the Intel GigE NIC, you should look at the
> one for the Myricom NIC and the PCI bridge above it (lspci -t to
> see the tree).
>
>> bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>
> PERR- status means no parity error was detected when receiving data.
> Looking at the PERR status of the PCI bridge on the other side will
> show if there was any corruption on that bus.
>
> As a first step, you can see if you can reproduce errors with a
> simple test involving a single node at a time. You can run
> "gm_allsize --verify" on each machine: it will send packets to
> itself (loopback in the switch) and check for corruption. If you
> don't see errors after a while, that node is probably clean. If you
> see errors, you can look deeper at lspci output to see if it's a
> PCI problem. If you are using a riser card, you can try without.
>
> I am not sure if Open MPI has an option to enable a debug checksum,
> but it would also be useful to see if it detects anything.
>
>> Additionally, if you know any software tool or methodology to check
>> the hardware/software, please, could you send us how to do it?
>
> You may want to look at the FAQ on GM troubleshooting:
> http://www.myri.com/cgi-bin/fom.pl?file=425
>
> Additionally, you can send email to help_at_[hidden] to open a ticket.
>
> Patrick
>
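
Patrick's advice to look at the PERR status of the Myricom NIC and of the PCI bridge above it can be applied to a whole `lspci -vv` listing at once. The sketch below is one possible way to do this in Python, assuming the unwrapped lspci text layout (one device header line, followed by indented `Status:` lines, and `Secondary status:` lines on bridges). The function name and the bridge entry in the sample text are hypothetical and serve only as illustration.

```python
import re

# Device header line, e.g. "02:03.0 Network controller: ..."
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (.+)$")

# Status-line flags that indicate a recorded parity or system error.
ERROR_FLAGS = ("ParErr+", ">SERR+", "<SERR+", "<PERR+")


def find_parity_flags(lspci_text):
    """Return {slot: [flags]} for devices whose status lines record an error."""
    flagged = {}
    slot = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            slot = header.group(1)
            continue
        stripped = line.strip()
        # Only status lines count: "ParErr+" on a Control line just means
        # that parity error response is enabled, not that an error occurred.
        if slot and stripped.startswith(("Status:", "Secondary status:")):
            hits = [flag for flag in ERROR_FLAGS if flag in stripped]
            if hits:
                flagged.setdefault(slot, []).extend(hits)
    return flagged


# Sample input: the first device is taken from the listing in this message;
# the bridge entry is made up to show what a recorded parity error looks like.
SAMPLE = """\
02:03.0 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 03)
\tControl: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping+ SERR+ FastB2B-
\tStatus: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
00:1e.0 PCI bridge: Example bridge (hypothetical entry)
\tStatus: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
"""

print(find_parity_flags(SAMPLE))
```

The text passed to `find_parity_flags` would normally be the captured output of `/sbin/lspci -vv` on each node. The email archive wraps long lines, so a listing pasted from a message would need to be unwrapped first.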