On 5/21/2010 6:54 AM, José Ignacio Aliaga Estellés wrote:
> We have used the lspci -vvxxx and we have obtained:
> bi00: 04:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit
> Ethernet Controller (Copper) (rev 02)
This is the output for the Intel GigE NIC. You should look at the one
for the Myricom NIC and the PCI bridge above it (use lspci -t to see the
tree).
> bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
PERR- status means no parity error was detected when receiving data.
Looking at the PERR status of the PCI bridge on the other side will show
if there was any corruption on that bus.
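If you want to check many nodes at once, a small script can pull the
flags out of each status line for you. The following is only a rough
sketch in Python, not a tested tool: the function names are made up for
illustration, and it assumes the whole status line has been joined onto
a single line (your mail client wrapped it above). It only looks at the
<PERR flag discussed here.

```python
import re

# Matches flags such as "Cap+", ">TAbort-" or "<PERR+".
# Fields without a +/- suffix (e.g. "DEVSEL=medium") are skipped.
FLAG_RE = re.compile(r"([<>]?[A-Za-z0-9]+)([+-])(?=\s|$)")

# Accepts "Status:" as well as a "Secondary status:" line, with or
# without a host prefix such as "bi00:" in front.
STATUS_RE = re.compile(r"[Ss]tatus:\s*(.*)")


def parse_status_flags(line):
    """Return {flag_name: bool} for the +/- flags on one lspci status line."""
    match = STATUS_RE.search(line)
    if match is None:
        return {}
    return {name: sign == "+" for name, sign in FLAG_RE.findall(match.group(1))}


def parity_error_detected(line):
    """True if the status line reports <PERR+ (parity error detected)."""
    return parse_status_flags(line).get("<PERR", False)


if __name__ == "__main__":
    sample = ("bi00: Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium "
              ">TAbort- <TAbort- <MAbort- >SERR- <PERR-")
    print(parity_error_detected(sample))
```

You would feed it the status lines of the Myricom NIC and of the bridge
above it, rather than the Intel NIC line used as the sample here.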
As a first step, you can see if you can reproduce errors with a simple
test involving a single node at a time. You can run "gm_allsize
--verify" on each machine: it will send packets to itself (loopback in
the switch) and check for corruption. If you don't see errors after a
while, that node is probably clean. If you see errors, you can look
deeper at the lspci output to see if it's a PCI problem. If you are using
a riser card, you can try without it.
I am not sure if Open MPI has an option to enable a debug checksum, but
if it does, it would also be useful to see whether it detects anything.
> Additionally, if you know any software tool or methodology to check the
> hardware/software, please, could you send us how to do it?
You may want to look at the FAQ on GM troubleshooting:
Additionally, you can send email to help_at_[hidden] to open a ticket.