On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
> I think that I have found a bug on the implementation of GM collectives
> routines included in OpenMPI. The version of the GM software is 2.0.30
> for the PCI64 cards.
> I obtain the same problems when I use the 1.4.1 or the 1.4.2 version.
> Could you help me? Thanks.
We have been running the test you provided on 8 nodes for 4 hours and
haven't seen any errors. The setup used GM 2.0.30 and openmpi 1.4.2 on
PCI-X cards (M3F-PCIXD-2 aka 'D' cards). We do not have PCI64 NICs
anymore, and no machines with a PCI 64/66 slot.
One-bit errors are rarely a software problem; they are usually linked to
hardware corruption. Old PCI has a simple parity check but most
machines/BIOS of this era ignored reported errors. You may want to check
the lspci -vv output on your machines and see if SERR or PERR is set. You
can also try to reseat each NIC in its PCI slot, or use a different slot.
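
If you want to run the SERR/PERR check on every node, here is a rough
Python sketch. It assumes the usual "lspci -vv" layout from pciutils: an
unindented line per device, followed by an indented "Status:" line where
a trailing "+" means the bit is set. The function names are made up for
this example, so adjust it to whatever your machines actually print.

```python
import re
import subprocess

# Status-register flags to look for; in `lspci -vv` output they appear
# as ">SERR+" / "<PERR+" when set and ">SERR-" / "<PERR-" when clear.
ERROR_FLAGS = ("SERR", "PERR")


def find_pci_errors(lspci_text):
    """Map each device header line to the error flags set in its Status line."""
    errors = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts the description of a new device.
            device = line.strip()
        elif device and line.strip().startswith("Status:"):
            # Only the Status line is inspected: on the Control line,
            # "SERR+" merely means that SERR reporting is enabled.
            flags = [flag for flag in ERROR_FLAGS
                     if re.search(r"(?<![A-Za-z])" + flag + r"\+", line)]
            if flags:
                errors[device] = flags
    return errors


def scan_local_machine():
    """Run `lspci -vv` on this host and return the devices reporting errors."""
    output = subprocess.run(["lspci", "-vv"], capture_output=True,
                            text=True, check=True).stdout
    return find_pci_errors(output)
```

Calling scan_local_machine() on a node returns an empty dict when no
device reports SERR or PERR. Anything listed for the Myrinet NIC or a
bridge above it would be a good reason to suspect that slot or card.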
Hope it helps.