
Subject: Re: [OMPI users] GM + OpenMPI bug ...
From: Patrick Geoffray (patrick_at_[hidden])
Date: 2010-05-20 10:29:27

Hi Jose,

On 5/12/2010 10:57 PM, José Ignacio Aliaga Estellés wrote:
> I think that I have found a bug in the implementation of the GM collective
> routines included in Open MPI. The version of the GM software is 2.0.30
> for the PCI64 cards.

> I get the same problems whether I use version 1.4.1 or 1.4.2.
> Could you help me? Thanks.

We have been running the test you provided on 8 nodes for 4 hours and
have not seen any errors. The setup used GM 2.0.30 and Open MPI 1.4.2 on
PCI-X cards (M3F-PCIXD-2, aka 'D' cards). We no longer have any PCI64
NICs, nor any machines with a PCI 64/66 slot.

One-bit errors are rarely a software problem; they usually point to
hardware corruption. Old PCI has only a simple parity check, and most
machines/BIOSes of that era ignored reported errors. You may want to check
the lspci output on your machines to see whether SERR or PERR is set. You
can also try reseating each NIC in its PCI slot, or using a different slot
if one is available.
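
For reference, here is a rough sketch of how you could scan for those
status bits across all devices. It assumes a Linux host with lspci
(pciutils) and Python 3 available, and that you run it with enough
privileges for lspci -vv to show the Status: lines; the exact wording of
that line can vary between pciutils versions, so treat the parsing as a
starting point rather than a finished tool.

  #!/usr/bin/env python3
  # Sketch: flag PCI devices whose "lspci -vv" Status line reports a
  # signalled system error (>SERR+) or a detected parity error (<PERR+).
  # May need to be run as root for full register output.
  import subprocess

  def find_pci_errors():
      out = subprocess.check_output(["lspci", "-vv"], text=True)
      device = None
      flagged = []
      for line in out.splitlines():
          if line and not line[0].isspace():
              # Device header line, e.g. "02:01.0 Ethernet controller: ..."
              device = line.strip()
          elif "Status:" in line and (">SERR+" in line or "<PERR+" in line):
              flagged.append((device, line.strip()))
      return flagged

  if __name__ == "__main__":
      for dev, status in find_pci_errors():
          print(dev)
          print("    " + status)

If any NIC (or the bridge above it) shows SERR+ or PERR+, that would be
consistent with the kind of hardware-level corruption described above.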

Hope it helps.


Patrick Geoffray
Myricom, Inc.