Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI_ALLREDUCE: Segmentation Fault
From: Timothy Stitt (Timothy.Stitt.9_at_[hidden])
Date: 2011-06-02 15:27:41


Hi all,

We have a code built with Open MPI (v1.4.3) and the Intel v12.0 compiler that has been tested successfully on tens to hundreds of cores on our cluster. We recently ran the same code on 1020 cores and received the following runtime error:

> [d6cneh042:28543] *** Process received signal ***
> [d6cneh061:29839] Signal: Segmentation fault (11)
> [d6cneh061:29839] Signal code: Address not mapped (1)
> [d6cneh061:29839] Failing at address: 0x10
> [d6cneh030:26800] Signal: Segmentation fault (11)
> [d6cneh030:26800] Signal code: Address not mapped (1)
> [d6cneh030:26800] Failing at address: 0x21
> [d6cneh042:28543] Signal: Segmentation fault (11)
> [d6cneh042:28543] Signal code: Address not mapped (1)
> [d6cneh042:28543] Failing at address: 0x10
> [d6cneh021:27646] [ 0] /lib64/libpthread.so.0 [0x39aee0eb10]
> [d6cneh021:27646] [ 1] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8bca8]
> [d6cneh021:27646] [ 2] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8a1ef]
> [d6cneh021:27646] [ 3] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c16246]
> [d6cneh021:27646] [ 4] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libopen-pal.so.0(opal_progress+0x86) [0x2af8b22a6a26]
> [d6cneh021:27646] [ 5] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c879e7]
> [d6cneh021:27646] [ 6] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1f701]
> [d6cneh021:27646] [ 7] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1aec9]
> [d6cneh021:27646] [ 8] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0(MPI_Allreduce+0x73) [0x2af8b1be6203]
> [d6cneh021:27646] [ 9] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi_f77.so.0(MPI_ALLREDUCE+0xc5) [0x2af8b1977715]
> [d6cneh021:27646] [10] openmd_MPI [0x5e0b94]
> [d6cneh021:27646] [11] openmd_MPI [0x599877]
> [d6cneh021:27646] [12] openmd_MPI [0x5746e8]
> [d6cneh021:27646] [13] openmd_MPI [0x4f18b8]

Can anyone give some insight into the issue? I should note (as it may be relevant) that this job was run across a heterogeneous cluster of Intel Nehalem servers with a mixture of InfiniBand and Ethernet connections. Open MPI itself was built without any IB libraries (so I am assuming everything defaults to the TCP transport?).

Thanks in advance for any insight that may help us identify the issue.

Regards,

Tim.

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email: tstitt_at_[hidden]