Open MPI User's Mailing List Archives

From: Corwell, Sophia (secorwe_at_[hidden])
Date: 2007-06-11 14:55:28


Hi,

We are seeing the following issue with Iprobe on our clusters running
openmpi-1.2.2. Here is the code and related information:

=======
Modules currently loaded:

(sn31)/projects>module list
Currently Loaded Modulefiles:
  1) /opt/modules/oscar-modulefiles/default-manpath/1.0.1
  2) compilers/intel-9.1-f040-c045
  3) misc/env-openmpi-1.2
  4) mpi/openmpi-1.2.2_mx_intel-9.1-f040-c045
  5) libraries/intel-mkl
=======

Source code:

(sn31)/projects/>more probeTest.cc

#include <mpi.h>
#include <cassert>

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);

    const int rank = MPI::COMM_WORLD.Get_rank();
    const int size = MPI::COMM_WORLD.Get_size();
    const int sendProc = (rank + size - 1) % size;
    const int recvProc = (rank + 1) % size;
    const int tag = 1;

    // send an asynchronous message
    const int sendVal = 1;
    MPI::Request sendRequest =
        MPI::COMM_WORLD.Isend(&sendVal, 1, MPI_INT, recvProc, tag);

    // wait for message to arrive
    while (!MPI::COMM_WORLD.Iprobe(sendProc, tag)) {}  // this line causes problems

    // receive asynchronous message
    int recvVal;
    MPI::Request recvRequest =
        MPI::COMM_WORLD.Irecv(&recvVal, 1, MPI_INT, sendProc, tag);
    recvRequest.Wait();

    MPI::Finalize();
}
=======
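
For reference, here is a plain C-API version of the same test that we put
together as a cross-check (a sketch only; it uses our own variable names and
has not been run on the cluster yet). It polls MPI_Iprobe with an explicit
MPI_Status instead of going through the C++ Iprobe binding that shows up in
the backtraces below:

#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int sendProc = (rank + size - 1) % size;
    const int recvProc = (rank + 1) % size;
    const int tag = 1;

    // send an asynchronous message around the ring, as in probeTest.cc
    int sendVal = 1;
    MPI_Request sendRequest;
    MPI_Isend(&sendVal, 1, MPI_INT, recvProc, tag, MPI_COMM_WORLD,
              &sendRequest);

    // wait for the message, passing an explicit status rather than
    // relying on whatever the C++ binding supplies internally
    int flag = 0;
    MPI_Status probeStatus;
    while (!flag) {
        MPI_Iprobe(sendProc, tag, MPI_COMM_WORLD, &flag, &probeStatus);
    }

    // receive the message and tidy up
    int recvVal;
    MPI_Recv(&recvVal, 1, MPI_INT, sendProc, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&sendRequest, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

If this variant runs cleanly under openmpi-1.2.2, that would point at the
path from the C++ Iprobe binding into MPI_Iprobe rather than at MX itself;
if it crashes the same way, the C++ bindings are probably not the issue.
=======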

Compiled with:

(sn31)/projects>/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/bin/mpicxx -I/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/include -g -c -o probeTest.o probeTest.cc

(sn31)/projects>/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/bin/mpicxx -g -o probeTest -L/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib probeTest.o -lmpi

/projects/global/x86_64/compilers/intel/intel-9.1-cce-045/lib/libimf.so: warning: warning: feupdateenv is not implemented and will always fail

=======

Error at runtime:

(sn31)/projects>mpiexec -n 1 ./probeTest
[sn31:17616] *** Process received signal ***
[sn31:17616] Signal: Segmentation fault (11)
[sn31:17616] Signal code: Address not mapped (1)
[sn31:17616] Failing at address: 0x8
[sn31:17616] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
[sn31:17616] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
[sn31:17616] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
[sn31:17616] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
[sn31:17616] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
[sn31:17616] [ 5] ./probeTest(main+0x147) [0x40480b]
[sn31:17616] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
[sn31:17616] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
[sn31:17616] *** End of error message ***
mpiexec noticed that job rank 0 with PID 17616 on node sn31 exited on signal 11 (Segmentation fault).

(sn31)/projects/ceptre/sdpautz/NWCC/temp>mpiexec -n 2 ./probeTest
[sn31:17621] *** Process received signal ***
[sn31:17620] *** Process received signal ***
[sn31:17620] Signal: Segmentation fault (11)
[sn31:17620] Signal code: Address not mapped (1)
[sn31:17620] Failing at address: 0x8
[sn31:17620] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
[sn31:17620] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
[sn31:17620] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
[sn31:17620] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
[sn31:17620] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
[sn31:17620] [ 5] ./probeTest(main+0x147) [0x40480b]
[sn31:17620] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
[sn31:17620] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
[sn31:17620] *** End of error message ***
[sn31:17621] Signal: Segmentation fault (11)
[sn31:17621] Signal code: Address not mapped (1)
[sn31:17621] Failing at address: 0x8
[sn31:17621] [ 0] /lib64/tls/libpthread.so.0 [0x2a9665a4f0]
[sn31:17621] [ 1] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_iprobe+0x81) [0x2a9980b305]
[sn31:17621] [ 2] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/openmpi/mca_pml_cm.so(mca_pml_cm_iprobe+0x1f) [0x2a995eb817]
[sn31:17621] [ 3] /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/lib/libmpi.so.0(MPI_Iprobe+0xef) [0x2a956d363f]
[sn31:17621] [ 4] ./probeTest(_ZNK3MPI4Comm6IprobeEii+0x3a) [0x4046aa]
[sn31:17621] [ 5] ./probeTest(main+0x1ad) [0x404871]
[sn31:17621] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a967803fb]
[sn31:17621] [ 7] ./probeTest(_ZNSt8ios_base4InitD1Ev+0x3a) [0x4038ca]
[sn31:17621] *** End of error message ***
mpiexec noticed that job rank 0 with PID 17620 on node sn31 exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

=======

Additional Information:

It appears that the call to Iprobe causes the problem; if that line is
taken out, the code completes normally. Failures also occur with the gcc
compilers.
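
Another narrowing step we have not tried yet (just a guess, based on the
two-argument C++ Iprobe overload appearing in the backtrace) is to call the
overload that takes an explicit MPI::Status, along these lines:

    // sketch only: same wait loop as probeTest.cc, but with the
    // explicit-status Iprobe overload instead of the two-argument one
    MPI::Status probeStatus;
    while (!MPI::COMM_WORLD.Iprobe(sendProc, tag, probeStatus)) {}

We do not know yet whether this behaves any differently; it is only meant
to separate the two Iprobe overloads from the underlying MPI_Iprobe call.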

MPICH appears to work, at least with the Intel compiler.

=======

Hardware information:

[root_at_spirit1 ~]# mx_info -q
MX Version: 1.2.1-rc20
MX Build: root_at_[hidden]:/projects/global/src/myricom/mx-1.2.1-rc20 Thu Jun 7 17:08:02 MDT 2007
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
        8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 333.2 MHz LANai, 133.3 MHz PCI bus, 4 MB SRAM
        Status: Running, P0: Link up, P1: Link up
        Network: Myrinet 2000
 
        MAC Address: 00:60:dd:48:ba:ae
        Product code: M3F2-PCIXE-4
        Part number: 09-02878
        Serial number: 219851
        Mapper (P0): 00:60:dd:48:c0:08, version = 0x01920f75, configured
        Mapped hosts: 506
        Mapper (P1): 00:60:dd:48:c0:08, version = 0x01920f75, configured
        Mapped hosts: 506
 

cat /apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx/BUILD_ENV
# Build Environment:
USE="doc icc modules mx torque"
COMPILER="intel-9.1-f040-c045"
CC="icc"
CXX="icpc"
CLINKER="icc"
FC="ifort"
F77="ifort"
CFLAGS=" -O3 -pipe"
CXXFLAGS=" -O3 -pipe"
FFLAGS=" -O3"
MODULE_DEST="/apps/modules/modulefiles/mpi"
MODULE_FILE="openmpi-1.2.2_mx_intel-9.1-f040-c045"
INSTALL_DEST="/apps/x86_64/mpi/openmpi/intel-9.1-f040-c045/openmpi-1.2.2_mx"
CONF_FLAGS=" --with-mx=/opt/mx --with-tm=/apps/torque"
=======

Thanks in advance for any help/advice you can provide.

-Sophia