Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] threading bug?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-04 15:04:20


On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:

> I am using Intel lc_prof-11 (and its own MKL) and have built
> openmpi-1.3.1 with the configure options "FC=ifort F77=ifort CC=icc
> CXX=icpc". Then I built my application.
> The Linux box is a 2 x AMD64 quad-core. In the middle of running my
> application (after some 15 iterations), I receive the message below
> and it stops.
> I tried to configure Open MPI using "--disable-mpi-threads" but it
> automatically assumes "posix".

This doesn't sound like a threading problem, thankfully. Open MPI has
two separate levels of threading support:

- whether MPI_THREAD_MULTIPLE is supported or not (which is what
  --enable|disable-mpi-threads controls)
- whether thread support is present at all on the system (e.g.,
  Solaris or POSIX threads)

You see "posix" in the configure output mainly because OMPI still
detects that posix threads are available on the system. It doesn't
necessarily mean that threads will be used in your application's run.
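
If you want to double-check at run time what your build actually
provides, a minimal sketch along these lines (my example, not anything
from your application) prints the thread level that the library reports
via the standard MPI_Init_thread call:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request the highest level; "provided" is what the library
       actually supports (MPI_THREAD_SINGLE up to MPI_THREAD_MULTIPLE). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
        printf("provided thread level: %d (MPI_THREAD_MULTIPLE is %d)\n",
               provided, MPI_THREAD_MULTIPLE);
    }
    MPI_Finalize();
    return 0;
}

With MPI_THREAD_MULTIPLE disabled at configure time you should see a
lower level reported here, even though configure still found POSIX
threads.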

> This problem does not happen in openmpi-1.2.9.
> Any comment is highly appreciated.
> Best regards,
> mahmoud payami
>
>
> [hpc1:25353] *** Process received signal ***
> [hpc1:25353] Signal: Segmentation fault (11)
> [hpc1:25353] Signal code: Address not mapped (1)
> [hpc1:25353] Failing at address: 0x51
> [hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
> [hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae350d96]
> [hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae3514a8]
> [hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2aaaaeb7c72a]
> [hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2aaaab42b7d9]
> [hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae34d27c]
> [hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2aaaaaf46010]
> [hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2aaaaacd6af4]
> [hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
> [hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
> [hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
> [hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
> [hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
> [hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
> [hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
> [hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
> [hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
> [hpc1:25353] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 25353 on node hpc1
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------

What this stack trace tells us is that Open MPI crashed somewhere in
its shared-memory message passing code (the sm BTL), but it doesn't
really tell us much else. It's also not clear whether the underlying
problem is in OMPI, in your application, or somewhere else entirely.

Can you run your application through a memory-checking debugger to see
if anything obvious pops out?
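
For example, with Valgrind installed, something along these lines
usually works under mpirun (adjust the process count and the pw.x
arguments to whatever you normally use):

  mpirun -np <nprocs> valgrind /opt/QE131_cc/bin/pw.x <your usual arguments>

It will run a lot slower, but any "Invalid read" / "Invalid write"
reported from inside pw.x or the MPI libraries would be a good place to
start looking.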

-- 
Jeff Squyres
Cisco Systems