Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] threading bug?
From: Douglas Guptill (douglas.guptill_at_[hidden])
Date: 2009-03-06 06:52:26


I once had a crash in libpthread much like the one below. The very
un-obvious cause was a stack overflow on subroutine entry, triggered by a
large automatic array.
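
A minimal C sketch of that failure mode (illustrative only; the function
name and the 128 MB size below are made up, and the original code was
Fortran): a large automatic array lives on the stack, so merely entering
the routine can exceed the stack limit and die with SIGSEGV, often with
the top frame in libpthread or libc rather than in your own code. Typical
cures are allocating the array on the heap, raising the limit with
"ulimit -s", or, with ifort, compiling with -heap-arrays.

    /* Hypothetical example: a large automatic (stack) array.          */
    /* With a typical 8-10 MB stack limit, calling sum_big() crashes   */
    /* with SIGSEGV before any "real" work happens.                    */
    #include <stdio.h>

    #define N (16 * 1024 * 1024)    /* 16M doubles = 128 MB on the stack */

    static double sum_big(void)
    {
        double a[N];                /* automatic array -> placed on the stack */
        for (long i = 0; i < N; i++)
            a[i] = (double)i;
        return a[N - 1];
    }

    int main(void)
    {
        printf("%f\n", sum_big()); /* heap allocation (malloc) would be safe here */
        return 0;
    }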

HTH,
Douglas.

On Wed, Mar 04, 2009 at 03:04:20PM -0500, Jeff Squyres wrote:
> On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:
>
> >I am using Intel lc_prof-11 (and its own MKL) and have built
> >openmpi-1.3.1 with the configure options "FC=ifort F77=ifort CC=icc
> >CXX=icpc". Then I built my application.
> >The Linux box is a 2 x amd64 quad-core. In the middle of a run of my
> >application (after some 15 iterations), I receive the message below
> >and the run stops.
> >I tried to configure openmpi using "--disable-mpi-threads" but it
> >automatically assumes "posix".
>
> This doesn't sound like a threading problem, thankfully. Open MPI has
> two levels of threading issues:
>
> - whether MPI_THREAD_MULTIPLE is supported or not (which is what
>   --enable|disable-mpi-threads does)
> - whether thread support is present at all on the system (e.g.,
>   Solaris or POSIX threads)
>
> You see "posix" in the configure output mainly because OMPI still
> detects that posix threads are available on the system. It doesn't
> necessarily mean that threads will be used in your application's run.
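
As an illustration of that first level: a program can query at run time
what thread support the library actually grants, regardless of how it was
configured. A minimal C sketch might look like:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Ask for the highest level; the library reports what it really gives. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            printf("No MPI_THREAD_MULTIPLE; provided level = %d\n", provided);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with mpirun, this prints a note on each rank
if MPI_THREAD_MULTIPLE is not available.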
>
> >This problem does not happen in openmpi-1.2.9.
> >Any comment is highly appreciated.
> >Best regards,
> > mahmoud payami
> >
> >
> >[hpc1:25353] *** Process received signal ***
> >[hpc1:25353] Signal: Segmentation fault (11)
> >[hpc1:25353] Signal code: Address not mapped (1)
> >[hpc1:25353] Failing at address: 0x51
> >[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
> >[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae350d96]
> >[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae3514a8]
> >[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2aaaaeb7c72a]
> >[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2aaaab42b7d9]
> >[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae34d27c]
> >[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2aaaaaf46010]
> >[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2aaaaacd6af4]
> >[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
> >[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
> >[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
> >[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
> >[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
> >[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
> >[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
> >[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
> >[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
> >[hpc1:25353] *** End of error message ***
> >--------------------------------------------------------------------------
> >mpirun noticed that process rank 6 with PID 25353 on node hpc1
> >exited on signal 11 (Segmentation fault).
> >--------------------------------------------------------------------------
>
> What this stack trace tells us is that Open MPI crashed somewhere
> while trying to use shared memory for message passing, but it doesn't
> really tell us much else. It's not clear, either, whether this is
> OMPI's fault or your app's fault (or something else).
>
> Can you run your application through a memory-checking debugger to see
> if anything obvious pops out?
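
If valgrind is available, one common way to do that is to launch each rank
under it and watch for "Invalid read" / "Invalid write" reports from the
rank that crashes, roughly (the rank count and program arguments here are
placeholders):

    mpirun -np 8 valgrind ./pw.x [usual arguments]

Expect the run to be much slower under valgrind.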
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users