Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] "casual" error
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-03-05 18:23:50


Absolutely :) The most recent entries on the stack (frames [1] and [2])
are in OPAL (one of the Open MPI libraries), inside the malloc that
libopen-pal provides, and that is where the segfault was raised.
Everything else indicates how the program got there. What I can tell
from this stack trace is the following: the problem started in your
function wait_thread, which called a function from libstdc++ (based on
the C++ naming conventions and the name from the stack,
_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode,
I guess it was open), which called some undetermined function from
libc, which called malloc ... which segfaulted.

It is pretty strange to segfault in a standard function. They are
usually pretty well protected, unless you do something blatantly
wrong (such as corrupting memory). I suggest using a memory checker
such as valgrind to check the memory consistency of your application.

   george.

On Mar 5, 2009, at 17:37 , Biagio Lucini wrote:

> We have an application that runs for a very long time with 16
> processes (of the order of a few months; we do have checkpoints,
> but that is not the issue here). It has now failed twice with the
> error message appended below, each time after running undisturbed
> for 20-25 days. The error is not systematically reproducible, and I
> believe this is not just because the program is parallel. We use
> openmpi-1.2.5 as distributed in the RH 5.2 clone Scientific Linux,
> on which our cluster is based. Does this stack suggest anything to
> eyes more trained than mine?
>
> Many thanks,
> Biagio Lucini
>
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> [node20:04178] *** Process received signal ***
> [node20:04178] Signal: Segmentation fault (11)
> [node20:04178] Signal code: Address not mapped (1)
> [node20:04178] Failing at address: 0x2aaadb8b31a0
> [node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
> [node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
> [node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
> [node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
> [node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
> [node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
> [node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
> [node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
> [node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
> [node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
> [node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
> [node20:04178] *** End of error message ***
> mpirun noticed that job rank 0 with PID 4152 on node node19 exited
> on signal 15 (Terminated).