
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] "casual" error
From: Biagio Lucini (B.Lucini_at_[hidden])
Date: 2009-03-05 18:43:32

Many thanks for your help; it was not clear to me whether it was OPAL,
my application, or the standard C libraries causing the segfault. It
is already good news that the problem is not at the level of Open MPI,
since that would have meant upgrading the library. My first reaction
would be to say that there is nothing wrong with my code (which has
already passed the valgrind test) and that the problem lies in libc,
but I agree with you that this is a very unlikely possibility,
especially given that we do some remapping of the memory. Hence, I will
give the code a second look with valgrind and a third with Electric
Fence (efence), and see whether some bug has managed to survive the
extensive testing the code has undergone up to now.
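
For reference, a sketch of how the two checks mentioned above are typically run. The binary name ./k-string is taken from the stack trace further down; the flags and the preloaded library name are generic assumptions, not settings from this project.

```shell
# Run every MPI rank under valgrind; slow, but catches invalid
# reads/writes and use of uninitialised memory. One log file per
# process (%p expands to the PID):
mpirun -np 16 valgrind --leak-check=full --log-file=vg.%p.log ./k-string

# Electric Fence: preload it so buffer over/underruns segfault at the
# exact faulting access instead of corrupting memory silently:
LD_PRELOAD=libefence.so mpirun -np 16 ./k-string
```

Under valgrind a months-long run will slow down considerably, so it may only be practical on a shorter reproduction of the workload.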

Thanks again,

George Bosilca wrote:
> Absolutely :) The last few entries on the stack are from OPAL (one of
> the Open MPI libraries), which traps the segfault. Everything else
> indicates where the segfault happened. What I can tell from this stack
> trace is the following: the problem started in your function
> wait_thread, which called one of the functions from libstdc++ (based
> on the C++ name mangling and the name from the stack,
> _ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_ I guess it
> was basic_filebuf::open), which in turn called some undetermined
> function from libc ... which segfaulted.
> It is pretty strange to segfault in a standard function; they are
> usually pretty well protected, unless you do something blatantly
> wrong (such as messing up the memory). I suggest using a memory
> checker tool such as valgrind to check the memory consistency of your
> application.
> george.
> On Mar 5, 2009, at 17:37 , Biagio Lucini wrote:
>> We have an application that runs for a very long time with 16
>> processes (on the order of a few months; we do have checkpoints,
>> but that won't be the issue). Twice so far it has failed with the
>> error message appended below after running undisturbed for
>> 20-25 days. This error is not systematically reproducible, and I
>> believe this is not just because the program is parallel. We use
>> openmpi-1.2.5 as distributed in the RH 5.2-clone Scientific Linux,
>> on which our cluster is based. Does this stack suggest anything to
>> eyes more trained than mine?
>> Many thanks,
>> Biagio Lucini
>> -----------------------------------------------------------------------------------------------------------------------------------------
>> [node20:04178] *** Process received signal ***
>> [node20:04178] Signal: Segmentation fault (11)
>> [node20:04178] Signal code: Address not mapped (1)
>> [node20:04178] Failing at address: 0x2aaadb8b31a0
>> [node20:04178] [ 0] /lib64/ [0x2b5d9c3ebe80]
>> [node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/ [0x2b5d9ccb2f84]
>> [node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/ [0x2b5d9ccb4d93]
>> [node20:04178] [ 3] /lib64/ [0x2b5d9d77729a]
>> [node20:04178] [ 4] /usr/lib64/ [0x2b5d9bf05cb4]
>> [node20:04178] [ 5] /usr/lib64/ Ios_Openmode+0x83) [0x2b5d9beb45c3]
>> [node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
>> [node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
>> [node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
>> [node20:04178] [ 9] /lib64/ [0x2b5d9d7338b4]
>> [node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
>> [node20:04178] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 4152 on node node19 exited on
>> signal 15 (Terminated).
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]