Open MPI User's Mailing List Archives

Subject: [OMPI users] "casual" error
From: Biagio Lucini (B.Lucini_at_[hidden])
Date: 2009-03-05 17:37:28


We have an application that runs for a very long time with 16 processes
(on the order of a few months; we do have checkpoints, but that is not
the issue here). Twice so far it has failed with the error message
appended below, each time after running undisturbed for 20-25 days. The
error is not systematically reproducible, and I believe this is not just
because the program is parallel. We use openmpi-1.2.5 as distributed in
Scientific Linux, the RH 5.2 clone on which our cluster is based. Does
this stack suggest anything to eyes more trained than mine?

Many thanks,
Biagio Lucini

-----------------------------------------------------------------------------------------------------------------------------------------

[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
[node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
[node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
[node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
[node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
[node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on signal 15 (Terminated).