Open MPI User's Mailing List Archives

Subject: [OMPI users] "casual" error
From: Biagio Lucini (B.Lucini_at_[hidden])
Date: 2009-03-05 17:37:28

We have an application that runs for a very long time on 16 processes
(of the order of a few months; we do have checkpoints, but that is not
the issue here). Twice so far it has failed with the error message
appended below, each time after running undisturbed for 20-25 days. The
error is not systematically reproducible, and I believe this is not just
because the program is parallel. We use openmpi-1.2.5 as distributed
with Scientific Linux, the RH 5.2 clone on which our cluster is based.
Does this stack trace suggest anything to eyes more trained than mine?

Many thanks,
Biagio Lucini


[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/ [0x2b5d9c3ebe80]
[node20:04178] [ 1]
[node20:04178] [ 2]
[node20:04178] [ 3] /lib64/ [0x2b5d9d77729a]
[node20:04178] [ 4]
[node20:04178] [ 5]
Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on
signal 15 (Terminated).