
Open MPI User's Mailing List Archives


Subject: [OMPI users] Segmentation fault when checkpointing
From: Linton, Tom (tom.linton_at_[hidden])
Date: 2012-03-29 13:12:47

We have a legacy application that runs fine on our cluster under Intel MPI with hundreds of cores. We ported it to Open MPI so that we could use BLCR; the application itself still runs fine, but checkpointing is not working properly:

1. When we checkpoint with more than one core, every MPI rank reports a segmentation fault and the ompi-checkpoint command never returns. For example, with two cores we get:
[tscco28017:16352] *** Process received signal ***
[tscco28017:16352] Signal: Segmentation fault (11)
[tscco28017:16352] Signal code: Address not mapped (1)
[tscco28017:16352] Failing at address: 0x7fffef51
[tscco28017:16353] *** Process received signal ***
[tscco28017:16353] Signal: Segmentation fault (11)
[tscco28017:16353] Signal code: Address not mapped (1)
[tscco28017:16353] Failing at address: 0x7fffef51
[tscco28017:16353] [ 0] /lib64/ [0x7ffff698e5d0]
[tscco28017:16353] [ 1] [0xf500b0]
[tscco28017:16353] *** End of error message ***
[tscco28017:16352] [ 0] /lib64/ [0x7ffff698e5d0]
[tscco28017:16352] [ 1] [0xf500b0]
[tscco28017:16352] *** End of error message ***
mpirun noticed that process rank 1 with PID 16353 on node tscco28017 exited on signal 11 (Segmentation fault).
When I open a resulting core file in the TotalView debugger (I assume it is from the rank 0 process), TotalView reports a null frame pointer and a trashed stack; gdb shows a backtrace with 30 frames but no debug information.

2. Checkpointing the legacy program with 1 core works.
3. Checkpointing a simple test program on 16 cores works.

Can you suggest how to debug this problem?
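In case it helps frame the question, here is the kind of core-file workflow I have been attempting (a sketch only; "program", "inputfile", and the core file name are placeholders matching the command line below, and the mpicc rebuild line assumes a C source):

```shell
# Rebuild with debug info so a core-file backtrace is symbolic
# (placeholder build line; substitute your real build command and flags):
mpicc -g -O0 -o program program.c

# Allow core dumps, then reproduce the crash:
ulimit -c unlimited
mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

# Inspect whatever core file the crash produced ("core.16353" is illustrative):
gdb ./program core.16353
# (gdb) bt              # backtrace
# (gdb) info registers  # check frame/stack pointers
# (gdb) frame 0         # inspect the faulting frame
```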

Some additional information:

* I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

* We are using Open MPI 1.4.4 with BLCR 0.8.4

* Open MPI and the application were both compiled on the same machine with the Intel icc 12.0.4 compiler

* For the failing example, both MPI processes run on cores of the same node.

* I have attached "ompi_info.txt"

* We're running on a single Xeon 5150 node with Gigabit Ethernet.

* [Reuti: I previously reported a problem involving illegal instructions, but that turned out to be a build problem. Sorry I didn't answer your response to my earlier thread; I was having trouble accessing this mailing list at the time.]
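For completeness, the checkpoint sequence we follow is the standard one for the Open MPI 1.4 series with BLCR (a sketch; the PID and snapshot handle below are illustrative, not from a real run):

```shell
# Terminal 1: start the job with checkpoint/restart support enabled
mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

# Terminal 2: checkpoint by passing ompi-checkpoint the PID of mpirun
# (illustrative PID; this is the step that hangs for us with >1 core)
ompi-checkpoint -v 16351

# Later: restart from the global snapshot handle ompi-checkpoint printed
# (illustrative handle name)
ompi-restart ompi_global_snapshot_16351.ckpt
```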