We have a legacy application that runs fine on our cluster using Intel MPI with hundreds of cores. We ported it to Open MPI so that we could use BLCR; the ported application also runs fine, but checkpointing is not working properly:

 

1. When we checkpoint with more than one core, each MPI rank reports a segmentation fault and the ompi-checkpoint command does not return. For example, with two cores we get:

[tscco28017:16352] *** Process received signal ***
[tscco28017:16352] Signal: Segmentation fault (11)
[tscco28017:16352] Signal code: Address not mapped (1)
[tscco28017:16352] Failing at address: 0x7fffef51
[tscco28017:16353] *** Process received signal ***
[tscco28017:16353] Signal: Segmentation fault (11)
[tscco28017:16353] Signal code: Address not mapped (1)
[tscco28017:16353] Failing at address: 0x7fffef51
[tscco28017:16353] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
[tscco28017:16353] [ 1] [0xf500b0]
[tscco28017:16353] *** End of error message ***
[tscco28017:16352] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
[tscco28017:16352] [ 1] [0xf500b0]
[tscco28017:16352] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 16353 on node tscco28017 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

When I open a resulting core file in the TotalView debugger (I assume it is from the rank 0 process), TotalView reports a null frame pointer and a trashed stack (gdb shows a backtrace of about 30 frames but no debug information).

 

2. Checkpointing the legacy program on 1 core works.

3. Checkpointing a simple test program on 16 cores works (see the sketch of the kind of test we ran below).
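
For reference, by "simple test program" we mean something along the lines of the sketch below: a long-running compute loop with a periodic collective. This is an illustrative reconstruction, not the exact code; the loop bounds, the sleep, and the use of MPI_Allreduce are assumptions.

/* Minimal checkpoint test: dummy work plus a periodic collective,
 * kept alive long enough for ompi-checkpoint to be run against it.
 * (Illustrative sketch only; loop count, sleep, and MPI_Allreduce
 * are assumptions, not the actual legacy application.) */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, iter;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 600; iter++) {
        local = (double)(rank + 1) * iter;          /* dummy work */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (rank == 0 && iter % 60 == 0)
            printf("iter %d on %d ranks: sum = %f\n", iter, size, global);
        sleep(1);   /* keep the job running so it can be checkpointed */
    }

    MPI_Finalize();
    return 0;
}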


Can you suggest how to debug this problem?

 

Some additional information:

 

·        I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

·        We are using Open MPI 1.4.4 with BLCR 0.8.4

·        Open MPI and the application were both compiled on the same machine with the Intel icc 12.0.4 compiler.

·        For the failing example, both MPI processes run on cores of the same node.

·        I have attached “ompi_info.txt”

·        We’re running on a single Xeon 5150 node with Gigabit Ethernet.

·        [Reuti: previously I reported a problem involving illegal instructions, but that turned out to be a build problem. Sorry I didn't answer your response to my previous thread; I was having trouble accessing this mailing list at the time.]

 

Thanks

Tom