We have a legacy application that runs fine on our cluster using Intel MPI. We ported it to Open MPI so that we could use BLCR. It runs fine, but checkpointing is not working properly:
1. When we checkpoint with more than 1 core, the run aborts with the error:
mpirun noticed that process rank 1 with PID 8260 on node tscco28017 exited on signal 4 (Illegal instruction).
2. Checkpointing with 1 core works.
3. We have a simple test program that exercises MPI on multiple cores, and it checkpoints fine on multiple cores.
Can you suggest how to debug this problem?
Some additional information:
1. I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
2. When I checkpoint it, I see that the checkpoint directories are created, but the file “global_snapshot_meta.data” is not complete, there is no restart-appfile, the “snapshot_meta.data” files are not complete, and there are no dump files for the individual processes.
3. The command “ompi-checkpoint” doesn’t return; I have to press Ctrl-C to kill it after checkpointing.
4. We are using Open MPI 1.4.4 with BLCR 0.8.4
5. I have attached “ompi_info.txt”
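
To make item 2 above easier to reproduce, here is a rough sketch of how the snapshot directory can be inspected after a checkpoint attempt. The function name check_snapshot is made up for illustration, and the '*context*' pattern for the per-process dump files is only an assumption about how BLCR names them. The script checks only that each kind of file exists and is non-empty, not that its contents are complete.

```shell
# check_snapshot DIR
# Prints one line per expected file kind found anywhere under DIR and
# returns non-zero if any of them is missing or empty.
check_snapshot() {
    dir=$1
    status=0
    # The first three names come from the description above; '*context*'
    # is an assumed pattern for the per-process dump files.
    for pattern in global_snapshot_meta.data restart-appfile \
                   snapshot_meta.data '*context*'; do
        # Take the first regular file under DIR whose name matches.
        found=$(find "$dir" -type f -name "$pattern" | head -n 1)
        if [ -n "$found" ] && [ -s "$found" ]; then
            echo "ok: $pattern"
        else
            echo "missing or empty: $pattern"
            status=1
        fi
    done
    return $status
}
```

It would be called as check_snapshot /path/to/snapshot_dir, where the path is whatever global snapshot directory the checkpoint attempt created.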