On 08.03.2012 at 19:02, Linton, Tom wrote:
> We have a legacy application that runs fine on our cluster using Intel MPI. We ported it to OpenMPI so that we could use BLCR and it runs fine but checkpointing is not working properly:
> 1. when we checkpoint with more than 1 core, it executes with the error:
> mpirun noticed that process rank 1 with PID 8260 on node tscco28017 exited on signal 4 (Illegal instruction).
Were the application and Open MPI compiled on the same machine, and is the CPU type the same across all the nodes involved?
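An "Illegal instruction" (signal 4, SIGILL) usually means a binary contains instructions that the CPU it runs on does not support, so comparing CPU feature flags across the nodes is a quick first check. Below is a minimal sketch in Python, assuming Linux nodes where /proc/cpuinfo is readable. The function names and the node names "nodeA" and "nodeB" are made up for illustration, and the cpuinfo text is inlined here; in practice you would collect it from each node yourself.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_differences(per_node):
    """Map each node to the flags it lacks relative to the union over all nodes."""
    all_flags = set().union(*per_node.values())
    missing = {}
    for node, flags in per_node.items():
        lacking = sorted(all_flags - flags)
        if lacking:
            missing[node] = lacking
    return missing


if __name__ == "__main__":
    # Hypothetical sample data standing in for each node's /proc/cpuinfo.
    samples = {
        "nodeA": "processor : 0\nflags : fpu sse sse2 ssse3 sse4_1\n",
        "nodeB": "processor : 0\nflags : fpu sse sse2\n",
    }
    per_node = {node: cpu_flags(text) for node, text in samples.items()}
    for node, lacking in flag_differences(per_node).items():
        print(node, "lacks:", " ".join(lacking))
```

If a node shows up as lacking flags that the build machine has, the application or Open MPI may have been compiled with optimizations the other node cannot execute.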
> 2. checkpointing with 1 core works
> 3. we have a simple test program that exercises MPI with multiple cores and it checkpoints fine on multiple cores
> Can you suggest how to debug this problem?
> Some additional information:
> 1. I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
> 2. when I checkpoint it, I see that the checkpoint directories are created but the file global_snapshot_meta.data is not complete, there is no restart-appfile, the snapshot_meta.data files are not complete, and there are no dump files for the individual processes.
> 3. the command ompi-checkpoint doesn't return; I have to press Ctrl-C to kill it after checkpointing.
> 4. We are using Open MPI 1.4.4 with BLCR 0.8.4
> 5. I have attached ompi_info.txt