Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Illegal Instruction on Checkpoint with BLCR
From: Reuti (reuti_at_[hidden])
Date: 2012-03-08 13:18:39


Hi,

On 08.03.2012 at 19:02, Linton, Tom wrote:

> We have a legacy application that runs fine on our cluster using Intel MPI. We ported it to Open MPI so that we could use BLCR, and it runs fine, but checkpointing is not working properly:
>
> 1. when we checkpoint with more than 1 core, it executes with the error:
> mpirun noticed that process rank 1 with PID 8260 on node tscco28017 exited on signal 4 (Illegal instruction).

Were the application and Open MPI compiled on one and the same machine, and is the CPU type the same across the involved nodes?
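
The reason for asking: a binary built with instruction-set extensions (for example SSE4 or AVX) that one of the nodes does not support will typically die with signal 4 on that node. One possible way to check this is to compare the CPU feature flags of the nodes. The following is only a minimal sketch, assuming Linux x86 nodes and that the content of /proc/cpuinfo from each node has been saved to a file; the function names and the file layout are illustrative and not part of Open MPI or BLCR.

```python
# Sketch: compare CPU feature flags reported by several nodes.
# Assumption: each input is the text of /proc/cpuinfo from a Linux x86 node,
# which contains a line of the form "flags : fpu sse sse2 ...".

def parse_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo content."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No "flags" line found (e.g. a non-x86 node): report an empty set.
    return set()


def missing_flags(flags_by_node):
    """Map each node to the flags that some other node has but it lacks."""
    all_flags = set().union(*flags_by_node.values())
    return {node: all_flags - flags for node, flags in flags_by_node.items()}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python compare_flags.py node1.cpuinfo node2.cpuinfo
    flags_by_node = {}
    for path in sys.argv[1:]:
        with open(path) as fh:
            flags_by_node[path] = parse_flags(fh.read())

    for node, missing in sorted(missing_flags(flags_by_node).items()):
        print(node, "lacks:", " ".join(sorted(missing)) or "(nothing)")
```

If a node turns up lacking a flag that the build machine has, rebuilding the application and Open MPI for the least capable CPU in the machinefile would be the next thing to try. Identical flags on all nodes would not rule out other causes.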

-- Reuti

> 2. checkpointing with 1 core works
> 3. we have a simple test program that exercises MPI with multiple cores and it checkpoints fine on multiple cores
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> 1. I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
> 2. when I checkpoint it, I see that the checkpoint directories are created but the file “global_snapshot_meta.data” is not complete, there is no restart-appfile, the “snapshot_meta.data” files are not complete, and there are no dump files for the individual processes.
> 3. the command “ompi-checkpoint” doesn’t return; I have to control-C to kill it after checkpointing.
> 4. We are using Open MPI 1.4.4 with BLCR 0.8.4
> 5. I have attached “ompi_info.txt”
>
> Thanks
> Tom
>
> <ompi_info.txt>