I have an MPI program in which every process communicates with its neighbors. When SELF-checkpointing, every process writes to a separate file.
The problem is that sometimes, after a checkpoint is taken, the program does not continue. Running with more processes makes the problem worse.
With just 1 process (no communication), SELF-checkpointing works normally with no problems.
I have tried different '--mca btl' parameters (openib, tcp, sm, self), but the problem persists.
I would very much appreciate your help with this.