Dear Group,
I have an MPI program in which every process communicates with its neighbors. When SELF-checkpointing, every process writes to a separate file. The problem is that sometimes, after a checkpoint is taken, the program does not continue. The problem becomes more severe as the number of processes increases. With just 1 process (no communication), SELF-checkpointing works normally. I have tried different '--mca btl' parameters (openib, tcp, sm, self), but the problem persists. I would very much appreciate your support with this.
Kind regards,
Faisal