Also, I have attached another file containing my MCA options from 'ompi_info'.
In this toy code the problem is not too severe, so I used 48 or even 96 processes and many checkpoints to make it appear. In my actual code, perhaps because of the larger number of MPI calls, the problem sometimes occurs even within one node with only a few (2-5) processes.
Hope to hear from you.
> Date: Wed, 31 Aug 2011 11:35:55 -0400
> From: email@example.com
> To: firstname.lastname@example.org
> Subject: Re: [OMPI users] Question regarding SELF-checkpointing
>
> That seems like a bug to me.
>
> What version of Open MPI are you using? How have you setup the C/R
> functionality (what MCA options do you have set, what command line
> options are you using)? Can you send a small reproducing application
> that we can test against?
>
> That should help us focus in on the problem a bit.
>
> -- Josh
>
> On Wed, Aug 31, 2011 at 6:36 AM, Faisal Shahzad <email@example.com> wrote:
> > Dear Group,
> > I have a mpi-program in which every process is communicating with its
> > neighbors. When SELF-checkpointing, every process writes to a separate file.
> > Problem is that sometimes after making a checkpoint, program does not
> > continue again. Having more number of processes makes this problem severe.
> > With just 1 process (no communication), SEFL-checkpointing works normally
> > with no problem.
> > I have tried different '--mca btl' parameters (openib,tcp,sm,self), but
> > problem persists.
> > I would very much appreciate your support regarding it.
> > Kind regards,
> > Faisal
> > _______________________________________________
> > users mailing list
> > firstname.lastname@example.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> _______________________________________________
> users mailing list
> email@example.com
> http://www.open-mpi.org/mailman/listinfo.cgi/users