Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question regarding SELF-checkpointing
From: Faisal Shahzad (itsfaisi_at_[hidden])
Date: 2011-09-01 10:55:04


Hi,
My version of Open MPI is 1.5.3. I have attached a simple toy code, which is a modification of the example given at http://osl.iu.edu/research/ft/ompi-cr/examples.php. Mainly, I introduced some communication between processes, and every process writes its own separate checkpoint file. Here is my command line for running the job:

    mpirun -np 48 -npernode 6 -bind-to-core -bycore -am ft-enable-cr --mca crs_self_prefix my_personal ./SELF_CR 5000

I have also attached another file containing my MCA options from 'ompi_info'.

In this toy code the problem is not too severe, so I used 48 or even 96 processes and many checkpoints to make the problem appear. But in my actual code, perhaps due to more MPI calls, the problem sometimes occurs even within one node with only a few (2-5) processes.
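
For readers without the attachment, here is a minimal sketch of the "every process writes its own checkpoint file" pattern. It is not the attached toy code: the struct, the file naming scheme, and the helper names (ckpt_path, ckpt_write, ckpt_read) are all hypothetical, and the rank is passed in as a plain integer so the sketch compiles without MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-process state; the attached toy code's state is not shown here. */
typedef struct {
    int rank;
    int iteration;
    double value;
} ckpt_state;

/* Build a rank-specific path so that no two processes share a file. */
static int ckpt_path(char *buf, size_t len, const char *dir, int rank,
                     const char *suffix)
{
    int n = snprintf(buf, len, "%s/ckpt_rank_%04d%s", dir, rank, suffix);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write the state to a temporary file, then rename it into place, so an
 * interrupted write does not leave a truncated checkpoint under the final
 * name (POSIX rename replaces an existing file). */
static int ckpt_write(const char *dir, const ckpt_state *s)
{
    char tmp[512], dst[512];
    assert(dir != NULL && s != NULL);
    if (ckpt_path(tmp, sizeof tmp, dir, s->rank, ".tmp") != 0) return -1;
    if (ckpt_path(dst, sizeof dst, dir, s->rank, ".dat") != 0) return -1;

    FILE *fp = fopen(tmp, "wb");
    if (fp == NULL) return -1;
    int ok = (fwrite(s, sizeof *s, 1, fp) == 1);
    ok = (fflush(fp) == 0) && ok;
    ok = (fclose(fp) == 0) && ok;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, dst) == 0 ? 0 : -1;
}

/* Read the state back and check that the file belongs to this rank. */
static int ckpt_read(const char *dir, int rank, ckpt_state *s)
{
    char path[512];
    assert(dir != NULL && s != NULL);
    if (ckpt_path(path, sizeof path, dir, rank, ".dat") != 0) return -1;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return -1;
    size_t got = fread(s, sizeof *s, 1, fp);
    fclose(fp);
    return (got == 1 && s->rank == rank) ? 0 : -1;
}
```

In an MPI program the rank would come from MPI_Comm_rank, and a helper like ckpt_write would presumably be called from the application's SELF checkpoint callback, with ckpt_read called on restart. This sketch only covers the file handling and does not attempt to reproduce the hang.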

Hope to hear from you.

Kind regards,
Faisal Shahzad

> Date: Wed, 31 Aug 2011 11:35:55 -0400
> From: jjhursey_at_[hidden]
> To: users_at_[hidden]
> Subject: Re: [OMPI users] Question regarding SELF-checkpointing
>
> That seems like a bug to me.
>
> What version of Open MPI are you using? How have you set up the C/R
> functionality (what MCA options do you have set, what command line
> options are you using)? Can you send a small reproducing application
> that we can test against?
>
> That should help us focus in on the problem a bit.
>
> -- Josh
>
> On Wed, Aug 31, 2011 at 6:36 AM, Faisal Shahzad <itsfaisi_at_[hidden]> wrote:
> > Dear Group,
> > I have an MPI program in which every process communicates with its
> > neighbors. When SELF-checkpointing, every process writes to a separate file.
> > The problem is that sometimes, after making a checkpoint, the program does
> > not continue. Having more processes makes the problem more severe.
> > With just 1 process (no communication), SELF-checkpointing works normally
> > with no problem.
> > I have tried different '--mca btl' parameters (openib,tcp,sm,self), but
> > the problem persists.
> > I would very much appreciate your support regarding it.
> > Kind regards,
> > Faisal
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey