Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Change in communication between process (RMAPS)
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-01-07 07:29:33


Thanks Josh and Jeff.

I have already read "The Design and Implementation of Checkpoint/Restart
Process Fault Tolerance for Open MPI", and I am now going to read "A Composable
Runtime Recovery Policy Framework Supporting Resilient HPC Applications".
Then I will take a look at the code of the components that you mentioned, and
I will let you know how things are going.

Thanks a lot.

Hugo Meyer

2011/1/6 Joshua Hursey <jjhursey_at_[hidden]>

> So I can point you to some of the work that I did while at Indiana
> University to support process migration in Open MPI in a coordinated manner.
> This should introduce you to some of the internal pieces that fit together
> to provide this support.
>
> The transparent C/R in Open MPI webpage from IU is a good place to start:
> http://osl.iu.edu/research/ft/ompi-cr/index.php
>
> From there you will find a link to a couple of papers that should get you
> started. In particular "A Composable Runtime Recovery Policy Framework
> Supporting Resilient HPC Applications" discusses how the ORTE ErrMgr
> framework was used (initially) to provide process migration and automatic
> recovery. The actual code in the Open MPI trunk is slightly different.
> Instead of using different components of the ErrMgr framework (i.e., autor,
> crmig, stable) we just rolled it all into the existing components (i.e.,
> hnp, orted, app). But all the code can be found in those component
> directories.
>
> If you want a more general overview of the C/R system in Open MPI, I would
> start with the paper "The Design and Implementation of Checkpoint/Restart
> Process Fault Tolerance for Open MPI" which provides a high level view of
> the architecture (combined with the paper above you will have a fairly
> complete picture of the design). The C/R infrastructure currently only
> supports coordinated C/R, but was designed to be more extensible. So if you
> are looking into uncoordinated C/R techniques you may find that many of the
> C/R frameworks in Open MPI can be reused.
>
> That should get you started. Let us know if you have any further questions.
>
> -- Josh
>
> On Jan 6, 2011, at 3:19 PM, Hugo Meyer wrote:
>
> > Thanks for the reply and don't worry about the delay.
> >
> > Yeah, I suppose it wouldn't be easy :(.
> > But my final goal is what you mention: to stop one particular process
> > (previously checkpointed), migrate it to another place (node, core, slot,
> > etc.), and restart it there, but without making a coordinated checkpoint.
> > I just need to checkpoint processes in an uncoordinated way and move them.
> >
> > Where in the code can I see something about process migration, or
> > anything else that could guide me?
> >
> > Greetings.
> >
> > Hugo Meyer
> >
> > 2011/1/6 Jeff Squyres <jsquyres_at_[hidden]>
> > Sorry for the delay; you wrote while many of us were on vacation and
> > we're just now starting to catch up on past mails...
> >
> > I'm not entirely sure what you're trying to do. It sounds like you're
> > trying to replace one process with another. That's quite complicated; there
> > will be a lot of changes required in the code base to do this.
> >
> > - you'll need to notify the ORTE subsystem of the process change
> > - this notification will likely need to span multiple processes
> > - all MPI processes will need to quiesce their communications,
> >   disconnect, and reconnect
> > - ...and probably other things
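
The steps in the list above can be sketched as a toy model. This is a minimal Python sketch and not Open MPI code: `Peer`, `State`, and `replace_process` are hypothetical names invented for illustration, and the real ORTE notification and MPI quiesce logic is far more involved. It only shows the ordering constraint (notify everyone, quiesce, disconnect, reconnect with the updated location).

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    QUIESCED = "quiesced"
    DISCONNECTED = "disconnected"


class Peer:
    """Hypothetical stand-in for one MPI process (not an Open MPI type)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = State.RUNNING
        self.endpoints = {}  # logical rank -> location of that rank

    def quiesce(self):
        # A real implementation would drain in-flight messages here.
        self.state = State.QUIESCED

    def disconnect(self):
        if self.state is not State.QUIESCED:
            raise RuntimeError("must quiesce before disconnecting")
        self.endpoints.clear()
        self.state = State.DISCONNECTED

    def reconnect(self, endpoints):
        if self.state is not State.DISCONNECTED:
            raise RuntimeError("must disconnect before reconnecting")
        self.endpoints = dict(endpoints)
        self.state = State.RUNNING


def replace_process(peers, endpoints, rank, new_location):
    """Tell every peer that `rank` now lives at `new_location`."""
    updated = dict(endpoints)
    updated[rank] = new_location
    # Each phase completes on all peers before the next phase starts.
    for peer in peers:
        peer.quiesce()
    for peer in peers:
        peer.disconnect()
    for peer in peers:
        peer.reconnect(updated)
    return updated


peers = [Peer(1), Peer(2)]
new_map = replace_process(peers, {1: "nodeA", 2: "nodeB"}, 2, "nodeC")
```

After the call, every peer is back in the `RUNNING` state and sees rank 2 at the (hypothetical) location `nodeC`. Calling `disconnect()` on a peer that has not been quiesced raises an error, which is the point of the ordering.
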
> >
> > That being said, you might be able to leverage some of the work that's
> > been done with checkpoint/restart/migration. It's not entirely the same
> > thing that you're doing, but it's at least similar (quiesce networks,
> > [pretend to] move a process from location A to location B, etc.).
> >
> >
> >
> > On Dec 28, 2010, at 7:03 AM, Hugo Meyer wrote:
> >
> > > Hello to all.
> > >
> > > I'm new to the list; at least, this is the first time I have written.
> > >
> > > I'm working with Open MPI and would like to do a small experiment: I will
> > > try to pass one process off as another process.
> > >
> > > For example, assume that two processes, say rank 1 and rank 2, are
> > > communicating, and that there is also a process with rank 3. I would like
> > > rank 3 (it could be assumed that its node is marked as down in the initial
> > > hostfile) to take the place of rank 2, so that rank 1 still thinks it is
> > > communicating with rank 2 when in fact it is communicating with rank 3.
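
The substitution described above amounts to a level of indirection between the rank a sender addresses and the process that actually receives the message. The following is a minimal Python sketch under that assumption only; `RankTable`, `send`, and the `proc-N` names are hypothetical and do not correspond to `orte_job_map_t`, `orte_proc_t`, or any other Open MPI structure.

```python
class RankTable:
    """Maps logical ranks (what senders address) to physical processes."""

    def __init__(self, mapping):
        self._physical = dict(mapping)

    def resolve(self, logical_rank):
        return self._physical[logical_rank]

    def substitute(self, logical_rank, new_physical):
        """Point a logical rank at a different process; return the old one."""
        old = self._physical[logical_rank]
        self._physical[logical_rank] = new_physical
        return old


def send(table, mailboxes, dest_rank, payload):
    # The sender only ever names the logical rank.
    mailboxes[table.resolve(dest_rank)].append(payload)


table = RankTable({1: "proc-1", 2: "proc-2"})
mailboxes = {"proc-1": [], "proc-2": [], "proc-3": []}

send(table, mailboxes, 2, "hello")   # delivered to proc-2
table.substitute(2, "proc-3")        # proc-3 takes the place of rank 2
send(table, mailboxes, 2, "world")   # same call, now delivered to proc-3
```

The sender's code does not change across the substitution, which is the property being asked for. What the sketch leaves out is everything that makes this hard in practice: updating the table consistently in every process and handling messages already in flight.
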
> > >
> > > I guess I'll have to modify tables such as orte_job_map_t and
> > > orte_proc_t, but I wanted to know whether someone already has experience
> > > doing something similar and can at least guide me.
> > >
> > > The communication between processes would, in principle, be irrelevant,
> > > so I will not need to use checkpoints/restarts for now.
> > >
> > > Greetings
> > >
> > > Hugo Meyer
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
>
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>