Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-02-25 21:04:00


Just to add to Josh's comment: I am now working on recovering from HNP
failure as well, and should have that ready in a month or so.

On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:

>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> >
> > On Feb 25, 2010, at 11:16, Leonardo Fialho wrote:
> >
> >> Hmm... I'm really worried about this. I understand your choice, since
> >> it is really a good solution for fail/stop/restart behaviour, but
> >> looking at it from the fail/recovery side, can you envision some
> >> alternative that reconfigures the orteds on the fly?
> >
> > Leonardo,
> >
> > I don't see why the current code prohibits such behavior. However, I
> > don't yet see in this branch how the remaining daemons (and MPI
> > processes) reconstruct the communication topology, but this is just a
> > technicality.
>
> If you use the 'cm' routed component, then the reconstruction of the
> ORTE-level communication works for everything but the loss of the HNP.
> Neither Ralph nor I have looked at supporting other routed components at
> this time. I know your group at UTK has done some work in this area, so we
> wanted to tackle additional support for more scalable routed components as
> a second step, hopefully in collaboration with your group.
>
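For concreteness: routed components are selected through Open MPI's MCA
parameter system, so a run that forces the 'cm' component would look roughly
like the line below. The application name and process count are placeholders,
and any additional parameters this branch may need to enable its recovery
logic are not shown here.

  mpirun -mca routed cm -np 16 ./my_app
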
> As for the MPI layer, I can't say much at this point about how that works.
> This RFC only handles recovery of the ORTE layer; MPI-layer recovery is a
> second step and involves much longer discussions. I have a solution for a
> certain type of MPI application, and it sounds like you have something that
> can be applied more generally.
>
> >
> > Anyway, this is the code that UT will bring in. All our work focuses on
> > keeping the existing environment up and running instead of restarting
> > everything. The orteds will auto-heal (i.e., reshape the underlying
> > topology, recreate the connections, and so on), and the fault is
> > propagated to the MPI layer, which will decide what to do next.
>
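To make that last point concrete: whether the library can really continue
after a process failure is implementation-dependent, but the application-side
decision point can be sketched with nothing beyond standard MPI error
handling. The example below is illustrative only and is not code from either
branch; it installs MPI_ERRORS_RETURN so that communication failures come
back as return codes instead of aborting the job, and the application then
chooses its own recovery action.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc, len;
    char msg[32] = "ping";
    char err[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors to the caller instead of aborting the whole job, so
     * the application gets to decide what to do when a peer has failed. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* A hypothetical exchange with a peer that may have died. */
    if (rank == 0) {
        rc = MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            MPI_Error_string(rc, err, &len);
            fprintf(stderr, "rank 0: send failed (%s)\n", err);
            /* application-specific recovery decision goes here */
        }
    } else if (rank == 1) {
        rc = MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "rank 1: recv failed\n");
            /* application-specific recovery decision goes here */
        }
    }

    MPI_Finalize();
    return 0;
}
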
> Per my previous suggestion, would it be useful to chat on the phone early
> next week about our various strategies?
>
> -- Josh
>
> > george.