Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] carto vs. hwloc
From: Kenneth Lloyd (kenneth.lloyd_at_[hidden])
Date: 2009-12-16 07:00:41

> -----Original Message-----
> From: devel-bounces_at_[hidden]
> [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres
> Sent: Tuesday, December 15, 2009 6:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] carto vs. hwloc
>
> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>
> > It probably should be done at a lower level, but it begs a different
> > question. For example, I've created the capability in the new cluster
> > manager to detect interfaces that are lost, ride through the problem
> > by moving affected procs to other nodes (reconnecting ORTE-level
> > comm), and move procs back if/when nodes reappear. So someone can
> > remove a node "on-the-fly" and replace that hardware with another
> > node without having to stop and restart the job, etc. A lot of that
> > infrastructure is now down inside ORTE, though a few key pieces
> > remain in the ORCM code base (and most likely will stay there).
> >
> > Works great - unless it is an MPI job. If we can figure out a way for
> > the MPI procs to (a) be properly restarted on the "new" node, and (b)
> > update the BTL connection info on the other MPI procs in the job,
> > then we would be good to go...
> >
> > Trivial problem, I am sure :-)
>
> ...actually, the groundwork is there with Josh's work, isn't it? I
> think the real issue is handling un-graceful BTL failures properly.
> I'm guessing that's the biggest piece that isn't done...?

Precisely. But why handle this in the BTL rather than at the PTL, which is
where these issues rightly belong, IMO?

Ken Lloyd

> --
> Jeff Squyres
> jsquyres_at_[hidden]