Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] carto vs. hwloc
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2009-12-16 08:52:38


Currently, I am working on process migration and automatic recovery based on checkpoint/restart. WRT the PML stack, this works by rewiring the BTLs after the migrated/recovered MPI process(es) are restarted. There is a fair amount of work in getting this right with respect to both the runtime and the OMPI layer (particularly the modex). For automatic recovery with C/R we will, at first, require the restart of all processes in the job [for consistency]. For migration, only the processes being moved will need to be restarted; all others may be blocked.
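
To make that rewiring sequence concrete, here is a minimal sketch in C of the flow just described. None of the function names below are real Open MPI internals; they are invented stand-ins for the steps involved: after restart, each process republishes its contact information through a modex-style exchange and then re-establishes its BTL endpoints to every peer.

/* Hypothetical sketch: these are NOT Open MPI internals; every name is
 * invented to illustrate the restart-time rewiring flow described above. */
#include <stdio.h>

#define NUM_PROCS 4

/* Stub: publish this process's fresh BTL contact info (modex "put"). */
static void modex_publish_contact_info(int my_rank)
{
    printf("rank %d: publishing new BTL contact info\n", my_rank);
}

/* Stub: fetch a peer's republished contact info (modex "get"). */
static void modex_fetch_contact_info(int peer)
{
    printf("fetching contact info for peer %d\n", peer);
}

/* Stub: tear down an endpoint whose addresses predate the restart. */
static void btl_close_stale_endpoint(int peer)
{
    printf("closing stale endpoint to peer %d\n", peer);
}

/* Stub: open a new endpoint using the freshly exchanged info. */
static void btl_open_endpoint(int peer)
{
    printf("opening new endpoint to peer %d\n", peer);
}

int main(void)
{
    int my_rank = 0;

    /* 1. After restart, the old addresses are invalid; republish. */
    modex_publish_contact_info(my_rank);

    /* 2. Rewire the connection to every peer in the job. */
    for (int peer = 0; peer < NUM_PROCS; peer++) {
        if (peer == my_rank) {
            continue;
        }
        btl_close_stale_endpoint(peer);
        modex_fetch_contact_info(peer);
        btl_open_endpoint(peer);
    }

    printf("rank %d: PML/BTL stack rewired\n", my_rank);
    return 0;
}

The real work, of course, is in the modex and the per-BTL connection logic that these stubs gloss over.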

I think what you are looking for is the ability to lose a process and replace it without restarting all of the other processes. That would require a bit more work beyond what I am currently working on, since you would need to flush the PML/BML/BTL stack of latent messages, etc. The message-logging work by UTK should handle this anyway (if they use uncoordinated C/R plus message logging), but they will have to fill in the details on that project.
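
As a rough illustration of that single-process replacement flow, here is a sketch along the same lines, again using invented stand-in functions rather than real Open MPI calls: the surviving ranks quiesce and flush traffic aimed at the failed rank, only the failed rank is restarted from its checkpoint, and just the affected endpoints are rebuilt.

/* Hypothetical sketch: invented names only, not Open MPI code. */
#include <stdio.h>

/* Stub: stop posting new sends/receives that involve the failed rank. */
static void pml_quiesce_traffic_to(int failed_rank)
{
    printf("quiescing traffic to rank %d\n", failed_rank);
}

/* Stub: drain or cancel latent messages destined for the failed rank. */
static void btl_flush_pending_messages(int failed_rank)
{
    printf("flushing pending messages for rank %d\n", failed_rank);
}

/* Stub: restart just the failed process from its last checkpoint. */
static void restart_from_checkpoint(int failed_rank)
{
    printf("restarting rank %d from checkpoint\n", failed_rank);
}

/* Stub: rebuild the single endpoint to the replaced process. */
static void btl_rewire_endpoint(int my_rank, int failed_rank)
{
    printf("rank %d: rewiring endpoint to rank %d\n", my_rank, failed_rank);
}

int main(void)
{
    int my_rank = 0;
    int failed_rank = 2;

    /* Surviving processes block new traffic and flush what is in flight. */
    pml_quiesce_traffic_to(failed_rank);
    btl_flush_pending_messages(failed_rank);

    /* Only the lost process is restarted; everyone else stays up. */
    if (my_rank == failed_rank) {
        restart_from_checkpoint(failed_rank);
    }

    /* Each survivor rebuilds just the one affected connection. */
    btl_rewire_endpoint(my_rank, failed_rank);

    printf("rank %d: recovery complete\n", my_rank);
    return 0;
}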

-- Josh

On Dec 16, 2009, at 1:32 AM, George Bosilca wrote:

> As far as I know, what Josh did is slightly different. In the case of a complete restart (where all processes are restarted from a checkpoint), he sets up and rewires a new set of BTLs.
>
> However, we do happen to have some code in one of the UTK projects to rewire the MPI processes in case of failure(s). I'll have to talk with the team here to see if, at this point, there is something we can contribute on this matter.
>
> george.
>
>> On Dec 15, 2009, at 21:08, Ralph Castain wrote:
>
>>
>> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>>
>>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>>
>>>> It probably should be done at a lower level, but it raises a different question. For example, I've created the capability in the new cluster manager to detect interfaces that are lost, ride through the problem by moving the affected procs to other nodes (reconnecting ORTE-level comm), and move procs back if/when nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job, etc. A lot of that infrastructure now lives down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).
>>>>
>>>> Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...
>>>>
>>>> Trivial problem, I am sure :-)
>>>
>>> ...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling un-graceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?
>>
>> I think so... not sure how to update the BTLs with the new info, but perhaps Josh has already solved that problem.
>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]