Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] allow job to survive process death
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-01-27 09:56:32


On Jan 27, 2011, at 7:47 AM, Reuti wrote:

> On Jan 27, 2011, at 15:23, Joshua Hursey wrote:
>
>> The current version of Open MPI does not support continued operation of an MPI application after process failure within a job. If a process dies, so will the MPI job. Note that this is true of many MPI implementations out there at the moment.
>>
>> At Oak Ridge National Laboratory, we are working on a version of Open MPI that will be able to run through process failure, if the application wishes to do so. The semantics and interfaces needed to support this functionality are being actively developed by the MPI Forum's Fault Tolerance Working Group, and can be found at the wiki page below:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>
> I had a look at this document, but what is really covered - does the application have to react to the notification of a failed rank and act appropriately on its own?
>
> Having a true ability to survive a dying process (i.e. rank) which might already have been computing for hours would mean having some kind of "rank RAID" or "rank Parchive". E.g. start 12 ranks when you need 10 - whichever 2 ranks fail, your job will still finish on time.

We have the run-time part of this done - of course, figuring out the MPI part of the problem is harder ;-)
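
To make the "start 12 ranks when you need 10" idea concrete, here is a rough, plain-Python model of the bookkeeping only. It is not Open MPI code and not how the run-time mentioned above works; the function name `assign_slots` and its parameters are hypothetical. It assumes ranks 0..n_needed-1 own the work slots and the remaining ranks are idle spares, and it ignores the hard part (how a spare obtains the state of a rank that has been computing for hours).

```python
def assign_slots(n_needed, n_started, failed):
    """Map each logical work slot to a live physical rank.

    Hypothetical model: slots 0..n_needed-1 start on the ranks with the
    same number; ranks n_needed..n_started-1 are spares.
    Returns a dict {slot: rank}, or None if too many ranks have failed.
    """
    failed = set(failed)
    live = [r for r in range(n_started) if r not in failed]
    if len(live) < n_needed:
        return None  # more failures than spares: the job cannot finish

    # Spares that are still alive, in rank order.
    spares = [r for r in live if r >= n_needed]

    mapping = {}
    for slot in range(n_needed):
        if slot not in failed:
            mapping[slot] = slot           # original owner is still alive
        else:
            mapping[slot] = spares.pop(0)  # a spare takes over the slot
    return mapping


if __name__ == "__main__":
    # 12 ranks started, 10 needed, ranks 3 and 7 have died.
    print(assign_slots(10, 12, [3, 7]))
    # A third failure exhausts the spares.
    print(assign_slots(10, 12, [3, 7, 9]))
```

With two spares, any two failures still leave ten live ranks, which is the property the "rank RAID" analogy asks for; a third failure makes the function return None.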

>
> -- Reuti
>
>
>> This work is ongoing, but once we have a stable prototype we will assess how to bring it back to the mainline Open MPI trunk. For the moment, there is no public release of this branch, but once there is, we will be sure to announce it on the appropriate Open MPI mailing list for folks to start playing around with it.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 9:11 AM, Kirk Stako wrote:
>>
>>> Hi,
>>>
>>> I was wondering what support Open MPI has for allowing a job to
>>> continue running when one or more processes in the job die
>>> unexpectedly? Is there a special mpirun flag for this? Any other ways?
>>>
>>> It seems obvious that collectives will fail once a process dies, but
>>> would it be possible to create a new group (if you knew which ranks
>>> are dead) that excludes the dead processes - then turn this group into
>>> a working communicator?
>>>
>>> Thanks,
>>> Kirk
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
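
Kirk's question about building a group that excludes the dead ranks can be illustrated with a small plain-Python model of the rank renumbering. In standard MPI the corresponding calls would be MPI_Comm_group, MPI_Group_excl and MPI_Comm_create; the sketch below does not call MPI at all, and the function name `exclude_ranks` is hypothetical. It assumes the list of dead ranks is already known. As Josh notes above, the Open MPI version discussed in this thread does not keep the job alive after a failure, so this shows only the intended mapping, not a working recovery path.

```python
def exclude_ranks(world_size, dead):
    """Model the rank mapping of a group that excludes dead ranks.

    Survivors keep their relative order and are renumbered from 0,
    which is the ordering MPI_Group_excl defines for the new group.
    Returns (survivors, old_to_new).
    """
    dead = set(dead)
    invalid = sorted(r for r in dead if not 0 <= r < world_size)
    if invalid:
        raise ValueError(f"ranks outside the world group: {invalid}")

    # Old ranks that remain, in their original order.
    survivors = [r for r in range(world_size) if r not in dead]
    # New rank of each survivor is its position in that list.
    old_to_new = {old: new for new, old in enumerate(survivors)}
    return survivors, old_to_new


if __name__ == "__main__":
    survivors, old_to_new = exclude_ranks(6, [1, 4])
    print(survivors)    # old ranks that remain
    print(old_to_new)   # old rank -> rank in the new group
```

Any per-rank data layout (for example, a block decomposition indexed by rank) would have to be recomputed against the new numbering, since every survivor above a dead rank shifts down.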