Actually, I honestly don't remember even having that discussion. In looking at it, this would be relatively easy to implement if someone really wanted it.
Only issue: user would bear full responsibility for OMPI not cleaning up failed jobs since we wouldn't terminate upon seeing a proc fail. Definitely not something you'd want to do in production!
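To make the application-side burden concrete, here is a minimal sketch in plain C of the bookkeeping a ping-pong program needs once the launcher no longer kills the job: track which peers are alive and stop pinging the ones that fail. It is not the pastebin program from the original post and it makes no MPI calls. The names `ping_fn`, `ping_round`, and `simulated_ping` are hypothetical, invented for this illustration. In a real MPI program the ping hook would presumably wrap `MPI_Send`/`MPI_Recv` on a communicator whose error handler was set to `MPI_ERRORS_RETURN` via `MPI_Comm_set_errhandler`; whether a call involving a dead process actually returns an error, rather than hanging, depends on the MPI implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport hook: returns 0 if the peer answered the ping,
 * nonzero otherwise. Assumption: a real program would implement this with
 * MPI_Send/MPI_Recv and check their return codes. */
typedef int (*ping_fn)(int peer, void *ctx);

/* Ping every peer still marked alive. A peer whose ping fails is marked
 * dead so that later rounds skip it. Returns the number of pongs received. */
int ping_round(bool *alive, int npeers, ping_fn ping, void *ctx)
{
    assert(alive != NULL && ping != NULL && npeers >= 0);
    int pongs = 0;
    for (int peer = 0; peer < npeers; ++peer) {
        if (!alive[peer])
            continue;               /* already known dead: do not retry */
        if (ping(peer, ctx) == 0)
            ++pongs;
        else
            alive[peer] = false;    /* remember the failure */
    }
    return pongs;
}

/* Stand-in for the network, used only for this illustration: ctx points
 * at an array of flags saying which simulated peers are still up. */
int simulated_ping(int peer, void *ctx)
{
    const bool *up = ctx;
    return up[peer] ? 0 : 1;
}
```

Calling `ping_round` repeatedly with the same `alive` array gives the behavior described in the original question: the loop keeps running on the live peers and simply stops expecting a pong from a dead one. Nothing here cleans up the remaining processes if the job itself needs to end, which is the responsibility mentioned above.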
On Sep 16, 2011, at 6:55 AM, Josh Hursey wrote:
> Though I do not share George's pessimism about acceptance by the Open
> MPI community, it has been slightly difficult to add such a
> non-standard feature to the code base for various reasons.
> At ORNL, I have been developing a prototype for the MPI Forum Fault
> Tolerance Working Group [1] of the Run-Through Stabilization proposal
> [2,3]. This would allow the application to continue running and using
> MPI functions even though processes fail during execution. We have
> been doing some limited alpha releases for some friendly application
> developers desiring to play with the prototype for a while now. We are
> hoping to do a more public beta release in the coming months. I'll
> likely post a message to the ompi-devel list once it is ready.
> -- Josh
> [1] http://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
> [2] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> [3] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2
> On Thu, Sep 15, 2011 at 4:14 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>> The Open MPI community did consider such an option, but deemed it uninteresting. However, we (the UTK team) have a patched version supporting several fault-tolerant modes, including the one you described in your email. If you are interested, please contact me directly.
>> On Sep 12, 2011, at 20:43 , Ralph Castain wrote:
>>> We don't have anything similar in OMPI. There are fault tolerance modes, but not like the one you describe.
>>> On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
>>>> I have implemented a simple fault tolerant ping pong C program with MPI, here: http://pastebin.com/7mtmQH2q
>>>> MPICH2 offers a parameter with mpiexec:
>>>> $ mpiexec -disable-auto-cleanup
>>>> .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
>>>> It is fault tolerant in the sense that when I ssh to one of the nodes in the hosts file and kill the relevant process, the MPI job is not terminated. The ping simply gets no pong from the dead node, and the ping-pong keeps running on the remaining live nodes.
>>>> Is such a feature available for Open MPI, either via mpiexec or some other means?
>>>> Rob Stewart
>>>> users mailing list
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory