Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpiexec option for node failure
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-09-16 08:55:45


Though I do not share George's pessimism about acceptance by the Open
MPI community, it has been somewhat difficult to add such a
non-standard feature to the code base for various reasons.

At ORNL, I have been developing a prototype for the MPI Forum Fault
Tolerance Working Group [1] of the Run-Through Stabilization proposal
[2,3]. This would allow the application to continue running and using
MPI functions even though processes fail during execution. For a while
now we have been making limited alpha releases available to a few
friendly application developers who want to experiment with the
prototype. We hope to do a more public beta release in the coming
months. I'll likely post a message to the ompi-devel list once it is
ready.

-- Josh

[1] http://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
[2] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
[3] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2

On Thu, Sep 15, 2011 at 4:14 PM, George Bosilca <bosilca_at_[hidden]> wrote:
> Rob,
>
> The Open MPI community did consider such an option, but deemed it uninteresting. However, we (the UTK team) have a patched version supporting several fault-tolerant modes, including the one you described in your email. If you are interested, please contact me directly.
>
>  Thanks,
>    george.
>
>
> On Sep 12, 2011, at 20:43 , Ralph Castain wrote:
>
>> We don't have anything similar in OMPI. There are fault tolerance modes, but not like the one you describe.
>>
>> On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
>>
>>> Hi,
>>>
>>> I have implemented a simple fault tolerant ping pong C program with MPI, here: http://pastebin.com/7mtmQH2q
>>>
>>> MPICH2 offers a parameter with mpiexec:
>>> $ mpiexec -disable-auto-cleanup
>>>
>>> .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
>>>
>>> It is fault tolerant in the sense that, when I ssh to one of the nodes in the hosts file and kill the relevant process, the MPI job is not terminated. The ping simply does not prompt a pong from the dead node, and the ping-pong runs forever on the remaining live nodes.
>>>
>>> Is such a feature available for Open MPI, either via mpiexec or some other means?
>>>
>>>
>>> --
>>> Rob Stewart
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey