Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpiexec option for node failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-09-16 10:28:19


Actually, I honestly don't remember even having that discussion. In looking at it, this would be relatively easy to implement if someone really wanted it.

The only issue: the user would bear full responsibility for cleaning up failed jobs, since OMPI wouldn't terminate the job upon seeing a proc fail. Definitely not something you'd want to do in production!
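
To illustrate that responsibility: if mpiexec did leave a job running after a process died, the application itself would have to notice the missing peer and stop talking to it. Below is a minimal C sketch of that bookkeeping; it is not Open MPI code. The names ping_fn, ping_round and fake_ping are hypothetical, and the transport is hidden behind a callback so the logic can be read without an MPI installation. In a real program the callback would presumably wrap a send and a receive with a timeout, with the MPI_ERRORS_RETURN error handler set on the communicator in place of the default MPI_ERRORS_ARE_FATAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport hook: returns true if `rank` answered the ping.
 * Assumption: a real version would wrap MPI send/receive calls plus a
 * timeout; here it is only an abstract callback. */
typedef bool (*ping_fn)(int rank, void *ctx);

/* Ping every peer still believed alive and mark non-responders as dead.
 * Returns the number of peers that answered in this round. */
int ping_round(bool *alive, int npeers, ping_fn ping, void *ctx)
{
    assert(alive != NULL && ping != NULL && npeers >= 0);

    int answered = 0;
    for (int rank = 0; rank < npeers; ++rank) {
        if (!alive[rank])
            continue;              /* never contact a known-dead peer again */
        if (ping(rank, ctx))
            ++answered;            /* pong received */
        else
            alive[rank] = false;   /* no pong: remember the failure */
    }
    return answered;
}

/* Fake transport for demonstration only: ctx points to an array of
 * flags saying which peers are really up. */
bool fake_ping(int rank, void *ctx)
{
    const bool *up = ctx;
    return up[rank];
}
```

Calling ping_round repeatedly with the same alive array keeps the loop going on the surviving peers, which is the behaviour Rob describes for his ping-pong under MPICH2. It does nothing about cleaning up orphaned processes; that part would still be left to the user.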

On Sep 16, 2011, at 6:55 AM, Josh Hursey wrote:

> Though I do not share George's pessimism about acceptance by the Open
> MPI community, it has been somewhat difficult to add such a
> non-standard feature to the code base, for various reasons.
>
> At ORNL, I have been developing a prototype for the MPI Forum Fault
> Tolerance Working Group [1] of the Run-Through Stabilization proposal
> [2,3]. This would allow the application to continue running and using
> MPI functions even though processes fail during execution. We have
> been doing some limited alpha releases for some friendly application
> developers desiring to play with the prototype for a while now. We are
> hoping to do a more public beta release in the coming months. I'll
> likely post a message to the ompi-devel list once it is ready.
>
> -- Josh
>
> [1] http://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
> [2] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> [3] See PDF on https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2
>
> On Thu, Sep 15, 2011 at 4:14 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>> Rob,
>>
>> The Open MPI community did consider such an option, but deemed it uninteresting. However, we (the UTK team) have a patched version supporting several fault-tolerant modes, including the one you described in your email. If you are interested, please contact me directly.
>>
>> Thanks,
>> george.
>>
>>
>> On Sep 12, 2011, at 20:43 , Ralph Castain wrote:
>>
>>> We don't have anything similar in OMPI. There are fault tolerance modes, but not like the one you describe.
>>>
>>> On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
>>>
>>>> Hi,
>>>>
>>>> I have implemented a simple fault tolerant ping pong C program with MPI, here: http://pastebin.com/7mtmQH2q
>>>>
>>>> MPICH2 offers a parameter with mpiexec:
>>>> $ mpiexec -disable-auto-cleanup
>>>>
>>>> .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
>>>>
>>>> It is fault tolerant in the sense that, when I ssh to one of the nodes in the hosts file and kill the relevant process, the MPI job is not terminated. The ping simply gets no pong from the dead node, but the ping-pong runs forever on the remaining live nodes.
>>>>
>>>> Is such a feature available for Open MPI, either via mpiexec or some other means?
>>>>
>>>>
>>>> --
>>>> Rob Stewart
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>