Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running on crashing nodes
From: Andrei Fokau (andrei.fokau_at_[hidden])
Date: 2010-09-24 03:37:05


Ralph, could you tell us when this functionality will be available in the
stable version? A rough estimate will be fine.

On Fri, Sep 24, 2010 at 01:24, Ralph Castain <rhc_at_[hidden]> wrote:

> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is no current ability to "ride-thru"
> such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is
> in the current developer's code trunk, and more is coming, but I wouldn't
> consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works.
> It is recovery of the MPI job itself (collectives, lost messages, etc.) that
> remains to be completed.
>
>
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <andrei.fokau_at_[hidden]> wrote:
>
>> Dear users,
>>
>> Our cluster has a number of nodes that are likely to crash, so
>> calculations often stop because a single node goes down.
>> Do you know whether it is possible to exclude crashed nodes at run time
>> when running with Open MPI? I am asking whether it is possible in principle
>> to program such behavior. Does Open MPI allow such dynamic checking? The
>> scheme I have in mind is the following:
>>
>> 1. A code starts its tasks via mpirun on several nodes
>> 2. At some point one node goes down
>> 3. The code realizes that the node is down (its results are lost) and
>> excludes it from the list of nodes to run its tasks on
>> 4. Later, the user restarts the crashed node
>> 5. The code notices that the node is up again and puts it back on the
>> list of active nodes
>>
>>
>> Regards,
>> Andrei
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
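
The five-step scheme in the original question can be approximated outside of
MPI itself by a driver script that launches each task as its own short job and
keeps its own list of usable nodes. The following is a minimal Python sketch
under that assumption, not something Open MPI provides. NodePool, run_all,
run_task and is_alive are hypothetical names, and run_task is assumed to raise
an exception when the node it was given fails (for example, when a per-task
"mpirun --host <node> ..." invocation exits with an error).

```python
import time
from collections import deque


class NodePool:
    """Tracks which nodes are currently usable for new tasks (hypothetical helper)."""

    def __init__(self, nodes):
        self.active = list(nodes)   # nodes that may receive tasks
        self.down = []              # nodes excluded after a failure

    def mark_down(self, node):
        """Step 3: exclude a node that failed while running a task."""
        if node in self.active:
            self.active.remove(node)
            self.down.append(node)

    def recheck(self, is_alive):
        """Step 5: put nodes that respond again back on the active list."""
        for node in list(self.down):
            if is_alive(node):
                self.down.remove(node)
                self.active.append(node)


def run_all(tasks, pool, run_task, is_alive, max_rounds=100, poll_seconds=0.0):
    """Run every task serially, re-queueing a task when its node fails.

    run_task(node, task) is assumed to return a result or raise on failure.
    is_alive(node) is assumed to return True once a node is reachable again.
    Returns a dict mapping each completed task to its result; tasks still
    pending after max_rounds are simply left out.
    """
    pending = deque(tasks)
    results = {}
    for _ in range(max_rounds):
        if not pending:
            break
        pool.recheck(is_alive)
        if not pool.active:
            time.sleep(poll_seconds)    # nothing usable yet, wait and re-probe
            continue
        for node in list(pool.active):
            if not pending:
                break
            task = pending.popleft()
            try:
                results[task] = run_task(node, task)
            except Exception:
                pool.mark_down(node)    # the node's result is lost
                pending.append(task)    # retry the task on another node
    return results
```

This only covers independent tasks that can be restarted from scratch. It does
not recover a single running MPI job whose processes span the crashed node,
which is the part described above as still incomplete.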