In a word, no. If a node crashes, Open MPI will abort the currently running
job if it has processes on that node. There is currently no ability to
"ride through" such an event.

That said, there is work being done to support "ride-through". Most of that
is in the current developer trunk, and more is coming, but I wouldn't
consider it production-quality just yet.

Specifically, the code that does what you describe below is done and works.
It is recovery of the MPI job itself (collectives, lost messages, etc.) that
remains to be completed.
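
For what it's worth, the bookkeeping side of the scheme quoted below (steps
3 and 5) can be sketched at the application level, outside of MPI. This is
only an illustration under assumptions, not anything Open MPI provides:
NodePool and the is_alive probe are hypothetical names, and the probe is
whatever check you trust (ping, ssh, your scheduler). Since a running job
still aborts when a node dies, the idea would be to launch each batch of
work as its own mpirun job on the nodes that are currently up, for example
by passing the list to mpirun's --host option.

```python
from typing import Callable, Iterable, List, Set


class NodePool:
    """Tracks which nodes are currently usable for launching new tasks.

    Hypothetical helper for illustration; not part of Open MPI.
    """

    def __init__(self, nodes: Iterable[str], is_alive: Callable[[str], bool]):
        self._all: List[str] = list(nodes)
        self._is_alive = is_alive      # caller-supplied health probe (assumption)
        self._down: Set[str] = set()

    def refresh(self) -> None:
        """Re-probe every node: exclude dead ones, re-admit recovered ones."""
        for node in self._all:
            if self._is_alive(node):
                self._down.discard(node)   # step 5: node is back, use it again
            else:
                self._down.add(node)       # step 3: node is down, exclude it

    def active(self) -> List[str]:
        """Nodes that passed the most recent probe, in their original order."""
        return [n for n in self._all if n not in self._down]


if __name__ == "__main__":
    # Fake cluster state standing in for a real probe.
    status = {"node01": True, "node02": True, "node03": True}
    pool = NodePool(list(status), lambda n: status[n])

    status["node02"] = False           # a node crashes
    pool.refresh()
    print(",".join(pool.active()))     # host list for the next job

    status["node02"] = True            # the node is restarted
    pool.refresh()
    print(",".join(pool.active()))
```

You would call refresh() before each launch, so a crashed node only costs
you the batch that was running on it.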
On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau wrote:
> Dear users,
> Our cluster has a number of nodes which have a high probability of
> crashing, so it happens quite often that calculations stop because one
> node goes down. Do you know whether it is possible to block the crashed
> nodes at run-time when running with OpenMPI? I am asking whether it is
> possible in principle to program such behavior. Does OpenMPI allow such
> dynamic checking? The scheme I am curious about is the following:
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node goes down
> 3. The code realizes that the node is down (the results are lost) and
> excludes it from the list of nodes to run its tasks on
> 4. At a later moment the user restarts the crashed node
> 5. The code notices that the node is up again, and puts it back on the list
> of active nodes