In a word, no. If a node crashes, OMPI will abort the currently-running job if it had processes on that node. There is no current ability to "ride-thru" such an event.
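Since the job aborts as a whole, the only workaround is outside Open MPI: a wrapper that relaunches the entire job on whichever nodes currently respond. Below is a minimal sketch of that idea, not an Open MPI feature. The names `is_reachable`, `live_nodes`, `build_command`, and `run_with_restarts` are hypothetical, the ping-based health check assumes Linux `ping` options, and the launch uses one process per node for simplicity. A restarted job begins from scratch unless the application does its own checkpointing.

```python
import subprocess


def is_reachable(host, timeout_s=2):
    """Hypothetical health check: a single ping (Linux-style options assumed).

    Replace with whatever your site uses (ssh probe, scheduler query, ...).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def live_nodes(all_nodes, probe=is_reachable):
    """Return the nodes that currently pass the probe, preserving order."""
    return [node for node in all_nodes if probe(node)]


def build_command(nodes, program):
    """Build an mpirun command restricted to `nodes`, one process per node."""
    return ["mpirun", "-np", str(len(nodes)), "--host", ",".join(nodes)] + list(program)


def run_with_restarts(all_nodes, program, max_attempts=5,
                      probe=is_reachable, launch=None):
    """Relaunch the whole job until it exits cleanly or attempts run out.

    Every attempt re-probes the full node list, so a node that crashed is
    skipped, and a node that was repaired in the meantime is used again.
    Returns (attempt_number, nodes_used) for the successful run.
    """
    if launch is None:
        # Default launcher: run the command and report its exit status.
        launch = lambda cmd: subprocess.run(cmd).returncode

    for attempt in range(1, max_attempts + 1):
        nodes = live_nodes(all_nodes, probe)
        if not nodes:
            raise RuntimeError("no reachable nodes")
        if launch(build_command(nodes, program)) == 0:
            return attempt, nodes
    raise RuntimeError("job failed on every attempt")
```

Note that this only covers the scheme from the question at restart granularity: a node is dropped or re-added between launches, never while a job is running.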
Dear users,

Our cluster has a number of nodes that are very likely to crash, so calculations quite often stop because one node goes down. Do you know whether it is possible to exclude crashed nodes at run time when running with OpenMPI? I am asking whether such behavior can be programmed in principle. Does OpenMPI allow this kind of dynamic checking? The scheme I am curious about is the following:

1. A code starts its tasks via mpirun on several nodes.
2. At some moment one node goes down.
3. The code realizes that the node is down (the results are lost) and excludes it from the list of nodes to run its tasks on.
4. At a later moment the user restarts the crashed node.
5. The code notices that the node is up again and puts it back on the list of active nodes.

Regards,
Andrei