I’m having trouble where one of my client nodes crashes while I have an MPI job on it. When this happens, the mpirun process on the head node never returns. I can kill it with a SIGINT (ctrl-c) and it still cleans up its child processes on the remaining healthy client nodes but I don’t get any of the results from those client processes.
Does anybody have any ideas about how I could create a more fault-tolerant MPI job? In an ideal world, my head node would report that it lost the connection to a client node and keep going as if that client never existed (so that the results of the job are what they would have been if the crashed-node wasn’t part of the job to begin with).