I¹m having trouble where one of my client nodes crashes while I have an MPI
job on it. When this happens, the mpirun process on the head node never
returns. I can kill it with a SIGINT (ctrl-c) and it still cleans up its
child processes on the remaining healthy client nodes but I don¹t get any of
the results from those client processes.
Does anybody have any ideas about how I could create a more fault-tolerant
MPI job? In an ideal world, my head node would report that it lost the
connection to a client node and keep going as if that client never existed
(so that the results of the job are what they would have been if the
crashed-node wasn¹t part of the job to begin with).