Open MPI User's Mailing List Archives

Subject: [OMPI users] Disconnections
From: Daniel Miles (daniel.miles_at_[hidden])
Date: 2009-07-01 17:10:01

Hi, everybody.

I'm having trouble when one of my client nodes crashes while an MPI job is
running on it. When this happens, the mpirun process on the head node never
returns. I can kill it with a SIGINT (Ctrl-C), and it still cleans up its
child processes on the remaining healthy client nodes, but I don't get any of
the results from those client processes.

Does anybody have any ideas about how I could create a more fault-tolerant
MPI job? In an ideal world, my head node would report that it lost the
connection to a client node and keep going as if that client had never
existed (so that the results of the job are what they would have been if the
crashed node had never been part of the job to begin with).