On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote:
> Hi, everybody.
> I'm having trouble where one of my client nodes crashes while I have
> an MPI job on it. When this happens, the mpirun process on the head
> node never returns.
This shouldn't happen - we should cleanly abort. What version are you
using?
> I can kill it with a SIGINT (ctrl-c) and it still cleans up its
> child processes on the remaining healthy client nodes but I don't
> get any of the results from those client processes.
At the moment, we SIGTERM the remaining healthy children when you
ctrl-c. I do believe that Rolf (Sun) put some code in our development
trunk that first hits the procs with a signal that they can catch to
clean up before being whacked, but that isn't in a release yet
(assuming I remember it right anyway). If I'm mis-remembering, I can
certainly add it.
Sounds like something we should do, assuming the MPI std allows it
(and mechanics work out).
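
For illustration only, here is a rough sketch of what the application
side of "catch a signal and clean up" could look like. It is not Open
MPI code. It assumes the runtime delivers a catchable signal (SIGTERM
is used here purely as an assumption) and leaves enough time before any
hard kill. GracefulWorker, request_stop, and install_handler are
hypothetical names made up for this example.

```python
import signal


class GracefulWorker:
    """Keeps partial results and stops cleanly once a stop is requested."""

    def __init__(self):
        self.stop_requested = False
        self.results = []

    def request_stop(self, signum, frame):
        # Signal handler: only set a flag. The main loop does the real
        # cleanup, so no heavy work happens inside the handler itself.
        self.stop_requested = True

    def run(self, work_items):
        for item in work_items:
            if self.stop_requested:
                # Stop early; whatever is in self.results so far can
                # still be written out by the caller.
                break
            self.results.append(item * item)  # stand-in for real work
        return self.results


def install_handler(worker):
    # Assumption: SIGTERM is the catchable signal sent before a hard kill.
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, worker.request_stop)
```

A process would call install_handler(worker) once at startup, then
worker.run(...), and write out worker.results before exiting. Whether
the results ever reach the head node still depends on what the runtime
does after sending the signal.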
> Does anybody have any ideas about how I could create a more fault-
> tolerant MPI job? In an ideal world, my head node would report that
> it lost the connection to a client node and keep going as if that
> client never existed (so that the results of the job are what they
> would have been if the crashed node wasn't part of the job to begin
> with).
That would be nice...but I'm not sure anyone knows how to do that
right now. The problem is that MPI operations involving ranks on that
client node will suddenly hang without warning, and there is no way to
know that something is wrong.
There is work going on to enable what you describe, but it is still in
the research phase.