On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote:

Hi, everybody.

I'm having trouble where one of my client nodes crashes while I have an MPI job on it. When this happens, the mpirun process on the head node never returns.

This shouldn't happen - we should cleanly abort. What version are you using?

I can kill it with a SIGINT (ctrl-c) and it still cleans up its child processes on the remaining healthy client nodes, but I don't get any of the results from those client processes.

At the moment, we send SIGTERM to the remaining healthy children when you ctrl-c. I do believe that Rolf (Sun) put some code in our development trunk that first hits the procs with a signal that they can catch to clean up before being whacked, but that isn't in a release yet (assuming I remember it right anyway). If I'm mis-remembering, I can certainly add that capability.

Sounds like something we should do, assuming the MPI std allows it (and mechanics work out).
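
For what it's worth, here is a minimal sketch of what "catching the signal to clean up" could look like on the application side. It is plain Python with no MPI calls, and every name in it (request_stop, save_partial, run, install_handlers) is made up for illustration. It assumes the signal actually reaches your process and that there is enough time to write a file before any forced kill, neither of which is guaranteed by anything said in this thread.

```python
import json
import os
import signal
import tempfile

# Set by the handler; the work loop polls it between iterations.
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request; real cleanup happens in run()."""
    global stop_requested
    stop_requested = True


def install_handlers():
    """Register the handler (hypothetical helper; call once from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def save_partial(results, path):
    """Write to a temp file, then rename, so a later kill cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(results, f)
    os.replace(tmp, path)


def run(work_items, path):
    """Process items until done or a stop is requested; save whatever exists."""
    results = []
    for item in work_items:
        if stop_requested:
            break
        results.append(item * item)  # placeholder for the real computation
    save_partial(results, path)
    return results
```

The idea is that each process calls install_handlers() at startup and then run(...) with its share of the work, so an interrupted job still leaves the results computed so far on disk. The handler only sets a flag on purpose: doing file I/O inside a signal handler is easy to get wrong.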

Does anybody have any ideas about how I could create a more fault-tolerant MPI job? In an ideal world, my head node would report that it lost the connection to a client node and keep going as if that client never existed (so that the results of the job are what they would have been if the crashed-node wasn't part of the job to begin with).

That would be nice...but I'm not sure anyone knows how to do that right now. The problem is that MPI operations involving ranks on that client node will suddenly hang without warning, and there is no way to know that something is wrong.

There is work going on to enable what you describe, but it is still in the research phase.
