Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Disconnections
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-01 18:00:47

On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote:

> Hi, everybody.
> I’m having trouble where one of my client nodes crashes while I have
> an MPI job on it. When this happens, the mpirun process on the head
> node never returns.

This shouldn't happen - we should cleanly abort. What version are you
using?

> I can kill it with a SIGINT (ctrl-c) and it still cleans up its
> child processes on the remaining healthy client nodes but I don’t
> get any of the results from those client processes.

At the moment, we SIGTERM the remaining healthy children when you hit
ctrl-c. I believe Rolf (Sun) put some code in our development trunk
that first hits the procs with a signal they can catch, so they can
clean up before being whacked, but that isn't in a release yet
(assuming I remember it right, anyway). If I'm misremembering, I can
certainly add that capability.

Sounds like something we should do, assuming the MPI std allows it
(and mechanics work out).

> Does anybody have any ideas about how I could create a more fault-
> tolerant MPI job? In an ideal world, my head node would report that
> it lost the connection to a client node and keep going as if that
> client never existed (so that the results of the job are what they
> would have been if the crashed node hadn’t been part of the job to
> begin with).

That would be nice...but I'm not sure anyone knows how to do that
right now. The problem is that MPI operations involving ranks on that
client node will suddenly hang without warning, and there is no way to
know that something is wrong.

There is work going on to enable what you describe, but it is still in
the research phase.
