Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Network connection check
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-23 08:06:56

On Jul 23, 2009, at 7:36 AM, vipin kumar wrote:

> I can't use blocking communication routines in my main program
> ("masterprocess") because any type of network failure (whether due
> to physical connectivity, TCP connectivity, or the MPI connection, as
> you said) may occur. So I am using non-blocking point-to-point
> communication routines and TEST later for completion of that
> request. Once I enter a TEST loop, I test for request completion
> until TIMEOUT. Suppose TIMEOUT has occurred; in that case I will first
> check whether

Open MPI should return a failure if TCP connectivity is lost, even
with a non-blocking point-to-point operation. The failure should be
returned in the call to MPI_TEST (and friends). So I'm not sure your
timeout has meaning here -- if you reach the timeout, I think it
simply means that the MPI communication has not completed yet. It
does not necessarily mean that the MPI communication has failed.
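
To make the distinction concrete, here is a minimal sketch of a test loop that reports "failed" and "not yet complete" as separate outcomes. It is an illustration, not Open MPI code: the names `Outcome` and `poll_request` are hypothetical, and it assumes an mpi4py-style request object whose `Test()` method returns True on completion and raises an exception on error. In the C API the equivalent would be checking the return code of MPI_TEST, which requires setting the MPI_ERRORS_RETURN error handler on the communicator, since the default handler (MPI_ERRORS_ARE_FATAL) aborts the job instead of returning.

```python
import time
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"  # Test() reported completion
    FAILED = "failed"        # Test() raised, i.e. the library reported an error
    PENDING = "pending"      # timeout reached, operation still in flight


def poll_request(request, timeout_s, poll_interval_s=0.01):
    """Poll request.Test() until it completes, raises, or the timeout expires.

    `request` is assumed to behave like an mpi4py Request: Test() returns
    True once the operation has completed and raises on a reported error.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if request.Test():
                return Outcome.COMPLETED
        except Exception:
            # An error surfaced by the test call is a real failure.
            return Outcome.FAILED
        if time.monotonic() >= deadline:
            # Reaching the deadline only means "not finished yet".
            return Outcome.PENDING
        time.sleep(poll_interval_s)
```

With mpi4py, a call might look like `poll_request(comm.Irecv(buf, source=1), timeout_s=5.0)`, where `comm`, `buf`, and the source rank are placeholders. Only `Outcome.FAILED` indicates a reported communication failure; `Outcome.PENDING` leaves the request active, so it can be tested again later or cancelled.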

> 1: Whether the slave machine is reachable (how would I do that?
> Given: I have the IP address and host name of the slave machine.)

> 2: If reachable, whether the programs (orted and "slaveprocess")
> are alive.

MPI doesn't provide any standard way to check reachability and/or
health of a peer process.
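
Outside of MPI, one common approach to the reachability question is a plain TCP connection attempt with a timeout. The sketch below is an assumption-laden illustration, not an Open MPI feature: `tcp_reachable` is a hypothetical helper, and it presumes that some service on the slave machine (for example sshd) listens on a TCP port you know. It says nothing about whether orted or "slaveprocess" is alive.

```python
import socket


def tcp_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    `host` may be an IP address or a host name. Any OSError (connection
    refused, timeout, name-resolution failure) is reported as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A False result is ambiguous: the machine may be down, the network may be broken, or nothing may be listening on that port, so the check is best treated as a hint rather than a verdict.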

That being said, I think some of the academics are working on more
fault tolerant / resilient MPI messaging, but I don't know if they're
ready to talk about such efforts publicly yet.

Jeff Squyres