On Jul 23, 2009, at 7:36 AM, vipin kumar wrote:
> I can't use blocking communication routines in my main program
> ("masterprocess") because any type of network failure (whether due
> to physical connectivity, TCP connectivity, or the MPI connection, as
> you said) may occur. So I am using non-blocking point-to-point
> communication routines and TEST later for completion of the
> request. Once I enter the TEST loop, I test for request completion
> until TIMEOUT. Suppose TIMEOUT has occurred; in this case I will first
> check whether
Open MPI should return a failure if TCP connectivity is lost, even
with a non-blocking point-to-point operation. The failure should be
returned in the call to MPI_TEST (and friends). So I'm not sure your
timeout has meaning here -- if you reach the timeout, I think it
simply means that the MPI communication has not completed yet. It
does not necessarily mean that the MPI communication has failed.
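To illustrate the difference between "failed" and "not complete yet", here is a minimal sketch in C of a polling loop with three distinct outcomes. It is not Open MPI code: poll_until, test_fn, and the three demo callbacks are hypothetical names, and the callback only stands in for a call such as MPI_TEST. In a real MPI program the callback would wrap MPI_Test and compare its return value against MPI_SUCCESS. Note that MPI only returns errors to the caller if the communicator's error handler has been set to MPI_ERRORS_RETURN; the default handler, MPI_ERRORS_ARE_FATAL, aborts the job instead. The sketch uses time() for portability, so the timeout has one-second resolution; MPI_Wtime would be the natural clock inside an MPI program.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* The three outcomes a polling loop should keep apart. */
typedef enum { POLL_COMPLETED, POLL_FAILED, POLL_TIMED_OUT } poll_result;

/* Hypothetical callback type: tests one pending operation.
 * Returns 0 if the test call itself succeeded and nonzero if it reported
 * an error; sets *done to 1 once the operation has completed. */
typedef int (*test_fn)(void *ctx, int *done);

/* Poll `test` until it completes, reports an error, or timeout_sec elapses. */
poll_result poll_until(test_fn test, void *ctx, double timeout_sec)
{
    assert(test != NULL && timeout_sec >= 0.0);
    time_t start = time(NULL);

    for (;;) {
        int done = 0;
        if (test(ctx, &done) != 0)
            return POLL_FAILED;     /* the test call reported an error */
        if (done)
            return POLL_COMPLETED;  /* the operation finished */
        if (difftime(time(NULL), start) >= timeout_sec)
            return POLL_TIMED_OUT;  /* still pending; NOT known to have failed */
    }
}

/* Demo callbacks standing in for a real test call (assumptions for
 * illustration only). ctx points to an int counting remaining polls. */
int completes_after_n(void *ctx, int *done)
{
    int *left = ctx;
    if (*left > 0)
        (*left)--;
    *done = (*left == 0);
    return 0;
}

int never_completes(void *ctx, int *done)
{
    (void)ctx;
    *done = 0;
    return 0;
}

int reports_error(void *ctx, int *done)
{
    (void)ctx;
    *done = 0;
    return 1;
}
```

Only POLL_FAILED says anything about a failure. POLL_TIMED_OUT just means the operation was still pending when the caller stopped waiting, so the request is still active and can be tested again later.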
> 1: Whether the slave machine is reachable (how do I do that, given
> that I have the IP address and host name of the slave machine?)
> 2: If reachable, whether the programs (orted and "slaveprocess")
> are alive.
MPI doesn't provide any standard way to check reachability and/or
health of a peer process.
That being said, I think some academic groups are working on more
fault-tolerant / resilient MPI messaging, but I don't know whether
they're ready to talk about those efforts publicly yet.
--
Jeff Squyres
jsquyres_at_[hidden]