On Thu, Jul 23, 2009 at 3:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> It depends on which network fails. If you lose all TCP connectivity, Open
> MPI should abort the job as the out-of-band system will detect the loss of
> connection. If you only lose the MPI connection (whether TCP or some other
> interconnect), then I believe the system will eventually generate an error
> after it retries sending the message a specified number of times, though it
> may not abort.
Thank you Ralph,
From your reply I realized that my earlier question did not describe the
problem properly.
I can't use blocking communication routines in my main program
("masterprocess") because any type of network failure may occur (loss of
physical connectivity, TCP connectivity, or the MPI connection, as you said).
So I am using non-blocking point-to-point communication routines and TEST
later for completion of the Request. Once I enter the TEST loop, I test
for Request completion until TIMEOUT. If the TIMEOUT occurs, I will first
check whether:
1: the Slave machine is reachable (how can I do that, given that I have the
IP address and host name of the Slave machine?)
2: if it is reachable, the programs (orted and "slaveprocess") are alive.
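
For check 1, one possible approach is a TCP connect with a timeout. This is
only a sketch using plain POSIX sockets, not anything provided by Open MPI.
The helper name host_reachable and the choice of port are assumptions: the
port should be one the Slave is expected to answer on (for example its SSH
port, if sshd runs there). A refused connection is counted as "reachable",
because the refusal itself shows that the machine answered.

```c
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPI or Open MPI).
 * Returns  1 if the host answered on the port (connected or refused),
 *          0 if there was no answer within timeout_sec,
 *         -1 if the host name or address could not be resolved. */
int host_reachable(const char *host, const char *port, int timeout_sec)
{
    struct addrinfo hints, *res, *ai;
    int result = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one of them answers. */
    for (ai = res; ai != NULL && result == 0; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;

        /* Non-blocking mode, so connect() cannot hang the master. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            result = 1;               /* connected immediately */
        } else if (errno == ECONNREFUSED) {
            result = 1;               /* host is up, port is closed */
        } else if (errno == EINPROGRESS) {
            fd_set wfds;
            struct timeval tv;

            tv.tv_sec = timeout_sec;
            tv.tv_usec = 0;
            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);

            /* Wait until the connect attempt finishes or times out. */
            if (select(fd + 1, NULL, &wfds, NULL, &tv) > 0) {
                int err = 0;
                socklen_t len = sizeof err;

                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
                if (err == 0 || err == ECONNREFUSED)
                    result = 1;
            }
        }
        close(fd);
    }

    freeaddrinfo(res);
    return result;
}
```

The master could call, say, host_reachable("slave-host", "22", 5) after the
TIMEOUT and keep running when the result is 0. The host name "slave-host" and
port 22 are placeholders. A firewall that silently drops packets makes the
host look unreachable, so the result is a hint rather than proof.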
I don't want to abort my master process in case 1; I hope that the network
connection will come back up later. Fortunately, Open MPI doesn't abort any
process. Both processes can run independently without communicating.
Thanks and Regards,