Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Network connection check
From: vipin kumar (vipinkumar41_at_[hidden])
Date: 2009-07-23 07:36:54

On Thu, Jul 23, 2009 at 3:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> It depends on which network fails. If you lose all TCP connectivity, Open
> MPI should abort the job as the out-of-band system will detect the loss of
> connection. If you only lose the MPI connection (whether TCP or some other
> interconnect), then I believe the system will eventually generate an error
> after it retries sending the message a specified number of times, though it
> may not abort.

Thank you, Ralph,

From your reply I realize that my earlier question did not describe the
problem properly.

I can't use blocking communication routines in my main program
("masterprocess") because any kind of network failure may occur (loss of
physical connectivity, of TCP connectivity, or of the MPI connection, as you
described). So I am using non-blocking point-to-point communication routines
and TEST later for completion of the Request. Once I enter the TEST loop, I
test for Request completion until TIMEOUT. Suppose TIMEOUT has occurred. In
that case I will first check:

 1: whether the slave machine is reachable (how do I do that, given that I
have the IP address and host name of the slave machine?)

 2: if it is reachable, whether the programs (orted and "slaveprocess") are
still alive.
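
For check 1, one possible approach is a plain TCP connection attempt with a
timeout, made outside of MPI. This is only a sketch under assumptions:
tcp_reachable is a hypothetical helper name, not an Open MPI function, and the
port to probe is an assumption too (for example "22" if sshd runs on the slave).
A TCP probe is used instead of an ICMP ping because raw ICMP sockets usually
need elevated privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a TCP connection to host:port completes
 * within timeout_ms, 0 otherwise. A refused connection also yields 0, even
 * though it shows that the machine itself answered. */
int tcp_reachable(const char *host, const char *port, int timeout_ms)
{
    struct addrinfo hints, *res, *ai;
    int ok = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return 0;                     /* name could not be resolved */

    for (ai = res; ai != NULL && !ok; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        /* Non-blocking mode so connect() cannot hang on a dead link. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            ok = 1;                   /* connected immediately */
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int err = 0;
            socklen_t len = sizeof err;
            /* Wait for the handshake to finish, then read its result. */
            if (poll(&pfd, 1, timeout_ms) == 1 &&
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                ok = 1;
        }
        close(fd);
    }
    freeaddrinfo(res);
    return ok;
}
```

A call such as tcp_reachable("slave-host", "22", 2000) would then cover check 1
only; it says nothing about whether orted or "slaveprocess" is still running
(check 2).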

In case 1 I don't want to abort my master process; I hope that the network
connection will come back up later. Fortunately, Open MPI doesn't abort either
process. Both processes can run independently without communicating.
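
The TEST-until-TIMEOUT loop described earlier could be sketched roughly as
follows. MPI is deliberately kept out of the sketch so that it compiles on its
own: the completion test is passed in as a callback, which in the real master
process would presumably wrap MPI_Test on the pending request and return its
flag. The names wait_with_timeout, test_fn and countdown_done are hypothetical,
and countdown_done is only a stand-in for that MPI_Test wrapper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Callback that returns nonzero once the pending operation has completed.
 * In an MPI program it would call MPI_Test(&request, &flag, &status)
 * and return flag. */
typedef int (*test_fn)(void *ctx);

/* Monotonic clock, so wall-clock adjustments do not disturb the timeout. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Polls test() until it reports completion or timeout_s seconds elapse.
 * Returns 1 on completion, 0 on timeout; it never aborts the caller. */
int wait_with_timeout(test_fn test, void *ctx, double timeout_s)
{
    const struct timespec pause = { 0, 1000000 };  /* 1 ms between polls */
    double deadline = now_seconds() + timeout_s;

    for (;;) {
        if (test(ctx))
            return 1;
        if (now_seconds() >= deadline)
            return 0;
        nanosleep(&pause, NULL);
    }
}

/* Stand-in for the MPI_Test wrapper: reports completion once the counter
 * that ctx points to has been decremented to zero. */
int countdown_done(void *ctx)
{
    int *left = ctx;
    if (*left > 0)
        (*left)--;
    return *left == 0;
}
```

On a 0 return the master can run its reachability checks and carry on with
other work, then call the function again later with the same request, which
matches the intent of not aborting when the slave is temporarily unreachable.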

Thanks and Regards,

Vipin K.
Research Engineer,
C-DOTB, India