Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Network connection check
From: vipin kumar (vipinkumar41_at_[hidden])
Date: 2009-07-23 10:25:26


Thank you all, Jeff, Jody, Prentice and Bogdan, for your invaluable
clarifications, solutions and suggestions.

> Open MPI should return a failure if TCP connectivity is lost, even with a
> non-blocking point-to-point operation. The failure should be returned in
> the call to MPI_TEST (and friends).

Even if MPI_TEST is a local operation?

> So I'm not sure your timeout has meaning here -- if you reach the timeout,
> I think it simply means that the MPI communication has not completed yet.
> It does not necessarily mean that the MPI communication has failed.
>

You are absolutely correct, but the job has to finish before the timeout
expires. That is the reason I am using a TIMEOUT.

So the conclusion is:

>
> MPI doesn't provide any standard way to check reachability and/or health
> of a peer process.

That's what I wanted to confirm, and to find a solution or an
alternative, if any.

So now I think I should go with Jody's approach:

>
> How about you start your MPI program from a shell script that does the
> following:
>
> 1. Reads a text file containing the names of all the possible candidates
> for MPI nodes
>
> 2. Loops through the list of names from (1) and pings each machine to
> see if it's alive. If the host is pingable, then write its name to a
> different text file which will be used as the machine file for the
> mpirun command
>

>
> 3. Call mpirun using the machine file generated in (2).
>
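
The three steps quoted above could be sketched roughly as follows. This is
only an illustration under assumptions, not a script anyone in this thread
posted: the function names (is_alive, build_machinefile) and the file name
machinefile.txt are made up, the candidates file is assumed to hold one
hostname per line, and the ping flags assume Linux iputils ping (-W is the
reply timeout in seconds there; other ping implementations differ). Note
that it checks reachability once, before launch, and says nothing about
the health of processes that are already running.

```shell
#!/bin/sh
# Sketch only: function and file names are illustrative, not from the thread.

# Return success if the host answers one ping.
# Assumes Linux iputils ping, where -W is the reply timeout in seconds.
is_alive() {
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# Steps 1 and 2: read candidate hosts from $1, write the reachable ones to $2.
build_machinefile() {
    : > "$2"                          # start with an empty machine file
    while read -r host || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        # </dev/null keeps the probe from consuming the rest of the host list
        if is_alive "$host" < /dev/null; then
            printf '%s\n' "$host" >> "$2"
        fi
    done < "$1"
}

# Step 3: launch only when called as: script CANDIDATES_FILE MPIRUN_ARGS...
if [ "$#" -ge 2 ]; then
    candidates=$1
    shift
    build_machinefile "$candidates" machinefile.txt
    if [ ! -s machinefile.txt ]; then
        echo "no reachable hosts" >&2
        exit 1
    fi
    mpirun --machinefile machinefile.txt "$@"
fi
```

Saved as, say, launch.sh, it would be invoked along the lines of
"./launch.sh candidates.txt -np 4 ./my_mpi_app", where everything after the
candidates file is passed to mpirun unchanged.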

I am assuming processes have been launched successfully.

-- 
Vipin K.
Research Engineer,
C-DOTB, India