Thank you all, Jeff, Jody, Prentice and Bogdan, for your invaluable
clarifications, solutions and suggestions.
> Open MPI should return a failure if TCP connectivity is lost, even with a
> non-blocking point-to-point operation. The failure should be returned in
> the call to MPI_TEST (and friends).
Even if MPI_TEST is a local operation?
> So I'm not sure your timeout has meaning here -- if you reach the timeout,
> I think it simply means that the MPI communication has not completed yet.
> It does not necessarily mean that the MPI communication has failed.
>
You are absolutely correct, but the job must finish before the timeout
expires. That's the reason I am using TIMEOUT.
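
To make the distinction above concrete, here is a minimal sketch of polling
a non-blocking operation against a deadline. It is written in Python for
brevity and is not code from this thread: `wait_with_deadline` is a
hypothetical helper, and `test` stands for any zero-argument callable that
reports completion (with mpi4py, for example, the `Test` method of a request
returned by `Isend`/`Irecv`; that usage is an assumption). Reaching the
deadline only yields "timed_out"; a failure reported by the MPI library would
instead surface from `test()` itself.

```python
import time


def wait_with_deadline(test, timeout_s, poll_interval_s=0.01,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll test() until it returns True or timeout_s seconds have elapsed.

    Returns "completed" if test() reported completion, "timed_out" otherwise.
    "timed_out" only means "not finished yet"; it says nothing about whether
    the peer is alive. Any exception raised by test() propagates unchanged,
    which is how an error reported by the underlying library would show up.
    """
    deadline = clock() + timeout_s
    while True:
        if test():
            return "completed"
        if clock() >= deadline:
            return "timed_out"
        sleep(poll_interval_s)
```

On "timed_out" the application still has to decide what to do (cancel the
request, abort the job, retry); the sketch only separates the two outcomes.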
So the conclusion is:
>
> MPI doesn't provide any standard way to check reachability and/or health
> of a peer process.
That's what I wanted to confirm, and to find out the solution, if any, or
any alternative.
So now I think I should go for Jody's approach:
>
> How about you start your MPI program from a shell script that does the
> following:
>
> 1. Reads a text file containing the names of all the possible candidates
> for MPI nodes
>
> 2. Loops through the list of names from (1) and pings each machine to
> see if it's alive. If the host is pingable, then write its name to a
> different text file which will be used as the machine file for the
> mpirun command
>
>
> 3. Call mpirun using the machine file generated in (2).
>
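
The three steps above can be sketched roughly as follows. This is an
illustration under stated assumptions, not a tested launcher: it is written
in Python rather than as a shell script, the function names are
hypothetical, the `ping` flags (`-c 1 -W <seconds>`) assume a Linux-style
`ping`, and `launch` assumes Open MPI's `mpirun` with its `--machinefile`
and `-np` options.

```python
import subprocess


def is_pingable(host, timeout_s=2):
    """Return True if a single ping to host succeeds.

    Assumes a Linux-style ping: -c 1 sends one probe, -W is the reply
    timeout in seconds. Other platforms use different flags.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping itself could not be started; treat the host as unreachable.
        return False
    return result.returncode == 0


def write_machine_file(candidates_path, machine_path, check=is_pingable):
    """Steps 1 and 2: read candidate hosts, keep those that pass check."""
    with open(candidates_path) as f:
        candidates = [line.strip() for line in f if line.strip()]
    alive = [host for host in candidates if check(host)]
    with open(machine_path, "w") as f:
        for host in alive:
            f.write(host + "\n")
    return alive


def launch(machine_path, nprocs, program):
    """Step 3: start the MPI job on the hosts that answered the ping."""
    return subprocess.call(
        ["mpirun", "--machinefile", machine_path, "-np", str(nprocs), program]
    )
```

For example, `write_machine_file("candidates.txt", "machines.txt")` followed
by `launch("machines.txt", 4, "./my_mpi_app")`, where the file and program
names are placeholders. Note that this only filters hosts at launch time; it
does not detect a peer that becomes unreachable after the job has started.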
I am assuming processes have been launched successfully.
--
Vipin K.
Research Engineer,
C-DOTB, India