Thank you all, Jeff, Jody, Prentice and Bogdan, for your invaluable clarifications, solutions and suggestions.
Open MPI should return a failure if TCP connectivity is lost, even with a non-blocking point-to-point operation. The failure should be returned in the call to MPI_TEST (and friends).
So I'm not sure your timeout has meaning here -- if you reach the timeout, I think it simply means that the MPI communication has not completed yet. It does not necessarily mean that the MPI communication has failed.
MPI doesn't provide any standard way to check reachability and/or health of a peer process.
How about you start your MPI program from a shell script that does the
following:
1. Reads a text file containing the names of all the possible candidates
for MPI nodes
2. Loops through the list of names from (1) and pings each machine to
see if it's alive. If the host is pingable, it writes its name to a
different text file, which will be used as the machine file for the
mpirun command
3. Calls mpirun using the machine file generated in (2).
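
The three steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested launcher: the file names candidates.txt and machinefile.txt are made up for illustration, the ping options assume Linux iputils (-W is a timeout in seconds there and means something else on other systems), and --hostfile is the Open MPI spelling of the machine file option.

```shell
#!/bin/sh
# Sketch of the ping-then-launch wrapper described above.
# Assumed (hypothetical) file names, not taken from the original post:
CANDIDATES=candidates.txt     # one host name per line
MACHINEFILE=machinefile.txt   # rewritten on every run

# is_alive HOST: succeeds if HOST answers a single ping.
# Assumes Linux iputils ping, where -W 2 means "wait up to 2 seconds".
is_alive() {
    ping -c 1 -W 2 "$1" </dev/null >/dev/null 2>&1
}

# write_machinefile CANDIDATES OUT: copy every reachable host to OUT.
write_machinefile() {
    : > "$2"                                  # start with an empty file
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;              # skip blank lines and comments
        esac
        if is_alive "$host"; then
            printf '%s\n' "$host" >> "$2"
        fi
    done < "$1"
}

# Only launch when a program to run was given on the command line.
if [ "$#" -ge 1 ]; then
    write_machinefile "$CANDIDATES" "$MACHINEFILE"
    if [ ! -s "$MACHINEFILE" ]; then
        echo "no reachable hosts found in $CANDIDATES" >&2
        exit 1
    fi
    # --hostfile is Open MPI's option; other MPI launchers may differ.
    exec mpirun --hostfile "$MACHINEFILE" "$@"
fi
```

If this were saved as, say, launch.sh, it would be called as "sh launch.sh ./my_mpi_program arg1 arg2". Note that a host answering ping only shows it is up at launch time; it does not show that MPI can start processes on it, and it does nothing about a node that fails later during the run.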