You don't specify, but based on your description I infer that you are not using a batch/queueing system, just an rsh/ssh-based start-up mechanism.
You are absolutely correct. I am using an rsh/ssh-based start-up mechanism.
A batch/queueing system might be able to tell you whether a remote computer is still accessible.
Right now I don't know anything about batch/queueing systems, but I will look into them as well. I assume you mean it could tell me this before the jobs are launched.
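
For illustration, a pre-launch reachability check that does not depend on a batch system could be sketched as below. This is only a sketch under assumptions: it assumes each node accepts TCP connections on the ssh port (22), and the names is_reachable and filter_reachable are made up for this example. A successful TCP connect only shows that something is listening on that port, not that an MPI job can actually be started there.

```python
import socket


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: the node runs an ssh daemon on `port`. Connection refusals,
    timeouts, and name-resolution failures are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unknown hosts.
        return False


def filter_reachable(hosts, port=22, timeout=2.0):
    """Return only the hosts that currently accept a connection, in order."""
    return [h for h in hosts if is_reachable(h, port, timeout)]
```

The resulting list could then be written to a hostfile and passed to mpirun with --hostfile. Note that this only helps before launch; it does nothing about a node that becomes unreachable while the job is running.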
I think that MPI is not the proper mechanism to achieve what you want. PVM or, maybe better, direct socket programming will probably serve you better.
I will think about these as well.
I have already spent a significant amount of time on LAM/MPI and Open MPI, and due to lack of time I don't want to switch to another mechanism. Anyway, Open MPI is working great for me; it does at least 80% of what I want.