You're not doing anything wrong; it's just that Open MPI doesn't [yet]
handle failures well. The spawn call will probably *eventually* time
out (and therefore fail).

You might want to run a real resource manager to manage your cluster,
such as SLURM, Torque, or one of several commercial solutions. These
typically run some kind of daemon on each node, so they get fairly good
notifications when nodes go down, etc.
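
If a resource manager is more than you need right now, one stopgap is to
check each node outside of MPI before you try to spawn on it. Below is a
rough sketch of that idea in C, not anything Open MPI provides:
node_is_reachable() is a hypothetical helper name, and the choice of port
is an assumption (for example "22" if your nodes run sshd and you launch
over ssh). It does a non-blocking TCP connect with a timeout, so a dead
node costs you at most timeout_ms instead of a hang.

```c
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPI or Open MPI): returns 1 if a TCP
 * connection to host:port completes within timeout_ms, 0 otherwise. */
int node_is_reachable(const char *host, const char *port, int timeout_ms)
{
    struct addrinfo hints, *res, *ai;
    int reachable = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return 0;                     /* name or port could not be resolved */

    for (ai = res; ai != NULL && !reachable; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;

        /* Non-blocking, so a dead node cannot stall connect(). */
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            close(fd);
            continue;
        }

        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            reachable = 1;            /* connected immediately */
        } else if (errno == EINPROGRESS) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int err = 0;
            socklen_t len = sizeof err;
            /* Wait for the handshake to finish or the timeout to expire,
             * then ask the socket whether the connect succeeded. */
            if (poll(&pfd, 1, timeout_ms) == 1 &&
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                reachable = 1;
        }
        close(fd);
    }
    freeaddrinfo(res);
    return reachable;
}
```

The supervisor could call this for each host and only pass the ones that
answer to MPI_Comm_spawn_multiple. Keep in mind that a node answering on
that port only tells you it is up on the network; it does not prove that
an MPI process can actually start there.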
> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of
> Sent: Tuesday, April 25, 2006 4:58 AM
> To: users_at_[hidden]
> Subject: [OMPI users] Checking the cluster status
> Before starting programs on my cluster, I want to check on
> every CPU if it is up and able to run MPI applications.
> For this, I use a kind of 'ping' program that just sends a
> message saying 'I'm OK' to a supervisor program.
> The supervisor launches the 'ping' program on each CPU with
> the MPI_Comm_spawn_multiple command.
> It works fine when every CPU is up, but when one is down, my
> supervisor stops when calling the MPI_Comm_spawn_multiple command.
> So the questions are:
> * 'What am I doing wrong?'
> * 'Is there another way to check my CPUs?'
> Thanks for your help.