Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-04-25 07:20:00


You're not doing anything wrong; it's just that Open MPI doesn't [yet]
handle failures well. The MPI_Comm_spawn_multiple call will probably
*eventually* time out (and therefore fail).

You might want to run a real resource manager to manage your cluster,
such as SLURM, Torque, or one of a bunch of commercial solutions. These
applications typically have some kind of daemon running on each node and
get fairly good notifications when nodes go down, etc.
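
As a rough illustration of checking nodes outside of MPI, here is a minimal
sketch that tests whether each host accepts a TCP connection within a short
timeout before anything is spawned on it. It is not an Open MPI feature. The
host names, the choice of port 22 (assuming each node runs an SSH daemon),
and the helper names `is_reachable` and `split_hosts` are all assumptions
made for this example. A successful connection only shows that the node is
reachable on the network, not that it can actually run MPI applications.

```python
import socket


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Port 22 is an assumption (an SSH daemon on every node); adjust as needed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def split_hosts(hosts, port=22, timeout=2.0):
    """Partition `hosts` into (up, down) lists, preserving the input order."""
    up, down = [], []
    for host in hosts:
        if is_reachable(host, port, timeout):
            up.append(host)
        else:
            down.append(host)
    return up, down


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = ["node01", "node02", "node03"]
    up, down = split_hosts(nodes)
    print("up:  ", up)
    print("down:", down)
```

The supervisor could then pass only the hosts in `up` to the spawn call. A
node can still go down between the check and the spawn, so this narrows the
problem rather than removing it.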

 

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of
> Laurent.POREZ_at_[hidden]
> Sent: Tuesday, April 25, 2006 4:58 AM
> To: users_at_[hidden]
> Subject: [OMPI users] Checking the cluster status
> with MPI_Comm_spawn_multiple
>
> Hi,
>
> Before starting programs on my cluster, I want to check whether
> every CPU is up and able to run MPI applications.
>
> For this, I use a kind of 'ping' program that just sends a
> message saying 'I'm OK' to a supervisor program.
> The supervisor launches the 'ping' program on each CPU with
> the MPI_Comm_spawn_multiple command.
>
> It works fine when every CPU is up, but when one is down, my
> supervisor stops when calling the MPI_Comm_spawn_multiple command.
>
> So the questions are:
> * 'What am I doing wrong?'
> * 'Is there another way to check my CPUs?'
>
> Thanks for your help.
>
> Laurent.