Hi,
Before starting programs on my cluster, I want to check that every CPU is up and able to run MPI applications.
For this, I use a kind of 'ping' program that just sends a message saying 'I'm OK' to a supervisor program.
The supervisor launches the 'ping' program on each CPU with the MPI_Comm_spawn_multiple command.
It works fine when every CPU is up, but when one is down, my supervisor stops when it calls MPI_Comm_spawn_multiple.
So my questions are:
* 'What am I doing wrong?'
* 'Is there another way to check my CPUs?'
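On the second question, one possible direction is to probe each node outside MPI, with a timeout, before spawning anything. The following is only a rough sketch under assumptions, not something run on the cluster: the helper names `check_hosts` and `default_probe` are hypothetical, and the probe assumes passwordless ssh to every node. It only shows that a node answers, not that MPI can actually run there.

```python
import subprocess
import sys


def default_probe(host):
    # Hypothetical probe command: assumes passwordless ssh to each node.
    return ["ssh", "-o", "BatchMode=yes", host, "true"]


def check_hosts(hosts, make_command=default_probe, timeout=5.0):
    """Return a dict mapping each host to True (answered) or False."""
    status = {}
    for host in hosts:
        try:
            result = subprocess.run(
                make_command(host),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            status[host] = (result.returncode == 0)
        except (subprocess.TimeoutExpired, OSError):
            # A node that hangs or cannot be reached counts as down.
            status[host] = False
    return status


if __name__ == "__main__":
    # Host names are taken from the command line.
    for host, ok in check_hosts(sys.argv[1:]).items():
        print(host, "OK" if ok else "DOWN")
```

The idea would be to pass only the nodes reported as up to MPI_Comm_spawn_multiple, so that a dead node can no longer block the supervisor.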
Thanks for your help.
Laurent.