How about changing the default error handler?
It is not supposed to work, and if you find an MPI implementation
that supports this approach, please tell me. I know the paper where you
read about this, but even with their MPI library this approach does
not work.
Soon, Open MPI will be able to support this feature. Several
fault-tolerant modes are under way, but there is no precise timeline yet.
On Oct 26, 2006, at 10:19 AM, Laurent.POREZ_at_[hidden] wrote:
> I developed a launcher application:
> an MPI application (say main_exe) launches 2 MPI applications (say
> exe1 and exe2), using MPI_Comm_spawn_multiple.
> Now, I'm looking at the behavior when an exe crashes.
> What I can see is the following:
> 1) when everybody is launched, I see the following processes,
> using 'ps':
> - the 'mpiexec -v -d -n 1 ./main_exe' command
> - the orted server used for 'main_exe' (say 'orted1')
> - main_exe
> - the orted server used for 'exe1' and 'exe2' (say 'orted2')
> - exe1
> - exe2
> 2) I use kill -9 to 'crash' exe2
> 3) orted2 and exe1 finish.
> 4) with ps, I see that the following processes remain: mpiexec,
> 'orted1', main_exe
> 5) main_exe tries to send a message to exe1, using MPI_Bsend:
> main_exe gets killed by a SIGPIPE signal!!!!
> So what I see is that when a part of an MPI application crashes,
> the whole application crashes!
> Is there a way to get another behavior? For example, MPI_Bsend
> could return an error message.
> Some additional information:
> - I work on Linux, with Open MPI 1.1.1.
> - I'm developing in C and C++.