Open MPI User's Mailing List Archives

From: Laurent.POREZ_at_[hidden]
Date: 2006-10-27 10:56:15


> From: George Bosilca <bosilca_at_[hidden]>
> Subject: Re: [OMPI users] Error Handling Problem
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <EF68521D-C116-4E75-8FAC-5CE918E56D15_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> How about changing the default error handler ?

I did change the default error handler (using MPI_Comm_set_errhandler) in the main_exe program; my replacement handler just calls printf.
My error handler is never called; instead, main_exe receives a SIGPIPE signal.
So the only solution I found is to catch SIGPIPE and ignore it...
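
For reference, the SIGPIPE workaround described above can be sketched in plain POSIX C, without any MPI calls. The helper names ignore_sigpipe and write_to_broken_pipe are made up for this illustration. The sketch only shows the operating-system behavior: once SIGPIPE is ignored, a write to a broken pipe fails with EPIPE instead of killing the process. Whether Open MPI then reports that failure through the MPI error handler is not something this sketch demonstrates.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGPIPE for the whole process.
 * Returns 0 on success, -1 on failure (as sigaction does). */
static int ignore_sigpipe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Hypothetical helper: write one byte to a pipe whose read end is
 * already closed, imitating a peer process that has died.
 * Returns the errno set by write(), 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
static int write_to_broken_pipe(void)
{
    int fds[2];
    ssize_t n;
    int err;

    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                 /* no reader is left */
    n = write(fds[1], "x", 1);     /* would raise SIGPIPE if not ignored */
    err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

In a launcher like main_exe, ignore_sigpipe() would presumably be called once near the start of main. Whether it should come before or after MPI_Init is an assumption to check, since an MPI library may install signal handlers of its own.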

> It is not supposed to work, and if you find an MPI implementation
> that supports this approach please tell me. I know the paper
> where you read about this, but even with their MPI library this
> approach does not work.

Which paper are you talking about?

>
> Soon, Open MPI will be able to support this feature. Several
> fault-tolerant modes are under way, but there is no precise timeline yet.

OK. I'll keep watching for new versions of Open MPI.

Thanks,
        Laurent.

>
> Thanks,
> george.
>
> On Oct 26, 2006, at 10:19 AM, Laurent.POREZ_at_[hidden] wrote:
>
> > Hi,
> >
> > I developed a launcher application:
> > an MPI application (say main_exe) launches 2 MPI applications (say
> > exe1 and exe2), using MPI_Comm_spawn_multiple.
> >
> > Now, I'm looking at the behavior when an exe crashes.
> >
> > What I can see is the following:
> > 1) when everybody is launched, I see the following processes,
> > using 'ps':
> > - the 'mpiexec -v -d -n 1 ./main_exe' command
> > - the orted server used for 'main_exe' (say 'orted1')
> > - main_exe
> > - the orted server used for 'exe1' and 'exe2' (say 'orted2')
> > - exe1
> > - exe2
> >
> > 2) I use kill -9 to 'crash' exe2
> >
> > 3) orted2 and exe1 finish.
> >
> > 4) with ps, I see that the following processes remain: mpiexec,
> > 'orted1', main_exe
> >
> > 5) main_exe tries to send a message to exe1, using MPI_Bsend:
> > main_exe gets killed by a SIGPIPE signal!
> >
> > So what I see is that when a part of an MPI application crashes,
> > the whole application crashes!
> > Is there a way to get different behavior? For example, MPI_Bsend
> > could return an error.
> >
> > Some additional information:
> > - I work on Linux, with Open MPI 1.1.1.
> > - I'm developing in C and C++.
> >
> > Thanks,
> > Laurent.
>