Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-27 17:52:26


On Oct 27, 2006, at 10:56 AM, Laurent.POREZ_at_[hidden] wrote:

> I did change the default error handler (using
> MPI_Comm_set_errhandler) in the main_exe program. I replaced it
> with a printf.
> My error handler is never called, but main_exe receives a SIGPIPE
> signal.
> So the only solution I found is to catch SIGPIPE and ignore it...

I wonder how this SIGPIPE gets generated ... and why we didn't catch it.
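
For reference, the "catch SIGPIPE and ignore it" workaround can be illustrated without MPI at all. The sketch below is plain POSIX C, not Open MPI code, and it assumes (the thread does not confirm this) that the signal comes from a write to a connection whose other end has gone away. The helper name write_after_reader_closed is hypothetical. With SIGPIPE ignored, the failed write is reported through errno as EPIPE instead of killing the process.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only (not part of Open MPI).
 * Writes to a pipe whose read end has been closed, with SIGPIPE ignored.
 * Returns the errno of the failed write, 0 if the write unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
int write_after_reader_closed(void)
{
    int fds[2];
    int saved = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Ignore SIGPIPE so a write to a dead peer fails with an error
       code instead of terminating the process. */
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Simulate the peer dying: nobody can read from the pipe anymore. */
    close(fds[0]);

    /* Without SIG_IGN this write would deliver SIGPIPE to the process. */
    if (write(fds[1], "x", 1) < 0)
        saved = errno;

    close(fds[1]);
    return saved;
}
```

Calling write_after_reader_closed() from main and comparing the result with EPIPE shows the effect. Note that the helper leaves SIGPIPE ignored for the rest of the process, which is the same trade-off as the workaround described above.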

>
>> It is not supposed to work, and if you find an MPI implementation
>> that supports this approach please tell me. I know the paper
>> where you read about this, but even with their MPI library this
>> approach does not work.
>
> Which paper are you talking about?

I was talking about W. Gropp's paper called "Fault Tolerance in MPI
Programs". I don't remember where it was published; it might have been
one of the Euro PVM/MPI conferences. Here is a link to the paper:
http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf

   Thanks,
     george.

>
> Thanks,
> Laurent.
>
>>
>> Thanks,
>> george.
>>
>> On Oct 26, 2006, at 10:19 AM, Laurent.POREZ_at_[hidden] wrote:
>>
>>> Hi,
>>>
>>> I developed a launcher application:
>>> an MPI application (say main_exe) launches 2 MPI applications (say
>>> exe1 and exe2), using MPI_Comm_spawn_multiple.
>>>
>>> Now, I'm looking at the behavior when an exe crashes.
>>>
>>> What I can see is the following :
>>> 1) when everything is launched, I see the following processes,
>>> using 'ps' :
>>> - the 'mpiexec -v -d -n 1 ./main_exe' command
>>> - the orted server used for 'main_exe' (say 'orted1')
>>> - main_exe
>>> - the orted server used for 'exe1' and 'exe2' (say 'orted2')
>>> - exe1
>>> - exe2
>>>
>>> 2) I use kill -9 to 'crash' exe2
>>>
>>> 3) orted2 and exe1 finish.
>>>
>>> 4) with ps, I see that the following processes remain : mpiexec,
>>> 'orted1', main_exe
>>>
>>> 5) main_exe tries to send a message to exe1, using MPI_Bsend :
>>> main_exe gets killed by a SIGPIPE signal !!!!
>>>
>>> So what I see is that when a part of an MPI application crashes,
>>> the whole application crashes !
>>> Is there a way to get a different behavior ? For example, MPI_Bsend
>>> could return an error code.
>>>
>>> Some additional information :
>>> - I work on Linux, with Open MPI 1.1.1.
>>> - I'm developing in C and C++.
>>>
>>> Thanks,
>>> Laurent.
>>