Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fault Tolerant Features in OpenMPI
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-08-12 06:43:48


Edson,

Based on your questions, I would suggest you take a look at the ULFM-enabled (User-Level Failure Mitigation) version of Open MPI. You can find it at http://fault-tolerance.org/.

George.

On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo <etcamargo_at_[hidden]> wrote:

> Thanks a lot for your reply, Ralph!
>
> Could you tell me in what situations the error handler would be called in
> version 1.6.5?
>
> I had thought that a process failure would be caught by the error
> handler. Wouldn't killing or aborting the process lead to the same behaviour?
>
> In the 1.7.4 release, if a process is killed, will the error handler be
> called?
>
> Thanks,
>
> Edson
> ---------------------
>
>> The error handler wouldn't be called in that situation - we simply abort
>> the job. We expect to provide that integration in something like the 1.7.4
>> release milestone.
>>
>>
>> On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo
>> <etcamargo_at_[hidden]> wrote:
>>
>>> Hi All,
>>>
>>> I was looking for posts about fault tolerance in MPI and I found the post
>>> below:
>>>
>>> http://www.open-mpi.org/community/lists/users/2012/06/19658.php
>>>
>>> I am trying to understand the failure detection work present in
>>> Open MPI. So I began with a simple ring application (ring.c) to
>>> understand error handlers. But it seems that it didn't work. Why not?
>>> (The code is below.)
>>>
>>> All processes were running on the same machine, started with the
>>> following command line:
>>>
>>> $ mpiexec -n 4 ring
>>>
>>> While the ring application was running, one of the processes was killed.
>>> The entire application then stopped (fine so far), but no error
>>> message was shown. Shouldn't the line if(error != MPI_SUCCESS) have
>>> caught this?
>>>
>>> I am using the mpiexec (OpenRTE) 1.6.5.
>>>
>>> Thanks in advance,
>>>
>>> Edson
>>>
>>> -----------------------------------------------
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> #include <time.h>
>>>
>>> int main( int argc, char *argv[] )
>>> {
>>>     int rank, size;
>>>     int n = 0;
>>>     int tag = 0;
>>>     int error;
>>>     int root = 0;
>>>     int next, previous;
>>>     double start = 0;
>>>     double finish = 0;
>>>
>>>     MPI_Status status;
>>>
>>>     MPI_Init( &argc, &argv );
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     // error handler
>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>
>>>     do {
>>>         next = (rank + 1) % (size);
>>>         n++;
>>>
>>>         if(rank != 0){
>>>             previous = (rank - 1);
>>>         }else{
>>>             previous = size - 1;
>>>         }
>>>
>>>         if (rank =