Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.


Subject: Re: [OMPI users] Fault Tolerant Features in OpenMPI
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-08-12 06:43:48


Edson,

Based on your questions I would suggest you take a look at the ULFM-enabled version of Open MPI. You can find it at http://fault-tolerance.org/.

George.

On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo <etcamargo_at_[hidden]> wrote:

> Thanks a lot for your reply, Ralph!
>
> Could you tell me in what situation the error handler would be called in
> the 1.6.5 version?
>
> I had thought that a process failure would be caught by the error
> handler. Wouldn't killing or aborting the process trigger the same behaviour?
>
> In the 1.7.4 release, if a process is killed, will the error handler be
> called?
>
> Thanks,
>
> Edson
> ---------------------
>
>> The error handler wouldn't be called in that situation - we simply abort
>> the job. We expect to provide that integration in something like the 1.7.4
>> release milestone.
>>
>>
>> On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo
>> <etcamargo_at_[hidden]> wrote:
>>
>>> Hi All,
>>>
>>> I was looking for posts about fault tolerance in MPI and I found the post
>>> below:
>>>
>>> http://www.open-mpi.org/community/lists/users/2012/06/19658.php
>>>
>>> I am trying to understand all the work on failure detection present in
>>> Open MPI. So I began with a simple application, a ring application
>>> (ring.c), to understand error handlers. But it seems that it didn't
>>> work. Why not? (The code is below.)
>>>
>>> All processes of the application were running on the same machine,
>>> started with the following command line:
>>>
>>> $ mpiexec -n 4 ring
>>>
>>> While the ring application was running, one of the processes was killed.
>>> The entire application then stopped (fine so far), but it didn't show me
>>> the error message. Shouldn't the check if(error != MPI_SUCCESS) have
>>> caught it?
>>>
>>> I am using mpiexec (OpenRTE) 1.6.5.
>>>
>>> Thanks in advance,
>>>
>>> Edson
>>>
>>> -----------------------------------------------
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> #include <time.h>
>>>
>>> int main( int argc, char *argv[] )
>>> {
>>>     int rank, size;
>>>     int n = 0;
>>>     int tag = 0;
>>>     int error;
>>>     int root = 0;
>>>     int next, previous;
>>>     double start = 0;
>>>     double finish = 0;
>>>
>>>     MPI_Status status;
>>>
>>>     MPI_Init( &argc, &argv );
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     // error handler
>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>
>>>     do {
>>>         next = (rank + 1) % (size);
>>>         n++;
>>>
>>>         if(rank != 0){
>>>             previous = (rank - 1);
>>>         }else{
>>>             previous = size - 1;
>>>         }
>>>
>>>         if (rank =
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
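
For reference, the neighbour computation in the quoted ring.c (the `next` and `previous` assignments) can be sketched as two small helpers. This is only an illustration of the same arithmetic, not part of the original program: the names `ring_next` and `ring_prev` are hypothetical, and the sketch deliberately leaves out all MPI calls so it can be compiled on its own.

```c
#include <assert.h>

/* Hypothetical helper (not in the original ring.c):
 * rank that this process sends to in a ring of `size` processes. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Hypothetical helper (not in the original ring.c):
 * rank that this process receives from. Adding `size` before taking
 * the remainder keeps the result non-negative for rank 0, which is
 * what the if/else in the quoted code handles explicitly. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank - 1 + size) % size;
}
```

With `mpiexec -n 4 ring`, for example, rank 0 would send to rank 1 and receive from rank 3, which matches the values the quoted if/else produces.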