Open MPI User's Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-02-19 11:26:02


Abhishek,

What you are trying to do is not exactly supported by the MPI
standard. If a process in an MPI communicator is killed (by a node
failure, a 'kill' command, a segmentation fault, or some other
unexpected failure) while you are blocking in an MPI call, you are
not guaranteed to receive an error. So in the case you cite:

> --------------------------
> val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
> newcomm[i], &stat[i]);
> if (val != MPI_SUCCESS )
> printf("Manager: error in Recv\n");
>
> --------------------------

You are using MPI_ANY_SOURCE and MPI_ANY_TAG, so it is reasonable for
the MPI_RECV to continue blocking, since we could receive a message
from another process in the communicator.

Since fault tolerance is not in the MPI standard, when a process
exits unexpectedly the state of the MPI library is undefined by the
standard. Some MPI implementations will not allow you to call back
into them, others will allow you to with very limited functionality
(you may be able to only call MPI_FINALIZE), and others will allow
you to use it with no limitations.

There are implementations of MPI that allow for various degrees of
process fault tolerance (many of them are active contributors to the
Open MPI project). For instance, the FT-MPI style of fault tolerance
(http://icl.cs.utk.edu/ftmpi/) allows an MPI program to continue
execution even if one process in the communicator fails. We are
working on integrating this style (and a few other styles) of fault
tolerance into Open MPI.

There is another model of fault tolerance in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance. William
Gropp and Ewing Lusk wrote a good description of this in their 2004
paper "Fault Tolerance in Message Passing Interface Programs"
(http://dx.doi.org/10.1177/1094342004046045), if you are interested
in pursuing this type of program.

So, in short, MPI_Recv is behaving as it should in this situation,
since it could be waiting for other processes in the communication
group to send a message. If you need to support program continuation
even in the face of single-process failures, take a look at the
dynamic process manager-worker model, or explore FT-MPI's API for
dealing with process loss in a communication group.

I hope this helps, good luck!

Josh

On Feb 16, 2006, at 10:11 AM, Abhishek Agarwal wrote:

> Hello All,
>
> I am trying to use the MPI_Recv of Open MPI, but have run into some
> problems with it.
>
> I have two processes in master-slave mode. I killed the slave process,
> but my MPI_Recv call is still waiting for a response from the slave
> and never times out with an error. I am checking for MPI_SUCCESS, but
> it seems to wait forever, and hence the program hangs.
>
> I am attaching the section of code which I have used in my program.
>
>
> --------------------------
> val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
> newcomm[i], &stat[i]);
> if (val != MPI_SUCCESS )
> printf("Manager: error in Recv\n");
>
> --------------------------
>
> Any advice?
>
> Thanks,
>
> Abhishek Agarwal

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/