Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Problem in oob/tcp
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-06-26 08:42:17


That behavior may have been there for a long time. When the OOB loses a connection, nothing is supposed to happen unless that connection is defined as a "lifeline". Remember, the OOB is not an MPI transport - it is there solely to handle support functions and therefore is not considered "mission critical". So losing an OOB connection isn't considered a "fatal" problem unless it is the connection to the "lifeline".

We define a lifeline solely for the case where a daemon dies and we need the local procs to "suicide" and mpirun to terminate the job. So I guess the question is: which connection failed? Was this a connection from a daemon back to mpirun?

Or were you running as a direct launch process - i.e., the connection was between two MPI procs that were launched via srun? If so, then there is no "lifeline" - if a connection drops, you are on your own. Not much we can do about that scenario as you really don't want to abort just because a non-critical connection fails.
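
The policy described above can be sketched in a few lines of C. This is only an illustration, not ORTE code: the type names, the `has_lifeline` flag, and the function `on_connection_lost` are all hypothetical and do not correspond to symbols in orte/oob/tcp. The sketch assumes each process knows whether it has a lifeline and, if so, which peer it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical action codes; illustrative names, not ORTE symbols. */
typedef enum {
    CONN_LOST_IGNORE,  /* non-critical link: mark it down and carry on */
    CONN_LOST_ABORT    /* lifeline lost: the local proc should terminate */
} conn_lost_action_t;

/* Hypothetical record for a remote peer reachable over the OOB. */
typedef struct {
    int  peer_id;
    bool connected;
} peer_t;

/* Hypothetical per-process state. */
typedef struct {
    bool has_lifeline;  /* false for direct launch, e.g. via srun */
    int  lifeline_id;   /* meaningful only when has_lifeline is true */
} proc_state_t;

/* Decide what to do when the connection to `peer` drops.
 * Only the loss of the lifeline is treated as fatal. */
conn_lost_action_t on_connection_lost(const proc_state_t *self, peer_t *peer)
{
    assert(self != NULL && peer != NULL);

    peer->connected = false;  /* the socket is gone either way */

    if (self->has_lifeline && peer->peer_id == self->lifeline_id) {
        return CONN_LOST_ABORT;
    }
    return CONN_LOST_IGNORE;
}
```

In this sketch a direct-launched process (`has_lifeline == false`) never gets `CONN_LOST_ABORT`, which matches the "you are on your own" case: the dropped connection is recorded, but nothing else happens.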

On Jun 26, 2012, at 1:09 AM, Ludovic.Hablot_at_[hidden] wrote:

> Version 1.6. But it's already there in 1.5.4.
>
> -----devel-bounces_at_[hidden] wrote: -----
> To: Open MPI Developers <devel_at_[hidden]>
> From: Ralph Castain
> Sent by: devel-bounces_at_[hidden]
> Date: 25/06/2012 17:57
> Subject: Re: [OMPI devel] Problem in oob/tcp
>
> What version?
>
> On Jun 25, 2012, at 9:53 AM, Ludovic.Hablot_at_[hidden] wrote:
>
>> Hi everybody,
>>
>> I'm facing a problem in orte/oob/tcp/, more specifically in the file oob_tcp_msg.c. Some network interruptions were making my program (a basic helloworld) hang instead of crash.
>>
>> I therefore reproduced the problem with gdb by simulating a read error (jumping from line 357 to line 367 of oob_tcp_msg.c). Open MPI then closes the socket, performs the shutdown, and then hangs.
>>
>> It seems that there is an exception callback function (mca_oob_tcp.oob_exception_callback) "planned" but not implemented yet.
>>
>> Any idea how to solve this problem? Or is this the expected behavior when we lose a connection? Did I miss anything?
>>
>> Thanks in advance,
>>
>> Ludovic
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>