It may have been there for a long time. When the OOB loses a connection, nothing is supposed to happen unless that connection is defined as a "lifeline". Remember, the OOB is not an MPI transport - it is there solely to handle support functions and therefore is not considered "mission critical". So losing an OOB connection isn't considered a "fatal" problem unless it is the connection to the "lifeline".

We define a lifeline solely for the case where a daemon dies and we need the local procs to "suicide" and mpirun to terminate the job. So I guess the question is: which connection failed? Was this a connection from a daemon back to mpirun?

Or were you running as a direct launch process - i.e., the connection was between two MPI procs that were launched via srun? If so, then there is no "lifeline" - if a connection drops, you are on your own. Not much we can do about that scenario as you really don't want to abort just because a non-critical connection fails.
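
To make the policy above concrete, here is a minimal sketch in C of the decision it describes. It is not the actual ORTE code: the type names, enum values, and the function on_connection_lost() are all hypothetical, and the daemon case (a daemon that loses its connection back to mpirun simply exits) is an assumption on top of what is stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical roles, for illustration only; these are not ORTE types. */
typedef enum {
    ROLE_MPIRUN,       /* the job launcher */
    ROLE_DAEMON,       /* a per-node daemon started by mpirun */
    ROLE_APP_PROC,     /* an MPI proc started by a local daemon */
    ROLE_DIRECT_PROC   /* an MPI proc launched directly, e.g. via srun */
} role_t;

/* Hypothetical reactions to a lost OOB connection. */
typedef enum {
    ACTION_IGNORE,        /* non-critical: drop the connection, keep running */
    ACTION_ABORT_SELF,    /* the process "suicides" */
    ACTION_TERMINATE_JOB  /* mpirun terminates the whole job */
} action_t;

typedef struct {
    role_t self;       /* who noticed the lost connection */
    bool is_lifeline;  /* was that connection defined as a lifeline? */
} lost_conn_t;

/* Decide what to do when an OOB connection drops. */
action_t on_connection_lost(const lost_conn_t *c)
{
    assert(c != NULL);

    /* Direct launch has no lifeline at all, and a non-lifeline
     * connection is never fatal. */
    if (c->self == ROLE_DIRECT_PROC || !c->is_lifeline) {
        return ACTION_IGNORE;
    }

    switch (c->self) {
    case ROLE_MPIRUN:
        /* A daemon died: terminate the job. */
        return ACTION_TERMINATE_JOB;
    case ROLE_APP_PROC:
        /* The local daemon died: the proc aborts itself. */
        return ACTION_ABORT_SELF;
    case ROLE_DAEMON:
        /* Assumption: a daemon cut off from mpirun exits as well. */
        return ACTION_ABORT_SELF;
    default:
        return ACTION_IGNORE;
    }
}
```

With this sketch, a proc launched via srun always gets ACTION_IGNORE, which matches the "you are on your own" case: the connection is dropped and nothing else happens.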

On Jun 26, 2012, at 1:09 AM, wrote:

Version 1.6. But it's already there in 1.5.4.

wrote: -----
To: Open MPI Developers <>
From: Ralph Castain
Sent by:
Date: 25/06/2012 17:57
Subject: Re: [OMPI devel] Problem in oob/tcp

What version?

On Jun 25, 2012, at 9:53 AM, wrote:

Hi everybody,

I'm facing a problem in orte/oob/tcp/, more particularly in the file oob_tcp_msg.c. Some network interruptions were making my program (a basic helloworld) hang instead of crashing.

So I reproduced the problem with gdb by simulating a read error (jumping from line 357 to line 367 of oob_tcp_msg.c). Open MPI then closes the socket, performs the shutdown, and hangs.

It seems that there is an exception callback function (mca_oob_tcp.oob_exception_callback) "planned" but not implemented yet.

Any idea on how to solve this problem? Or is this the expected behavior when we lose a connection? Did I miss anything?

Thanks in advance,

devel mailing list
