It depends on which network fails. If you lose all TCP connectivity,
Open MPI should abort the job, because the out-of-band system will
detect the loss of connection. If you only lose the MPI connection
(whether TCP or some other interconnect), then I believe the system
will eventually generate an error after it has retried sending the
message a specified number of times, though it may not abort the job.
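
If you would rather bound the retries in your own code than rely on
the library's internal limit, the "retry a fixed number of times, then
give up" idea can be sketched as below. This is only an illustration
under stated assumptions: try_send_fn, send_with_retries and flaky_send
are hypothetical names, not part of Open MPI. In a real program the
callback might wrap MPI_Send after calling
MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN), but whether a lost
connection actually comes back as an error code, rather than a hang or
an abort, depends on the Open MPI version and the interconnect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: performs one send attempt and returns 0 on
 * success, nonzero on failure. Wiring this to MPI_Send with the
 * MPI_ERRORS_RETURN handler is an assumption and is not shown here. */
typedef int (*try_send_fn)(void *ctx);

/* Try up to max_attempts times. Returns the number of attempts used on
 * success, or -1 if every attempt failed, in which case the caller can
 * treat the peer as unreachable and stop sending. */
int send_with_retries(try_send_fn try_send, void *ctx, int max_attempts)
{
    assert(try_send != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_send(ctx) == 0)
            return attempt;
    }
    return -1;
}

/* Stand-in for a flaky link, for demonstration only: fails until the
 * counter pointed to by ctx reaches zero, then succeeds. */
int flaky_send(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        --*remaining_failures;
        return 1;
    }
    return 0;
}
```

With this shape, "masterprocess" would call send_with_retries() and, on
a -1 result, mark "slaveprocess" as unreachable instead of retrying
indefinitely.
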
On Jul 22, 2009, at 10:55 PM, vipin kumar wrote:
> Are you asking to find out this information before issuing
> "mpirun"? Open MPI does assume that the nodes you are trying to use
> are reachable.
> The scenario is a pair of processes: one, say "masterprocess", runs on
> the "master" node, and the other, say "slaveprocess", runs on the
> "slave" node. When "masterprocess" needs the service of "slaveprocess",
> it sends a message to "slaveprocess", which serves the request. In case
> of a network failure (by any means), "masterprocess" will keep trying
> to send messages to "slaveprocess" without knowing that it is not
> reachable. So how should "masterprocess" find out that "slaveprocess"
> can't be reached, and stop attempting to send messages until the
> connection is back up?
> Thanks & Regards,
> Vipin K.
> Research Engineer,
> C-DOTB, India