Open MPI Development Mailing List Archives

Subject: [OMPI devel] OOB-TCP Retries
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-17 13:02:17

Hi All,

I'm doing some experiments and modifications in my heartbeat code, which
uses the OOB-TCP communication channel.

My modified orteds and orterun do not abort all processes when one
orted dies.

The problem is:

1) I kill an orted, so another orted detects the fault when it tries to
send a heartbeat to the faulty orted.

2) The RTE becomes stable again, but the orted which sent the heartbeat
prints the following oob-tcp message:
"[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
retries exceeded. Can not communicate with peer"

And the questions are:

a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it
discards this peer, no?

b) The message is removed from the queue with ORTE_ERR_UNREACH code, no?

c) Why, after the retries are exceeded, does the orted continue to print
this message?
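
To make the behaviour I would expect concrete, here is a minimal sketch.
It is not the actual oob-tcp code: peer_t, send_heartbeat, MAX_RETRIES
and ERR_UNREACH are hypothetical names made up for illustration, and
ERR_UNREACH only stands in for ORTE_ERR_UNREACH. The assumption is that
once the retry limit is exceeded the peer is marked as failed, the
warning is printed once, and every later send fails with the unreachable
code without printing again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 3      /* hypothetical retry limit */
#define SEND_OK     0
#define ERR_UNREACH (-1)   /* stand-in for ORTE_ERR_UNREACH */

/* Hypothetical per-peer state, not the real mca_oob_tcp_peer_t. */
typedef struct {
    int  retries;   /* failed attempts since the last success */
    bool failed;    /* set once the retry limit is exceeded */
    bool alive;     /* simulated: is the remote orted running? */
    int  warnings;  /* how many times the warning was printed */
} peer_t;

/* Try to deliver one heartbeat; each call counts as one attempt. */
static int send_heartbeat(peer_t *peer)
{
    assert(peer != NULL);

    if (peer->failed) {
        /* Peer was already discarded: fail without printing again. */
        return ERR_UNREACH;
    }
    if (peer->alive) {
        peer->retries = 0;
        return SEND_OK;
    }
    if (++peer->retries > MAX_RETRIES) {
        /* Limit exceeded: discard the peer and warn exactly once. */
        peer->failed = true;
        peer->warnings++;
        fprintf(stderr, "oob-tcp: Communication retries exceeded. "
                        "Can not communicate with peer\n");
    }
    return ERR_UNREACH;
}
```

With this model, sending ten heartbeats to a dead peer returns
ERR_UNREACH every time but prints the warning only once, which is what I
expected from the orted.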


Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
Phone: +34-93-581-2888
Fax: +34-93-581-2478