Sorry for the delayed response - had some things to finish, then had
to stare at this code for a while.
Unfortunately, the OOB is a snarled can of hideous worms. It looks to
me as though the OOB continues to attempt to complete any pending
message requests once it detects that retries have exceeded the limit.
In doing so, it appears to trigger pending events, which would include
pending sends - thus causing it to emit that error again.
I can't swear to any of this, of course - the worms are really deep
and tangled down there.
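To make that guess concrete, here is a toy model in Python of the
sequence I suspect. It is not taken from the OOB source: ToyPeer,
RETRY_LIMIT, drain_quietly, and every method name are invented for
illustration, and the real code paths may well differ.

```python
# Toy model only: none of these names exist in Open MPI.
RETRY_LIMIT = 3  # hypothetical retry cap


class ToyPeer:
    """A dead peer with queued sends; connection attempts always fail."""

    def __init__(self, drain_quietly):
        self.drain_quietly = drain_quietly
        self.retries = 0
        self.failed = False
        self.pending = []    # sends waiting for a connection
        self.completed = []  # (message, status) pairs handed back to callers
        self.log = []        # stands in for the oob-tcp error output

    def on_send_event(self):
        """Runs whenever a pending send gets a chance to go out."""
        if self.failed or self.retries >= RETRY_LIMIT:
            self.log.append("Communication retries exceeded")
            self.shutdown()
        else:
            self.retries += 1  # the attempt fails because the peer is gone

    def shutdown(self):
        """Fail every queued message as unreachable."""
        self.failed = True
        while self.pending:
            msg = self.pending.pop(0)
            self.completed.append((msg, "unreachable"))
            # Suspected behaviour: completing one request wakes the
            # remaining pending sends, which hit the error path again.
            if not self.drain_quietly and self.pending:
                self.on_send_event()


def run(drain_quietly, queued=3):
    peer = ToyPeer(drain_quietly)
    peer.pending = [f"heartbeat-{i}" for i in range(queued)]
    for _ in range(RETRY_LIMIT + 1):  # last call trips the limit
        peer.on_send_event()
    return peer


noisy = run(drain_quietly=False)
quiet = run(drain_quietly=True)
print(len(noisy.log), len(quiet.log))
```

With three queued heartbeats, the first run logs the error three times
(once per pending message) and the second logs it once, although both
fail every message as unreachable. If the real OOB behaves like the
first run, draining the queue without re-firing send events would be
the thing to try.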
A rewrite of the OOB is planned for next year - hopefully, the last of
the spaghetti to be unraveled. Not sure if that will really happen,
though, as I think everyone is afraid of that black hole of despair.
If it does, this is one thing we can try to address.
On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:
> Hi All,
> I'm doing some experiments and modifications in my heartbeat code,
> which uses the OOB-TCP communication channel.
> My modified orteds and orterun do not abort all processes when one
> orted dies.
> The problem is:
> 1) I kill an orted, and another orted detects the fault when it
> tries to send a heartbeat to the faulty orted.
> 2) The RTE becomes stable again, but the orted that sent the
> heartbeat prints the following oob-tcp message:
> "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
> retries exceeded. Can not communicate with peer"
> And the question is:
> a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it
> discards this peer, no?
> b) The message is removed from the queue with ORTE_ERR_UNREACH code,
> c) Why, after the retries are exceeded, does the orted continue to
> print this message?
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edifcio Q, QC/3088
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478