I'm not an expert in C or Open MPI, but I'm a volunteer.
Ralph Castain wrote:
> Sorry for the delayed response - had some things to finish, then had to
> stare at this code for a while.
> Unfortunately, the OOB is a snarled can of hideous worms. It looks to
> me as though the OOB continues to attempt to complete any pending message
> requests once it detects that retries have exceeded the limit. In
> doing so, it appears to trigger pending events, which would
> include pending sends - thus causing it to emit that error message again.
> I can't swear to any of this, of course - the worms are really deep
> and tangled down there.
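A minimal sketch in C of one possible way to avoid the repeated message. It is a toy model, not the actual OOB-TCP code: every name in it (toy_peer_t, toy_peer_send, toy_peer_shutdown, TOY_ERR_UNREACH and so on) is hypothetical and only mirrors the behaviour described above. The assumption is that the peer is marked as failed before its pending sends are completed, so that any event triggered while draining the queue is rejected quietly instead of starting a new round of retries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_MAX_RETRIES 3   /* hypothetical retry limit */
#define TOY_MAX_PENDING 8   /* hypothetical queue capacity */

enum toy_status { TOY_OK = 0, TOY_ERR_UNREACH = -1, TOY_ERR_FULL = -2 };
enum toy_state  { TOY_PEER_CONNECTED, TOY_PEER_FAILED };

typedef struct toy_peer toy_peer_t;
/* Completion callback: invoked once per queued message. */
typedef void (*toy_send_cb)(toy_peer_t *peer, int msg_id, int status);

struct toy_peer {
    enum toy_state state;
    int retries;            /* failed connection attempts so far */
    int errors_printed;     /* how many times the error was reported */
    int resend_rejected;    /* re-sends refused after the failure */
    int npending;
    int pending_id[TOY_MAX_PENDING];
    toy_send_cb pending_cb[TOY_MAX_PENDING];
};

void toy_peer_init(toy_peer_t *p)
{
    assert(p != NULL);
    p->state = TOY_PEER_CONNECTED;
    p->retries = p->errors_printed = p->resend_rejected = p->npending = 0;
}

/* Queue a message. A failed peer rejects it at once, without retrying. */
int toy_peer_send(toy_peer_t *p, int msg_id, toy_send_cb cb)
{
    assert(p != NULL);
    if (p->state == TOY_PEER_FAILED) return TOY_ERR_UNREACH;
    if (p->npending == TOY_MAX_PENDING) return TOY_ERR_FULL;
    p->pending_id[p->npending] = msg_id;
    p->pending_cb[p->npending] = cb;
    p->npending++;
    return TOY_OK;
}

/* Example callback that tries to send again, like a triggered pending event. */
void toy_retry_cb(toy_peer_t *p, int msg_id, int status)
{
    if (status != TOY_OK && toy_peer_send(p, msg_id, toy_retry_cb) != TOY_OK)
        p->resend_rejected++;
}

/* Mark the peer failed FIRST, report once, then fail every pending send. */
void toy_peer_shutdown(toy_peer_t *p)
{
    assert(p != NULL);
    if (p->state == TOY_PEER_FAILED) return;   /* already handled */
    p->state = TOY_PEER_FAILED;
    fprintf(stderr, "toy-oob: retries exceeded, peer marked unreachable\n");
    p->errors_printed++;

    int n = p->npending;
    p->npending = 0;                           /* queue is now empty */
    for (int i = 0; i < n; i++)
        if (p->pending_cb[i] != NULL)
            p->pending_cb[i](p, p->pending_id[i], TOY_ERR_UNREACH);
}

/* Record one failed connection attempt; shut down at the limit. */
void toy_peer_connect_failed(toy_peer_t *p)
{
    assert(p != NULL);
    if (p->state == TOY_PEER_FAILED) return;
    if (++p->retries >= TOY_MAX_RETRIES) toy_peer_shutdown(p);
}
```

In this toy model, calling toy_peer_connect_failed() more times than the limit reports the error only once, and the re-sends attempted from the callbacks come back with TOY_ERR_UNREACH. Whether the real OOB can be changed this way is something I cannot confirm without reading more of the code.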
> A rewrite of the OOB is planned for next year - hopefully, the last of
> the spaghetti to be unraveled. Not sure if that will really happen,
> though, as I think everyone is afraid of that black hole of despair.
> If it does, this is one thing we can try to address.
> Any volunteers??
> On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:
>> Hi All,
>> I'm doing some experiments and modifications in my heartbeat code,
>> which uses the OOB-TCP communication channel.
>> My modified orteds and orterun do not abort all processes when one
>> orted dies.
>> The problem is:
>> 1) I kill an orted, so another orted detects the fault when it tries to
>> send a heartbeat to the faulty orted.
>> 2) The RTE gets stable again, but the orted which sent the
>> heartbeat prints the following oob-tcp message:
>> "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
>> retries exceeded. Can not communicate with peer"
>> And the questions are:
>> a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it
>> discards this peer, no?
>> b) The message is removed from the queue with ORTE_ERR_UNREACH code, no?
>> c) Why, after the retries are exceeded, does the orted continue to print this message?
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478