Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OOB-TCP Retries
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-30 15:42:28


I'm not an expert in either C or Open MPI, but I'm volunteering.

Leonardo

Ralph Castain wrote:
> Sorry for the delayed response - had some things to finish, then had to
> stare at this code for a while.
>
> Unfortunately, the OOB is a snarled can of hideous worms. It looks to
> me that the OOB continues to attempt to complete any pending message
> requests once it detects that retries have exceeded the limit. In
> doing so, it looks like it triggers pending events, which would
> include pending sends - thus causing it to again emit that error message.
>
> I can't swear to any of this, of course - the worms are really deep
> and tangled down there.
>
> A rewrite of the OOB is planned for next year - hopefully, the last of
> the spaghetti to be unraveled. Not sure if that will really happen,
> though, as I think everyone is afraid of that black hole of despair.
> If it does, this is one thing we can try to address.
>
> Any volunteers??
>
> Ralph
>
>
> On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:
>
>> Hi All,
>>
>> I'm doing some experiments and modifications in my heartbeat code,
>> which uses the OOB-TCP communication channel.
>>
>> My modified orteds and orterun do not abort all processes when one
>> orted dies.
>>
>> The problem is:
>>
>> 1) I kill an orted, so another orted detects the fault when it tries to
>> send a heartbeat to the faulty orted.
>>
>> 2) The RTE becomes stable again, but the orted that sent the
>> heartbeat prints the following oob-tcp message:
>> "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
>> retries exceeded. Can not communicate with peer"
>>
>> And the questions are:
>>
>> a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown it
>> discards this peer, no?
>>
>> b) The message is removed from the queue with ORTE_ERR_UNREACH code, no?
>>
>> c) Why, after the retries are exceeded, does the orted continue to print this message?
>>
>> Thanks,
>> --
>>
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> http://www.caos.uab.es
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
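
For readers following the thread, here is a minimal toy sketch of the behavior questions (a) to (c) ask about: once the retry limit is exceeded, the peer is marked failed, queued sends are completed with an "unreachable" code, and the error line is printed only once. This is not the Open MPI OOB code. All names (toy_peer_t, toy_peer_shutdown, toy_peer_retry, toy_peer_send), the retry limit, and the ERR_UNREACH value are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 3       /* assumed retry limit, not Open MPI's value */
#define ERR_UNREACH (-1)    /* hypothetical stand-in for ORTE_ERR_UNREACH */
#define SEND_QUEUED 0       /* hypothetical "accepted into the queue" code */

typedef struct {
    bool failed;        /* set once the peer has been given up on */
    int  retries;       /* failed connection attempts so far */
    int  pending;       /* sends still waiting in the queue */
    int  unreach_count; /* sends completed with ERR_UNREACH */
    int  log_count;     /* times the "retries exceeded" line was printed */
} toy_peer_t;

/* Give up on the peer: log once, then fail every queued send. */
static void toy_peer_shutdown(toy_peer_t *peer)
{
    assert(peer != NULL);
    if (peer->failed) {
        return;                   /* already discarded: stay silent */
    }
    peer->failed = true;
    peer->log_count++;
    fprintf(stderr, "toy oob: retries exceeded, peer unreachable\n");
    while (peer->pending > 0) {
        peer->pending--;
        peer->unreach_count++;    /* this send completes with ERR_UNREACH */
    }
}

/* Record one failed connection attempt, as an event callback might. */
static void toy_peer_retry(toy_peer_t *peer)
{
    assert(peer != NULL);
    if (peer->failed) {
        return;                   /* late events for a dead peer are ignored */
    }
    if (++peer->retries > MAX_RETRIES) {
        toy_peer_shutdown(peer);
    }
}

/* Queue a send, or reject it at once if the peer is already marked failed. */
static int toy_peer_send(toy_peer_t *peer)
{
    assert(peer != NULL);
    if (peer->failed) {
        peer->unreach_count++;
        return ERR_UNREACH;
    }
    peer->pending++;
    return SEND_QUEUED;
}
```

In this sketch, the guard on peer->failed is what keeps later pending events and later sends from printing the message again. Whether the real OOB lacks such a guard is exactly the open question in this thread, so treat the sketch as a description of the desired behavior rather than a diagnosis.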