
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] trac ticket 1944 and pending sends
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-06-23 13:23:02


George Bosilca wrote:

> On Jun 23, 2009, at 11:04, Eugene Loh wrote:
>
>> The sm BTL used to have two mechanisms for dealing with congested
>> FIFOs. One was to grow the FIFOs. Another was to queue pending
>> sends locally (on the sender's side). I think the grow-FIFO
>> mechanism was typically invoked and the pending-send mechanism used
>> only under extreme circumstances (no more memory).
>>
>> With the sm makeover of 1.3.2, we dropped the ability to grow
>> FIFOs. The code added complexity and there seemed to be no need to
>> have two mechanisms to deal with congested FIFOs. In ticket 1944,
>> however, we see that repeated collectives can produce hangs, and
>> this seems to be due to the pending-send code not adequately dealing
>> with congested FIFOs.
>>
>> Today, when a process tries to write to a remote FIFO and fails, it
>> queues the write as a pending send. The only condition under which
>> it retries pending sends is when it gets a fragment back from a
>> remote process.
>>
>> I think the logic must have been that the FIFO got congested because
>> we issued too many sends. Getting a fragment back indicates that
>> the remote process has made progress digesting those sends. In
>> ticket 1944, we see that a FIFO can also get congested from too many
>> returning fragments. Further, with shared FIFOs, a FIFO could
>> become congested due to the activity of a third-party process.
>>
>> In sum, getting a fragment back from a remote process is a poor
>> indicator that it's time to retry pending sends.
>>
>> Maybe the real way to know when to retry pending sends is just to
>> check if there's room on the FIFO.
>
> Why is this different from "getting a fragment back"?

I'm not sure I understand your question.

Say we have two processes, A and B. Each one has a receive queue/FIFO
that can be written by its peer. Let's say A sends lots of messages to
B, and B keeps returning fragments to A. So, although A is the one
sending lots of messages, it is A's in-bound queue that fills up.
Kind of counterintuitive. Anyhow, B keeps getting more fragments to
return to A. Since A's queue is full, B ends up adding these fragments
to its (B's) own pending-send list.
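
To make that concrete, here is a rough sketch of the send path in C.
All of the names (fifo_write(), pending_list_append(), endpoint_t, and
so on) are made up for illustration; they are not the actual sm BTL
identifiers.

    /* Illustrative types only -- the real sm BTL structures differ. */
    typedef struct fifo         fifo_t;          /* a receive FIFO          */
    typedef struct frag         frag_t;          /* a message fragment      */
    typedef struct pending_list pending_list_t;  /* sender-side backlog     */

    typedef struct {
        fifo_t         *peer_fifo;      /* A's receive FIFO, written by B */
        pending_list_t *pending_sends;  /* B's locally queued fragments   */
    } endpoint_t;

    /* Assumed helpers (hypothetical prototypes). */
    int  fifo_write(fifo_t *fifo, frag_t *frag);            /* 0 on success */
    void pending_list_append(pending_list_t *list, frag_t *frag);

    /* When B cannot write into A's (full) FIFO, the fragment is parked on
     * B's own pending-send list rather than being dropped. */
    void send_or_queue(endpoint_t *ep, frag_t *frag)
    {
        if (fifo_write(ep->peer_fifo, frag) != 0) {
            pending_list_append(ep->pending_sends, frag);
        }
    }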

So, now the question is when B should retry items on its pending-send
list. Presumably, it should retry when there is room on A's
queue/FIFO. But OMPI (to date) has B retry *only* when B itself gets a
fragment back. What's the logic? I assume the logic was that A's queue
was filled with fragments that B had sent, so getting a fragment back
would be an indication of A's queue opening up.
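
In sketch form (same made-up names as above), the retry today hangs
entirely off the path that handles a fragment coming back from A:

    /* Hypothetical sketch of the current trigger: pending sends are
     * revisited only when the peer returns a fragment to us. */
    typedef struct endpoint endpoint_t;
    typedef struct frag     frag_t;

    void frag_return_to_free_list(frag_t *frag);   /* assumed helper */
    void retry_pending_sends(endpoint_t *ep);      /* assumed helper */

    void handle_returned_fragment(endpoint_t *ep, frag_t *frag)
    {
        frag_return_to_free_list(frag);   /* a fragment comes back to B... */
        retry_pending_sends(ep);          /* ...and only here is B's
                                           * pending-send list retried.    */
    }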

Why is this a poor indication? (I'm assuming this is what your question
was.) Two possible reasons:

1) A's queue might have been filled with fragments that B was returning
to A, so B would get no acknowledgement back from A that progress was
being made in draining the queue.

2) (New with OMPI 1.3.2, now that we have shared queues): A's queue
might have been filled with activity from third party processes.

In either case, the only way B now knows whether there is room on A's
queue is... to check the queue itself for room! Nothing is coming back
from A to indicate that the queue is being drained.
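
Again as a sketch (same invented names; fifo_has_room() is a
hypothetical predicate, not an existing call), the retry decision would
look at the FIFO itself:

    /* Hypothetical sketch: retry pending sends whenever A's FIFO has room,
     * rather than waiting for a fragment to come back from A. */
    #include <stdbool.h>

    typedef struct fifo         fifo_t;
    typedef struct frag         frag_t;
    typedef struct pending_list pending_list_t;

    typedef struct {
        fifo_t         *peer_fifo;      /* A's receive FIFO             */
        pending_list_t *pending_sends;  /* B's locally queued fragments */
    } endpoint_t;

    /* Assumed helpers (hypothetical prototypes). */
    bool    fifo_has_room(fifo_t *fifo);
    int     fifo_write(fifo_t *fifo, frag_t *frag);          /* 0 on success */
    bool    pending_list_empty(pending_list_t *list);
    frag_t *pending_list_pop(pending_list_t *list);
    void    pending_list_push_front(pending_list_t *list, frag_t *frag);

    void maybe_retry_pending(endpoint_t *ep)
    {
        while (!pending_list_empty(ep->pending_sends) &&
               fifo_has_room(ep->peer_fifo)) {
            frag_t *frag = pending_list_pop(ep->pending_sends);
            if (fifo_write(ep->peer_fifo, frag) != 0) {
                /* Lost a race (e.g. a third party filled the shared FIFO);
                 * put the fragment back at the head and stop for now. */
                pending_list_push_front(ep->pending_sends, frag);
                break;
            }
        }
    }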

> As far as I remember the code, when we get a fragment back we add it
> back in the LIFO, and therefore it becomes the next available fragment
> for a send.

Yes, indeed, but I don't understand how this is relevant. The LIFOs
(the private free lists where processes maintain unused fragments) don't
really enter this discussion.

>> So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by
>> checking if there are pending sends. If so, it'll retry them before
>> performing the requested write. This should also help preserve
>> ordering a little better. I'm guessing this will not hurt our
>> message latency in any meaningful way, but I'll check this out.
>>
>> Meanwhile, I wanted to check in with y'all for any guidance you
>> might have.
>
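
For concreteness, here is roughly the shape of the change, using the
same invented helper names as the earlier sketches and written as a
plain function rather than in the macro form the code actually uses
(MCA_BTL_SM_FIFO_WRITE):

    /* Hypothetical sketch of the proposed change: drain the pending-send
     * backlog first, then attempt the requested write, queueing it behind
     * any remaining backlog so ordering is better preserved. */
    typedef struct fifo         fifo_t;
    typedef struct frag         frag_t;
    typedef struct pending_list pending_list_t;

    typedef struct {
        fifo_t         *peer_fifo;      /* A's receive FIFO             */
        pending_list_t *pending_sends;  /* B's locally queued fragments */
    } endpoint_t;

    /* Assumed helpers (hypothetical prototypes). */
    void maybe_retry_pending(endpoint_t *ep);        /* as sketched earlier */
    int  fifo_write(fifo_t *fifo, frag_t *frag);      /* 0 on success       */
    int  pending_list_empty(pending_list_t *list);    /* nonzero if empty   */
    void pending_list_append(pending_list_t *list, frag_t *frag);

    int sm_fifo_write_with_retry(endpoint_t *ep, frag_t *frag)
    {
        maybe_retry_pending(ep);       /* first give the backlog a chance */

        if (!pending_list_empty(ep->pending_sends) ||
            fifo_write(ep->peer_fifo, frag) != 0) {
            /* Either older sends are still queued (don't overtake them),
             * or the FIFO is still full: queue this fragment too. */
            pending_list_append(ep->pending_sends, frag);
            return 1;   /* queued locally as a pending send */
        }
        return 0;       /* written into the peer's FIFO */
    }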