Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] trac ticket 1944 and pending sends
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-24 10:06:54


I'm afraid that this solution doesn't pass the acid test - our reproducers
still lock up if we set the #frags to 1K and fifo size to p*that. In other
words, adding:

-mca btl_sm_free_list_max 1024 -mca btl_sm_fifo_size p*1024

where p=ppn still causes our reproducers to hang.

Sorry....sigh.

From: George Bosilca <bosilca_at_[hidden]>
> Date: June 24, 2009 12:46:28 AM MDT
> To: Open MPI Developers <devel_at_[hidden]>
> Subject: Re: [OMPI devel] trac ticket 1944 and pending sends
> Reply-To: Open MPI Developers <devel_at_[hidden]>
>
> In other words, as long as a queue is peer based (peer not peers), the
> management of the pending send list was doing what it was supposed to, and
> there was no possibility of deadlock. With the new code, as a third party
> can fill up a remote queue, getting a fragment back [as you stated] became a
> poor indicator for retry.
>
> I don't see how the proposed solution will solve the issue without a
> significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once before
> the fragment goes onto the pending list, reordering the fragments will not
> solve the issue. When the peer is overloaded, the fragments will end up on
> the pending list, and there is nothing to get them out of there except a
> message from the peer. In some cases, such a message might never be
> delivered, simply because the peer doesn't have any data to send us.
>
> The other solution is to always check all pending lists. While this might
> work, it will certainly add undesirable overhead to the send path.
>
> Your last patch was doing the right thing. Globally decreasing the size of
> the memory used by the MPI library is _the right_ way to go. Unfortunately,
> your patch only addresses this at the level of the shared memory file. Now,
> instead of using less memory we use even more because we have to store that
> data somewhere ... in the fragments returned by the btl_sm_alloc function.
> These fragments are allocated on demand and by default there is no limit to
> the number of such fragments.
>
> Here is a simple fix for both problems. Enforce a reasonable limit on the
> number of fragments in the BTL free list (1K should be more than enough),
> and make sure the fifo has a size equal to p *
> number_of_allowed_fragments_in_the_free_list, where p is the number of local
> processes. While this solution will certainly increase the size of the
> mapped file again, it will do so by a small margin compared with what is
> happening today in the code. Beyond that, it will solve
> the deadlock problem, by removing the inability to return a fragment. In
> addition, the PML is capable of handling such situations, so we're getting
> back to a deadlock-free sm BTL.
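
The counting argument behind this sizing can be sketched in a toy model. This is not Open MPI code: `BoundedFifo` and `flood_one_fifo` are hypothetical names, and the model assumes every FIFO entry carries exactly one fragment and each of the p local processes owns at most `max_frags` fragments. It only illustrates the worst case where every fragment in existence is written to the same FIFO.

```python
from collections import deque


class BoundedFifo:
    """Fixed-capacity queue standing in for a shared-memory FIFO (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def try_write(self, item):
        # A write fails when the FIFO is full, as in the congested case.
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(item)
        return True


def flood_one_fifo(p, max_frags, fifo_capacity):
    """All p processes push all of their fragments into one FIFO.

    Returns the number of writes that were rejected because the FIFO was full.
    """
    fifo = BoundedFifo(fifo_capacity)
    rejected = 0
    for rank in range(p):
        for frag in range(max_frags):
            if not fifo.try_write((rank, frag)):
                rejected += 1
    return rejected


p, max_frags = 4, 1024
print(flood_one_fifo(p, max_frags, p * max_frags))  # 0
print(flood_one_fifo(p, max_frags, max_frags))      # 3072
```

Under these assumptions a FIFO with p * max_frags slots can never reject a write, because no more than that many fragments exist. The model says nothing about behavior the assumptions leave out, which matters given that the reproducers still hang with these settings.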
>
> george.
>
>
> On Jun 23, 2009, at 11:04, Eugene Loh wrote:
>
> The sm BTL used to have two mechanisms for dealing with congested FIFOs.
> One was to grow the FIFOs. Another was to queue pending sends locally (on
> the sender's side). I think the grow-FIFO mechanism was typically invoked
> and the pending-send mechanism used only under extreme circumstances (no
> more memory).
>
>
> With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The
> code added complexity and there seemed to be no need to have two mechanisms
> to deal with congested FIFOs. In ticket 1944, however, we see that repeated
> collectives can produce hangs, and this seems to be due to the pending-send
> code not adequately dealing with congested FIFOs.
>
>
> Today, when a process tries to write to a remote FIFO and fails, it queues
> the write as a pending send. The only condition under which it retries
> pending sends is when it gets a fragment back from a remote process.
>
>
> I think the logic must have been that the FIFO got congested because we
> issued too many sends. Getting a fragment back indicates that the remote
> process has made progress digesting those sends. In ticket 1944, we see
> that a FIFO can also get congested from too many returning fragments.
> Further, with shared FIFOs, a FIFO could become congested due to the
> activity of a third-party process.
>
>
> In sum, getting a fragment back from a remote process is a poor indicator
> that it's time to retry pending sends.
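
A small simulation makes the third-party case concrete. It is a toy model, not the sm BTL: `Fifo`, `Sender`, and `scenario` are hypothetical names, and the retry rule is reduced to "retry pending sends only when one of my own fragments comes back".

```python
from collections import deque


class Fifo:
    """Fixed-capacity shared FIFO owned by the receiving process (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def try_write(self, item):
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(item)
        return True

    def drain(self):
        # The receiver consumes everything currently in the FIFO.
        items = list(self.slots)
        self.slots.clear()
        return items


class Sender:
    def __init__(self, name, fifo):
        self.name = name
        self.fifo = fifo
        self.pending = deque()

    def send(self, msg):
        # A failed write is queued locally as a pending send.
        if not self.fifo.try_write((self.name, msg)):
            self.pending.append(msg)

    def on_fragment_returned(self):
        # The only place pending sends are retried in this model.
        while self.pending and self.fifo.try_write((self.name, self.pending[0])):
            self.pending.popleft()


def scenario():
    fifo = Fifo(capacity=2)          # receiver B's shared FIFO
    a, c = Sender("A", fifo), Sender("C", fifo)
    c.send("c0")
    c.send("c1")                     # third party C fills B's FIFO
    a.send("a0")                     # A's write fails and becomes pending
    owners = [src for src, _ in fifo.drain()]   # B digests C's sends
    for sender in (a, c):
        for _ in range(owners.count(sender.name)):
            sender.on_fragment_returned()       # fragments go back to owners
    return len(a.pending), len(fifo.slots)


print(scenario())  # (1, 0)
```

The FIFO ends up empty, yet A's send is still pending: every returned fragment belonged to C, so A never saw the event that triggers a retry.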
>
>
> Maybe the real way to know when to retry pending sends is just to check if
> there's room on the FIFO.
>
>
> So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if
> there are pending sends. If so, it'll retry them before performing the
> requested write. This should also help preserve ordering a little better.
> I'm guessing this will not hurt our message latency in any meaningful way,
> but I'll check this out.
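
The proposed change can be sketched as follows. This is a toy model under stated assumptions, not the actual MCA_BTL_SM_FIFO_WRITE macro: `Fifo`, `Sender.fifo_write`, and `demo` are hypothetical names chosen for illustration.

```python
from collections import deque


class Fifo:
    """Fixed-capacity FIFO standing in for a remote sm FIFO (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def try_write(self, item):
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(item)
        return True

    def drain(self):
        self.slots.clear()


class Sender:
    def __init__(self, fifo):
        self.fifo = fifo
        self.pending = deque()

    def _retry_pending(self):
        # Flush older sends first, stopping as soon as the FIFO is full again.
        while self.pending and self.fifo.try_write(self.pending[0]):
            self.pending.popleft()

    def fifo_write(self, msg):
        self._retry_pending()
        # If older sends are still waiting, queue behind them to keep ordering.
        if self.pending or not self.fifo.try_write(msg):
            self.pending.append(msg)


def demo():
    fifo = Fifo(capacity=2)
    fifo.try_write("x0")
    fifo.try_write("x1")             # another process has filled the FIFO
    a = Sender(fifo)
    a.fifo_write("a0")               # FIFO full: a0 becomes a pending send
    stuck = list(a.pending)
    fifo.drain()                     # the receiver makes room
    a.fifo_write("a1")               # a0 is retried first, then a1 is written
    return stuck, list(fifo.slots), list(a.pending)


print(demo())  # (['a0'], ['a0', 'a1'], [])
```

In this model the retry only fires when the sender issues another write. A sender that queues a fragment and then has nothing further to send would leave it on the pending list.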
>
>
> Meanwhile, I wanted to check in with y'all for any guidance you might have.
>
> _______________________________________________
>
> devel mailing list
>
> devel_at_[hidden]
>
> http://www.open-mpi.org/mailman/listinfo.cgi/devel