Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] sm BTL flow management
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-06-26 11:59:40

As Terry described, and based on the patch attached to the ticket on
trac, the extra goto slipped into the commit by mistake. It belongs
to a totally different patch for shared memory I'm working on. I'll
remove it.


On Jun 26, 2009, at 06:52 , Terry Dontje wrote:

> Eugene Loh wrote:
>> Brian W. Barrett wrote:
>>> All -
>>> Jeff, Eugene, and I had a long discussion this morning on the sm
>>> BTL flow management issues and came to a couple of conclusions.
>>> * Jeff, Eugene, and I are all convinced that Eugene's addition of
>>> polling the receive queue to drain acks when sends start backing
>>> up is required for deadlock avoidance.
>>> * We're also convinced that George's proposal, while a good idea
>>> in general, is not sufficient. The send path doesn't appear to
>>> sufficiently progress the btl to avoid the deadlocks we're seeing
>>> with the SM btl today. Therefore, while I still recommend sizing
>>> the fifo appropriately and limiting the freelist size, I think
>>> it's not sufficient to solve all problems.
>>> * Finally, it took an hour, but we did determine one of the major
>>> differences between 1.2.8 and 1.3.0 in terms of sm is how messages
>>> were pulled off the FIFO. In 1.2.8 (and all earlier versions), we
>>> return from btl_progress after a single message is received (ack
>>> or message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene
>>> did), we changed to completely draining all queues before
>>> returning from btl_progress. This has led to a situation where a
>>> single call to btl_progress can make a large number of callbacks
>>> into the PML (900,000 times in one of Eugene's test cases). The
>>> change was made to resolve an issue Terry was having with
>>> performance of a benchmark. We've decided that it would be
>>> advantageous to try something between the two points and drain X
>>> number of messages from the queue, then return, where X is 100 or
>>> so at most. This should cover the performance issues Terry saw,
>>> but still not cause the huge number of messages added to the
>>> unexpected queue with a single call to MPI_Recv. Since a recv
>>> that is matched on the unexpected queue doesn't result in a call
>>> to opal_progress, this should help balance the load a little bit
>>> better. Eugene's going to take a stab at implementing this short
>>> term.
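
[Editor's sketch] The "drain at most X messages, then return" idea above could look roughly like the following. This is a minimal illustration under stated assumptions, not the sm BTL code: the ring buffer `demo_fifo_t`, the callback type `deliver_fn`, and the function `progress_bounded` are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring-buffer FIFO; NOT the real sm BTL data structure. */
#define DEMO_FIFO_CAP 1024

typedef struct {
    int items[DEMO_FIFO_CAP];
    size_t head; /* next slot to read  */
    size_t tail; /* next slot to write */
} demo_fifo_t;

void demo_fifo_init(demo_fifo_t *f) { f->head = f->tail = 0; }

/* Returns 0 on success, -1 if the FIFO is full. */
int demo_fifo_push(demo_fifo_t *f, int msg) {
    size_t next = (f->tail + 1) % DEMO_FIFO_CAP;
    if (next == f->head) return -1;
    f->items[f->tail] = msg;
    f->tail = next;
    return 0;
}

/* Returns 0 on success, -1 if the FIFO is empty. */
int demo_fifo_pop(demo_fifo_t *f, int *out) {
    if (f->head == f->tail) return -1;
    *out = f->items[f->head];
    f->head = (f->head + 1) % DEMO_FIFO_CAP;
    return 0;
}

/* Hypothetical stand-in for the callback into the upper layer. */
typedef void (*deliver_fn)(int msg, void *ctx);

/* Example callback: counts deliveries in a size_t pointed to by ctx. */
void count_delivery(int msg, void *ctx) {
    (void)msg;
    ++*(size_t *)ctx;
}

/*
 * Drain at most max_drain messages, then return the number delivered.
 * max_drain == 1 mimics the 1.2.8 behavior described above; a very
 * large max_drain approaches the 1.3.0 "drain everything" behavior.
 */
size_t progress_bounded(demo_fifo_t *f, size_t max_drain,
                        deliver_fn cb, void *ctx) {
    size_t delivered = 0;
    int msg;
    assert(max_drain > 0);
    while (delivered < max_drain && demo_fifo_pop(f, &msg) == 0) {
        cb(msg, ctx);
        delivered++;
    }
    return delivered;
}
```

With 250 queued messages and a bound of 100, three successive calls would deliver 100, 100, and 50 messages, so no single progress call can trigger an unbounded number of callbacks.
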
>> I checked with Terry and we can't really recover the history here.
>> Perhaps draining ACKs is good enough. After the first message, we
>> can return.
> Ok recovering history here, not sure it matters though. First the
> performance issue George and I discussed and fixed is documented in
> thread
> As was mentioned this was only to retrieve ack packets and should
> not have any bearing on expanding the unexpected queue. The
> original change was r18724 and did not add line 432 mentioned below.
>> That's a one-line change. Just comment out line 432 ("goto
>> recheck_peer;") in
>> #432 .
> Line 432 was introduced by r19309 to fix ticket #1378. However,
> something more is at hand, since Eugene's experiments show that
> removing this line doesn't help reduce the number of unexpecteds.
>> Problem is, that doesn't "fix" things. That is, my deadlock
>> avoidance stuff (hg workspace on milliways that I sent out a
>> pointer to) seems to be enough to, well, avoid deadlock, but
>> unexpected-message queues are still growing like mad I think. Even
>> when sm progress returns after the first message fragment is
>> received. (X=1.) I think it's even true if the max free-list size
>> is capped at something small. I *think* (but am too tired to
>> "know") that the issue is we poll the FIFO often anyhow. We have
>> to for sends to reclaim fragments. We have to for receives, to
>> pull out messages of interest. Maybe things would be better if we
>> had one FIFO for in-coming fragments and another for returning
>> fragments. We could poll the latter only when we needed another
>> fragment for sending.
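
[Editor's sketch] The two-FIFO idea floated above (one queue for incoming fragments, another for returned fragments, with the latter polled only when the sender runs out of fragments) might be sketched as follows. Everything here is an assumption made for illustration: `demo_queue_t`, `demo_endpoint_t`, and `endpoint_alloc_frag` are hypothetical and do not correspond to actual Open MPI structures or functions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_Q_CAP 64

/* Hypothetical ring buffer; NOT an Open MPI type. */
typedef struct {
    int slots[DEMO_Q_CAP];
    size_t head;
    size_t tail;
} demo_queue_t;

size_t demo_queue_len(const demo_queue_t *q) {
    return (q->tail + DEMO_Q_CAP - q->head) % DEMO_Q_CAP;
}

int demo_queue_push(demo_queue_t *q, int v) {
    size_t next = (q->tail + 1) % DEMO_Q_CAP;
    if (next == q->head) return -1; /* full */
    q->slots[q->tail] = v;
    q->tail = next;
    return 0;
}

int demo_queue_pop(demo_queue_t *q, int *out) {
    if (q->head == q->tail) return -1; /* empty */
    *out = q->slots[q->head];
    q->head = (q->head + 1) % DEMO_Q_CAP;
    return 0;
}

/* Hypothetical per-peer state with two separate queues. */
typedef struct {
    demo_queue_t incoming; /* fragments the peer sent to us          */
    demo_queue_t returned; /* our fragments the peer has handed back */
    int free_frags;        /* fragments currently available to send  */
    size_t return_polls;   /* how often we had to poll `returned`    */
} demo_endpoint_t;

void endpoint_init(demo_endpoint_t *ep, int initial_frags) {
    ep->incoming.head = ep->incoming.tail = 0;
    ep->returned.head = ep->returned.tail = 0;
    ep->free_frags = initial_frags;
    ep->return_polls = 0;
}

/*
 * Take one fragment for sending. The returned-fragment queue is polled
 * only when the local free count is exhausted; the incoming queue is
 * never touched on this path. Returns 0 on success, -1 if no fragment
 * is available even after reclaiming.
 */
int endpoint_alloc_frag(demo_endpoint_t *ep) {
    int frag_id;
    assert(ep != NULL);
    if (ep->free_frags == 0) {
        ep->return_polls++;
        while (demo_queue_pop(&ep->returned, &frag_id) == 0) {
            ep->free_frags++;
        }
    }
    if (ep->free_frags == 0) return -1;
    ep->free_frags--;
    return 0;
}
```

The point of the split is that reclaiming send fragments no longer forces the process to pull incoming messages off the queue, so a send-heavy rank would not inflate its own unexpected-message queue just to find its acks.
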
> So is the issue Eugene is describing that one rank is flooding
> the other with so many messages that the flooded victim cannot see
> the FRAG_ACKs without draining the real (flooding) messages from the
> FIFO first?
> It seems that either having separate FIFOs, as Eugene describes
> above, or instituting some type of flow control (number of in-flight
> messages allowed) might help.
> --td
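
[Editor's sketch] The flow-control alternative mentioned above, a cap on the number of in-flight messages, is commonly implemented as a simple credit counter. The following is a minimal sketch under that assumption; `demo_flow_ctl_t`, `flow_try_send`, and `flow_on_ack` are hypothetical names, and nothing in the thread says this is how the sm BTL would implement it.

```c
#include <assert.h>

/* Hypothetical per-peer flow-control state; NOT an Open MPI type. */
typedef struct {
    unsigned inflight;     /* messages sent but not yet acknowledged */
    unsigned max_inflight; /* cap on unacknowledged messages         */
} demo_flow_ctl_t;

void flow_init(demo_flow_ctl_t *fc, unsigned max_inflight) {
    assert(max_inflight > 0);
    fc->inflight = 0;
    fc->max_inflight = max_inflight;
}

/*
 * Returns 1 and accounts for the message if the sender is under its
 * cap; returns 0 if the caller must wait (for example by progressing
 * the receive side until acks arrive).
 */
int flow_try_send(demo_flow_ctl_t *fc) {
    if (fc->inflight >= fc->max_inflight) return 0;
    fc->inflight++;
    return 1;
}

/* Called when the peer acknowledges one message. */
void flow_on_ack(demo_flow_ctl_t *fc) {
    assert(fc->inflight > 0);
    fc->inflight--;
}
```

With a cap of N, a flooding rank can place at most N unacknowledged messages in the victim's FIFO, which also bounds how many real messages the victim must drain before it reaches an ack.
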
>> But I'm under pressure to shift my attention to other activities.
>> So, I think I'm going to abandon this effort. The flow control
>> problem seems thorny. I can think of fixes as fast as I can
>> identify flow-control problems, but the rate of new flow-control
>> problems just doesn't seem to abate. Meanwhile, my unexpected-work
>> queue grows unbounded. :^)
>>> I think the combination of Eugene's deadlock avoidance fix and
>>> the careful queue draining should make me comfortable enough to
>>> start another round of testing, but at least explains the bottom
>>> line issues.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]