
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] sm BTL flow management
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-06-26 11:59:40

As Terry described and based on the patch attached to the ticket on
trac, the extra goto slipped into the commit by mistake. It belongs
to a totally different patch for shared memory I'm working on. I'll
remove it.


On Jun 26, 2009, at 06:52 , Terry Dontje wrote:

> Eugene Loh wrote:
>> Brian W. Barrett wrote:
>>> All -
>>> Jeff, Eugene, and I had a long discussion this morning on the sm
>>> BTL flow management issues and came to a couple of conclusions.
>>> * Jeff, Eugene, and I are all convinced that Eugene's addition of
>>> polling the receive queue to drain acks when sends start backing
>>> up is required for deadlock avoidance.
>>> * We're also convinced that George's proposal, while a good idea
>>> in general, is not sufficient. The send path doesn't appear to
>>> sufficiently progress the btl to avoid the deadlocks we're seeing
>>> with the SM btl today. Therefore, while I still recommend sizing
>>> the fifo appropriately and limiting the freelist size, I think
>>> it's not sufficient to solve all problems.
>>> * Finally, it took an hour, but we did determine one of the major
>>> differences between 1.2.8 and 1.3.0 in terms of sm is how messages
>>> were pulled off the FIFO. In 1.2.8 (and all earlier versions), we
>>> return from btl_progress after a single message is received (ack
>>> or message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene
>>> did), we changed to completely draining all queues before
>>> returning from btl_progress. This has led to a situation where a
>>> single call to btl_progress can make a large number of callbacks
>>> into the PML (900,000 times in one of Eugene's test cases). The
>>> change was made to resolve an issue Terry was having with
>>> performance of a benchmark. We've decided that it would be
>>> advantageous to try something between the two points and drain X
>>> number of messages from the queue, then return, where X is 100 or
>>> so at most. This should cover the performance issues Terry saw,
>>> but still not cause the huge number of messages added to the
>>> unexpected queue with a single call to MPI_Recv. Since a recv
>>> that is matched on the unexpected queue doesn't result in a call
>>> to opal_progress, this should help balance the load a little bit
>>> better. Eugene's going to take a stab at implementing this short
>>> term.
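
The "drain at most X messages, then return" idea from the last bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the sm BTL code: `fifo_t`, `progress_bounded`, `count_cb`, and `MAX_DRAIN_PER_CALL` are hypothetical names, and the FIFO is a plain single-threaded ring buffer rather than a shared-memory queue.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 1024
#define MAX_DRAIN_PER_CALL 100  /* the "X" from the discussion (hypothetical name) */

/* Hypothetical single-threaded ring buffer standing in for the sm FIFO. */
typedef struct {
    int items[FIFO_CAP];
    size_t head;  /* next slot to read  */
    size_t tail;  /* next slot to write */
} fifo_t;

/* Returns 0 on success, -1 if the FIFO is full. */
static int fifo_push(fifo_t *f, int v)
{
    size_t next = (f->tail + 1) % FIFO_CAP;
    if (next == f->head) return -1;
    f->items[f->tail] = v;
    f->tail = next;
    return 0;
}

/* Returns 0 on success, -1 if the FIFO is empty. */
static int fifo_pop(fifo_t *f, int *v)
{
    if (f->head == f->tail) return -1;
    *v = f->items[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    return 0;
}

/* Example callback that only counts how often the upper layer is invoked. */
static int callbacks_made = 0;
static void count_cb(int v) { (void)v; callbacks_made++; }

/* Hand at most `limit` entries to the callback, then return the number
 * handled, so one progress call cannot trigger an unbounded number of
 * upper-layer callbacks. */
static int progress_bounded(fifo_t *f, int limit, void (*cb)(int))
{
    int handled = 0, v;
    assert(limit > 0);
    while (handled < limit && fifo_pop(f, &v) == 0) {
        cb(v);
        handled++;
    }
    return handled;
}
```

In this sketch a limit of 1 corresponds to the 1.2.8 behavior described above (return after a single message), and a very large limit approximates the 1.3.0 drain-everything behavior. With 250 queued entries and a limit of 100, successive calls handle 100, 100, and then 50 entries.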
>> I checked with Terry and we can't really recover the history here.
>> Perhaps draining ACKs is good enough. After the first message, we
>> can return.
> Ok recovering history here, not sure it matters though. First the
> performance issue George and I discussed and fixed is documented in
> thread
> As was mentioned this was only to retrieve ack packets and should
> not have any bearing on expanding the unexpected queue. The
> original change was r18724 and did not add line 432 mentioned below.
>> That's a one-line change. Just comment out line 432 ("goto
>> recheck_peer;") in
>> #432 .
> Line 432 was introduced by r19309 to fix ticket #1378. However,
> something more is at hand, since Eugene's experiments show that
> removing this line doesn't help reduce the number of unexpecteds.
>> Problem is, that doesn't "fix" things. That is, my deadlock
>> avoidance stuff (hg workspace on milliways that I sent out a
>> pointer to) seems to be enough to, well, avoid deadlock, but
>> unexpected-message queues are still growing like mad I think. Even
>> when sm progress returns after the first message fragment is
>> received. (X=1.) I think it's even true if the max free-list size
>> is capped at something small. I *think* (but am too tired to
>> "know") that the issue is we poll the FIFO often anyhow. We have
>> to for sends to reclaim fragments. We have to for receives, to
>> pull out messages of interest. Maybe things would be better if we
>> had one FIFO for in-coming fragments and another for returning
>> fragments. We could poll the latter only when we needed another
>> fragment for sending.
> So is the issue Eugene is describing that one rank is flooding
> the other with so many messages that the flooded victim cannot see
> the FRAG_ACKs without draining the real (flooding) messages from the
> FIFO first?
> It seems that either having separate FIFOs, as Eugene describes
> above, or instituting some type of flow control (number of in-flight
> messages allowed) might help.
> --td
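
The two suggestions above (a separate FIFO for returned fragments, plus a cap on in-flight messages) can be combined in a small sketch. Everything here is an assumption for illustration: `ring_t`, `sender_t`, `try_send`, `reclaim_acks`, `recv_one`, and `MAX_INFLIGHT` are hypothetical names, the rings are single-threaded, and none of this is taken from the actual sm BTL.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 64
#define MAX_INFLIGHT 8  /* hypothetical cap on unacknowledged fragments */

/* Hypothetical single-threaded ring buffer. */
typedef struct {
    int slots[RING_CAP];
    size_t head, tail;
} ring_t;

static int ring_push(ring_t *r, int v)
{
    size_t next = (r->tail + 1) % RING_CAP;
    if (next == r->head) return -1;  /* full */
    r->slots[r->tail] = v;
    r->tail = next;
    return 0;
}

static int ring_pop(ring_t *r, int *v)
{
    if (r->head == r->tail) return -1;  /* empty */
    *v = r->slots[r->head];
    r->head = (r->head + 1) % RING_CAP;
    return 0;
}

typedef struct {
    ring_t *data_out;  /* fragments travelling to the peer        */
    ring_t *ack_in;    /* returned fragments coming from the peer */
    int inflight;      /* fragments sent but not yet returned     */
} sender_t;

/* Reclaim fragments by polling only the ack ring; incoming data
 * fragments are never touched on this path. */
static int reclaim_acks(sender_t *s)
{
    int v, n = 0;
    while (ring_pop(s->ack_in, &v) == 0) {
        s->inflight--;
        n++;
    }
    return n;
}

/* Returns 0 if the fragment was queued, -1 if the caller must retry. */
static int try_send(sender_t *s, int frag)
{
    if (s->inflight >= MAX_INFLIGHT) reclaim_acks(s);
    if (s->inflight >= MAX_INFLIGHT) return -1;   /* still out of credit */
    if (ring_push(s->data_out, frag) != 0) return -1;
    s->inflight++;
    return 0;
}

/* Receiver side: take one fragment and return it on the ack ring. */
static int recv_one(ring_t *data_in, ring_t *ack_out, int *frag)
{
    int rc;
    if (ring_pop(data_in, frag) != 0) return -1;
    rc = ring_push(ack_out, *frag);
    assert(rc == 0);  /* cannot overflow: at most MAX_INFLIGHT acks pending */
    (void)rc;
    return 0;
}
```

In this sketch the sender stalls after `MAX_INFLIGHT` fragments until the receiver returns at least one, so the receiver's backlog is bounded by the cap. Because acks travel on their own ring, the sender can find them without first draining incoming data fragments.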
>> But I'm under pressure to shift my attention to other activities.
>> So, I think I'm going to abandon this effort. The flow control
>> problem seems thorny. I can think of fixes as fast as I can
>> identify flow-control problems, but the rate of new flow-control
>> problems just doesn't seem to abate. Meanwhile, my unexpected-work
>> queue grows unbounded. :^)
>>> I think the combination of Eugene's deadlock avoidance fix and
>>> the careful queue draining should make me comfortable enough to
>>> start another round of testing, but at least explains the bottom
>>> line issues.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]