Jeff Squyres wrote:
> On May 19, 2008, at 4:44 PM, Steve Wise wrote:
>>> 1. Posting more at low watermark can lead to DoS-like behavior when
>>> you have a fast sender and a slow receiver. This is exactly the
>>> resource-exhaustion kind of behavior that a high quality MPI
>>> implementation is supposed to avoid -- we really should throttle
>>> the sender somehow.
>>> 2. Resending ad infinitum simply eats up more bandwidth and takes
>>> network resources (e.g., switch resources) away from other, legitimate
>>> traffic, particularly if the receiver doesn't dip into the MPI layer
>>> for many hours. So yes, it *works*, but it's definitely sub-optimal.
>> The SRQ low water mark is simply an API method to allow applications
>> to try to never hit the "we're totally out of recv bufs" problem. That's a
>> tool that I think is needed for SRQ users no matter what flow control
>> method you use to try to avoid Jeff's #1 item above.
> If you had these buffers available, why didn't you post them when the
> QP was created / this sender was added?
Because you're trying to reduce memory requirements at the expense of
under-provisioning the SRQ. If you don't want the transport to drop and
retransmit, then you might want an algorithm to increase the low water
mark during bursty periods.
> This mechanism *might* make sense if there was a sensible approach to
> know when to remove the "additional" buffers posted to an SRQ due to
> bursty traffic. But how do you know when that is?
Thinking out loud:
- keep the SRQ up to the low water mark as a normal course of events
- increase the low water mark value as you get more and more "low
water mark exceeded" events
- decrease the low water mark as these events become less frequent.
Dunno if this is worth the effort.
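
To make the idea above a bit more concrete, here is a rough sketch of
that policy in Python. Everything in it is an assumption for
illustration: the class name, the doubling/halving steps, the floor and
ceiling values, and the "quiet interval" notion are all made up, not
anything OMPI or the verbs API defines. In a verbs-based implementation
the resulting value would presumably be applied with ibv_modify_srq()
and IBV_SRQ_LIMIT, but that part is not shown.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveLowWatermark:
    """Hypothetical policy object; none of these names come from OMPI or verbs."""

    floor: int = 16         # smallest low water mark allowed (assumed value)
    ceiling: int = 1024     # largest low water mark allowed (assumed value)
    value: int = 16         # current low water mark
    quiet_needed: int = 4   # quiet intervals required before shrinking (assumed)
    _quiet: int = 0         # consecutive intervals seen with no limit event

    def on_limit_event(self) -> int:
        """A "low water mark exceeded" event arrived: grow the mark."""
        self._quiet = 0
        self.value = min(self.value * 2, self.ceiling)
        return self.value

    def on_quiet_interval(self) -> int:
        """An interval passed with no event: shrink, but only after a while."""
        self._quiet += 1
        if self._quiet >= self.quiet_needed:
            self._quiet = 0
            self.value = max(self.value // 2, self.floor)
        return self.value

    def buffers_to_post(self, posted: int) -> int:
        """Receive buffers needed to bring the SRQ back up to the mark."""
        return max(self.value - posted, 0)
```

Growing quickly and shrinking slowly is one possible choice here; it
favors avoiding RNR/retransmits during a burst over giving memory back
promptly, which may or may not be the right trade-off.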
>> And if you don't like the RNR retry/TCP retrans approach, which is bad for
>> reason #2 (and because TCP will eventually give up and reset the
>> connection), then I think there needs to be some OMPI layer protocol to
>> stop senders that are abusing the SRQ pool for whatever reason (too
>> fast of a sender, sleeping beauty receiver never entering OMPI layer,
> That implies a progress thread. If/when we add a progress thread, it
> will likely be for progressing long messages. Myricom and MVAPICH
> have shown that rapidly firing progress threads are problematic for
> performance. But even if you have that progress thread *only* wake up
> on the low watermark for the SRQ, you have two problems:
> - there still could be many inbound messages that will overflow the
> SRQ and/or even more could be inbound by the time your STOP message
> gets to everyone (gets even worse as the MPI job scales up in total
> number of processes)
> - in the case of a very large MPI job, sending the STOP message has
> obvious scalability problems (have to send it to everyone, which
> requires its own set of send buffers and WQEs/CQEs)
Ok, STOP messages won't scale...dumb idea.