On May 19, 2008, at 4:44 PM, Steve Wise wrote:
>> 1. Posting more at low watermark can lead to DoS-like behavior when
>> you have a fast sender and a slow receiver. This is exactly the
>> resource-exhaustion kind of behavior that a high quality MPI
>> implementation is supposed to avoid -- we really should throttle
>> the sender somehow.
>> 2. Resending ad infinitum simply eats up more bandwidth and consumes
>> network resources (e.g., switch resources) that other, legitimate
>> traffic needs. Particularly if the receiver doesn't dip into the MPI layer
>> for many hours. So yes, it *works*, but it's definitely sub-optimal.
> The SRQ low water mark is simply an API method to allow applications
> to try to never hit the "we're totally out of recv bufs" problem.
> That's a tool that I think is needed for SRQ users no matter what flow
> control method you use to try to avoid Jeff's #1 item above.
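To make the repost-at-low-watermark idea concrete, here is a minimal sketch (plain Python, not libibverbs code; all names are hypothetical). It mimics the behavior being debated: a shared receive queue that posts another batch of buffers whenever the count of posted buffers drops to a low watermark, analogous to the verbs srq_limit / SRQ-limit-reached async event.

```python
# Illustrative sketch (hypothetical names, not the verbs API): a receive
# queue that rearms itself when the number of posted buffers falls to a
# low watermark, as the SRQ low-water-mark mechanism under discussion does.
class SharedRecvQueue:
    def __init__(self, initial_bufs, low_watermark, repost_batch):
        self.posted = initial_bufs          # recv buffers currently posted
        self.low_watermark = low_watermark  # threshold that fires the event
        self.repost_batch = repost_batch    # extra buffers posted per event
        self.limit_events = 0               # watermark events seen so far

    def consume(self, n):
        """Each arriving message consumes n posted buffers."""
        self.posted -= n
        if self.posted < 0:
            raise RuntimeError("RNR: out of recv buffers")
        if self.posted <= self.low_watermark:
            self.on_limit_reached()

    def on_limit_reached(self):
        # The open question from the thread: once posted, when (if ever)
        # should these "additional" buffers be removed again?
        self.limit_events += 1
        self.posted += self.repost_batch

srq = SharedRecvQueue(initial_bufs=8, low_watermark=2, repost_batch=8)
for _ in range(30):   # a burst of 30 single-buffer messages
    srq.consume(1)
```

Note that nothing here throttles the sender: a burst simply fires the watermark event over and over, which is exactly the resource-exhaustion concern in item #1 above.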
If you had these buffers available, why didn't you post them when the
QP was created / this sender was added?
This mechanism *might* make sense if there were a sensible way to
know when to remove the "additional" buffers posted to an SRQ due to
bursty traffic. But how do you know when that is?
> And if you don't like the RNR retry/TCP retrans approach, which is bad
> for reason #2 (and because TCP will eventually give up and reset the
> connection), then I think there needs to be some OMPI-layer protocol
> to stop senders that are abusing the SRQ pool for whatever reason (too
> fast of a sender, a sleeping-beauty receiver never entering the OMPI
> layer, etc.).
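One common alternative to both RNR retries and an explicit STOP message is sender-side credit accounting. Below is a minimal sketch of that idea (plain Python with hypothetical names; this is not Open MPI's actual wire protocol): a sender may only transmit while it holds receiver-granted credits, so a fast sender throttles itself locally instead of generating retries or requiring a STOP broadcast.

```python
# Illustrative sketch (hypothetical names): credit-based flow control.
# The receiver grants credits as it reposts recv buffers; the sender
# defers sends locally once its credits are exhausted.
class CreditedSender:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # recv buffers the peer has promised
        self.backlog = []                # sends deferred for lack of credit

    def send(self, msg):
        if self.credits == 0:
            self.backlog.append(msg)     # throttle: queue locally, nothing
            return False                 # goes on the wire
        self.credits -= 1
        return True                      # message transmitted

    def on_credit_update(self, n):
        """Receiver piggybacks credit grants when it reposts buffers."""
        self.credits += n
        while self.credits and self.backlog:
            self.send(self.backlog.pop(0))

sender = CreditedSender(initial_credits=4)
sent = [sender.send(i) for i in range(6)]   # only the first 4 go out now
sender.on_credit_update(2)                  # receiver reposted 2 buffers
```

The appeal here is that back-pressure is implicit and per-peer: no broadcast, no progress thread needed to notice the watermark, and a "sleeping beauty" receiver simply stops granting credits.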
That implies a progress thread. If/when we add a progress thread, it
will likely be for progressing long messages. Myricom and MVAPICH
have shown that rapidly firing progress threads are problematic for
performance. But even if you have that progress thread *only* wake up
on the low watermark for the SRQ, you have two problems:
- there could still be many inbound messages that will overflow the
SRQ, and/or even more could be inbound by the time your STOP message
reaches everyone (this gets even worse as the MPI job scales up in
total number of processes)
- in the case of a very large MPI job, sending the STOP message has
obvious scalability problems (have to send it to everyone, which
requires its own set of send buffers and WQEs/CQEs)