>> 1. When CM progress thread completes an incoming connection, it sends
>> a command down a pipe to the main thread indicating that a new
>> endpoint is ready to use. The pipe message will be noticed by
>> opal_progress() in the main thread, which will run a function to do
>> all the necessary housekeeping (set the endpoint state to CONNECTED,
>> etc.).
>> But it is possible that the receiver process won't dip into the MPI
>> layer for a long time (and therefore not call opal_progress and the
>> housekeeping function). Therefore, it is possible that with an active
>> sender and a slow receiver, the sender can overwhelm an SRQ. On IB,
>> this will just generate RNRs and be ok (we configure SRQs to have
>> infinite RNRs), but I don't understand the semantics of what will
>> happen on iWARP (it may terminate? I sent an off-list question to
>> Steve Wise to ask for detail -- we may have other issues with SRQ on
>> iWARP if this is the case, but let's skip that discussion for now).
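Just to make sure we're talking about the same mechanism, here is a
minimal sketch of the pipe-based wakeup described above (all names are
illustrative, not the actual openib BTL code):

    /* Illustrative sketch only -- not the actual openib BTL code. */
    #include <unistd.h>

    enum { CM_CMD_ENDPOINT_CONNECTED = 1 };

    typedef struct {
        int   cmd;        /* which housekeeping action to run */
        void *endpoint;   /* endpoint that just finished connecting */
    } cm_cmd_t;

    static int service_pipe[2];   /* created with pipe() during init */

    /* CM progress thread: an incoming connection completed */
    static void cm_notify_main_thread(void *endpoint)
    {
        cm_cmd_t c = { CM_CMD_ENDPOINT_CONNECTED, endpoint };
        (void) write(service_pipe[1], &c, sizeof(c));
    }

    /* Main thread: called from opal_progress() when service_pipe[0]
       is readable; does the housekeeping (endpoint -> CONNECTED, etc.) */
    static void cm_pipe_progress(void)
    {
        cm_cmd_t c;
        if (read(service_pipe[0], &c, sizeof(c)) == (ssize_t) sizeof(c)) {
            /* set c.endpoint's state to CONNECTED, flush queued sends */
        }
    }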
> Is it possible to have a sane SRQ implementation without HW flow control?
> Anyway, the described problem exists with SRQ right now too. If the receiver
> doesn't enter progress for a long time, the sender can overwhelm an SRQ.
> I don't see how this can be fixed without a progress thread (and I am not
> even sure that this is a problem that has to be fixed).
It may be resolved in part by srq_limit_event (this event is generated
when the number of posted receive buffers drops below a predefined
threshold). But I'm not sure that we want to move the RNR problem from
the sender side to the receiver side.
The full solution would be a progress thread + srq_limit_event.
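For reference, this is roughly how the limit event is armed and
consumed with libibverbs (a sketch; it assumes some thread polls the
device's async events and re-arms after every event):

    /* Sketch: arm the SRQ limit and consume the async event. */
    #include <infiniband/verbs.h>

    /* Ask for IBV_EVENT_SRQ_LIMIT_REACHED when the number of posted
       receives drops below 'limit'.  The limit must be re-armed after
       each event. */
    static int arm_srq_limit(struct ibv_srq *srq, uint32_t limit)
    {
        struct ibv_srq_attr attr = { .srq_limit = limit };
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    /* Wherever the device's async events are polled: */
    static void handle_async_event(struct ibv_context *ctx)
    {
        struct ibv_async_event ev;
        if (0 == ibv_get_async_event(ctx, &ev)) {
            if (IBV_EVENT_SRQ_LIMIT_REACHED == ev.event_type) {
                /* repost receives to ev.element.srq here, then
                   re-arm via arm_srq_limit() */
            }
            ibv_ack_async_event(&ev);
        }
    }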
>> Even if we can get the iWARP semantics to work, this feels kinda
>> icky. Perhaps I'm overreacting and this isn't a problem that needs to
>> be fixed -- after all, this situation is no different than what
>> happens after the initial connection, but it still feels icky.
> What is so icky about it? Sender is faster than a receiver so flow control
> kicks in.
>> 2. The CM progress thread posts its own receive buffers when creating
>> a QP (which is a necessary step in both CMs). However, this is
>> problematic in two cases:
> I don't like 1,2 and 3. :(
If iWARP can handle RNRs, #1 sounds OK to me, at least for 1.3.
>> 4. Have a separate mpool for drawing initial receive buffers for the
>> CM-posted RQs. We'd probably want this mpool to be always empty (or
>> close to empty) -- it's ok to be slow to allocate / register more
>> memory when a new connection request arrives. The memory obtained
>> from this mpool should be able to be returned to the "main" mpool
>> after it is consumed.
> This is slightly better, but still...
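For what it's worth, a sketch of what drawing a registered buffer on
demand for the CM-posted RQs could look like (the cm_rq_buf_* names
are made up; the slow path is acceptable here because it only runs
when a new connection request arrives):

    /* Sketch: allocate + register a receive buffer on demand. */
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct cm_rq_buf {
        void          *base;
        struct ibv_mr *mr;
    };

    static int cm_rq_buf_alloc(struct ibv_pd *pd, size_t len,
                               struct cm_rq_buf *buf)
    {
        if (0 != posix_memalign(&buf->base, 4096, len)) return -1;
        buf->mr = ibv_reg_mr(pd, buf->base, len, IBV_ACCESS_LOCAL_WRITE);
        if (NULL == buf->mr) { free(buf->base); return -1; }
        return 0;
    }

    /* After the buffer has been consumed: deregister and give the
       memory back (handing it to the "main" mpool would be the Open
       MPI-specific part, not shown). */
    static void cm_rq_buf_return(struct cm_rq_buf *buf)
    {
        (void) ibv_dereg_mr(buf->mr);
        free(buf->base);
    }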
>> 5. ...?
> What about moving posting of receive buffers into main thread. With
> SRQ it is easy: don't post anything in CPC thread. Main thread will
> prepost buffers automatically after first fragment received on the
> endpoint (in btl_openib_handle_incoming()).
It still doesn't guarantee that we won't see RNRs (as I understand it,
we are trying to solve this problem for iWARP?!).
So this solution will cost 1 buffer on each SRQ... sounds acceptable
to me. But I don't see much difference compared to #1; as I understand
it, we will need the pipe for communication with the main thread
anyway, so why not use #1?
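To illustrate the SRQ case you describe, the main thread would prepost
only after the first fragment shows up on an endpoint, something like
this (names are illustrative, not the real btl_openib_handle_incoming()):

    /* Sketch: lazy preposting from the main thread. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    typedef struct { int srq_preposted; } endpoint_t;  /* illustrative */

    static int post_one_srq_recv(struct ibv_srq *srq, struct ibv_mr *mr,
                                 void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uintptr_t) buf,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad;
        return ibv_post_srq_recv(srq, &wr, &bad);
    }

    /* In the main thread's incoming-fragment handler: on the first
       fragment from an endpoint, prepost a batch of receives, then
       mark the endpoint so this happens only once. */
    static void maybe_prepost(endpoint_t *ep, struct ibv_srq *srq)
    {
        if (!ep->srq_preposted) {
            /* loop post_one_srq_recv() over a batch of free-list
               buffers here */
            ep->srq_preposted = 1;
        }
    }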
> With PPRQ it's more
> complicated. What if we prepost dummy buffers (not from the free list)
> during the IBCM connection stage and run another three-way handshake
> protocol using those buffers, but from the main thread? We will need to
> prepost one buffer on the active side and two buffers on the passive side.
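If I follow the proposal, the flow would look something like this (a
sketch; the CONNECT/CONNECT_ACK/READY names are made up, and the buffer
counts are the ones you give above):

    active side                        passive side
    -----------                        ------------
    prepost 1 dummy recv               prepost 2 dummy recvs
    send CONNECT   ------------------> (consumes dummy recv #1)
                   <------------------ send CONNECT_ACK
    (consumes the 1 dummy recv)
    main thread preposts real
    free-list buffers
    send READY     ------------------> (consumes dummy recv #2)
                                       main thread preposts real
                                       free-list buffers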