
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2008-05-19 10:08:17


>> 1. When CM progress thread completes an incoming connection, it sends
>> a command down a pipe to the main thread indicating that a new
>> endpoint is ready to use. The pipe message will be noticed by
>> opal_progress() in the main thread and will run a function to do all
>> necessary housekeeping (sets the endpoint state to CONNECTED, etc.).
>> But it is possible that the receiver process won't dip into the MPI
>> layer for a long time (and therefore not call opal_progress and the
>> housekeeping function). Therefore, it is possible that with an active
>> sender and a slow receiver, the sender can overwhelm an SRQ. On IB,
>> this will just generate RNRs and be ok (we configure SRQs to have
>> infinite RNRs), but I don't understand the semantics of what will
>> happen on iWARP (it may terminate? I sent an off-list question to
>> Steve Wise to ask for detail -- we may have other issues with SRQ on
>> iWARP if this is the case, but let's skip that discussion for now).
>>
>>
> Is it possible to have a sane SRQ implementation without HW flow control?
> Anyway, the described problem exists with SRQ right now too. If the receiver
> doesn't enter progress for a long time, the sender can overwhelm an SRQ.
> I don't see how this can be fixed without a progress thread (and I am not
> even sure that this is a problem that has to be fixed).
>
It may be partially resolved by srq_limit_event (this event is generated when
the number of posted receive buffers drops below a predefined watermark).
But I'm not sure that we want to move the RNR problem from the sender side to
the receiver side.

The full solution would be a progress thread + srq_limit_event.
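
Just to illustrate the watermark idea, here is a minimal libibverbs sketch of
how the SRQ limit could be armed and handled. Only the ibv_* calls are the
real verbs API; repost_srq_buffers() and the surrounding structure are
assumptions for the sake of the example, not the openib BTL's actual code:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Hypothetical routine that refills the SRQ from a free list. */
    extern void repost_srq_buffers(struct ibv_srq *srq);

    /* Arm the SRQ limit: once the number of receive WRs posted to the SRQ
     * drops below 'watermark', the HCA generates an
     * IBV_EVENT_SRQ_LIMIT_REACHED async event.  The limit is one-shot, so it
     * must be re-armed after every event. */
    static int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
    {
        struct ibv_srq_attr attr = { .srq_limit = watermark };
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    /* Async-event loop (would run in a progress thread): on the limit event,
     * repost receive buffers and re-arm the watermark. */
    static void srq_event_loop(struct ibv_context *ctx, uint32_t watermark)
    {
        struct ibv_async_event event;
        while (0 == ibv_get_async_event(ctx, &event)) {
            if (IBV_EVENT_SRQ_LIMIT_REACHED == event.event_type) {
                struct ibv_srq *srq = event.element.srq;
                repost_srq_buffers(srq);        /* refill the SRQ            */
                arm_srq_limit(srq, watermark);  /* re-arm for the next time  */
            }
            ibv_ack_async_event(&event);
        }
    }
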

>
>> Even if we can get the iWARP semantics to work, this feels kinda
>> icky. Perhaps I'm overreacting and this isn't a problem that needs to
>> be fixed -- after all, this situation is no different than what
>> happens after the initial connection, but it still feels icky.
>>
> What is so icky about it? The sender is faster than the receiver, so flow
> control kicks in.
>
>
>> 2. The CM progress thread posts its own receive buffers when creating
>> a QP (which is a necessary step in both CMs). However, this is
>> problematic in two cases:
>>
>>
> [skip]
>
> I don't like 1,2 and 3. :(
>
If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.
>
>> 4. Have a separate mpool for drawing initial receive buffers for the
>> CM-posted RQs. We'd probably want this mpool to be always empty (or
>> close to empty) -- it's ok to be slow to allocate / register more
>> memory when a new connection request arrives. The memory obtained
>> from this mpool should be able to be returned to the "main" mpool
>> after it is consumed.
>>
>
> This is slightly better, but still...
>
>
>> 5. ...?
>>
> What about moving the posting of receive buffers into the main thread? With
> SRQ it is easy: don't post anything in the CPC thread. The main thread will
> prepost buffers automatically after the first fragment is received on the
> endpoint (in btl_openib_handle_incoming()).
It still doesn't guarantee that we will not see RNR (as I understand it, we
are trying to resolve this problem for iWARP?!).

So this solution will cost 1 buffer on each SRQ ... sounds acceptable to me.
But I don't see much difference compared to #1: as I understand it, we will
need the pipe for communication with the main thread anyway, so why not
use #1?
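
To illustrate what I mean by the pipe, a minimal POSIX sketch of the handoff
that #1 implies. This is outside the real OMPI progress/event machinery:
cm_pipe_progress() would have to be hooked into opal_progress (or the fd
registered with the event engine), and cpc_complete_connect() plus the
opaque endpoint type are stand-ins, not the actual openib BTL code:

    #include <unistd.h>
    #include <fcntl.h>

    /* Stand-ins for the real endpoint type and housekeeping routine. */
    typedef struct mca_btl_openib_endpoint_t mca_btl_openib_endpoint_t;
    extern void cpc_complete_connect(mca_btl_openib_endpoint_t *ep);

    /* [0] read end (main thread), [1] write end (CM progress thread). */
    static int cm_pipe[2];

    static int cm_pipe_init(void)
    {
        if (0 != pipe(cm_pipe)) return -1;
        /* Non-blocking read end so the main-thread poll never stalls. */
        return fcntl(cm_pipe[0], F_SETFL, O_NONBLOCK);
    }

    /* CM progress thread: connection is wired up, hand it to the main thread. */
    static void cm_notify_main_thread(mca_btl_openib_endpoint_t *ep)
    {
        (void) write(cm_pipe[1], &ep, sizeof(ep));   /* pointer-sized message */
    }

    /* Main thread: called from the progress loop; drains the pipe and does
     * the CONNECTED housekeeping in a safe context. */
    static int cm_pipe_progress(void)
    {
        mca_btl_openib_endpoint_t *ep;
        int count = 0;
        while ((ssize_t) sizeof(ep) == read(cm_pipe[0], &ep, sizeof(ep))) {
            cpc_complete_connect(ep);   /* set state to CONNECTED, etc. */
            ++count;
        }
        return count;
    }
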
> With PPRQ it's more
> complicated. What if we prepost dummy buffers (not from the free list)
> during the IBCM connection stage and run another three-way handshake
> protocol using those buffers, but from the main thread? We would need to
> prepost one buffer on the active side and two buffers on the passive side.
>
> --
> Gleb.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>