
Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-06 13:00:17

In addition to Steve's comments, we discussed this on the call today
and decided that the patch is fine.

Jon and I will discuss this further because it is the first case where
calling some form of "disconnect" on one side causes events to occur
on the other side without any involvement from the remote OMPI (e.g.,
the remote side's OMPI layer simply hasn't called its "disconnect"
flavor yet, but the kernel-level transport/network stack will cause
things to happen on the remote side anyway).
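
To make the idea concrete, here is a rough sketch of the kind of check
under discussion. It is not the actual patch. The enum names are
local stand-ins so the snippet compiles on its own; real code would
look at IBV_WC_WR_FLUSH_ERR and IBV_TRANSPORT_IWARP from
<infiniband/verbs.h>. It assumes that only iWARP flush completions are
benign, and that the flushed buffers still need to go back to their
pool:

```c
/* Stand-ins for the libibverbs values (hypothetical names). */
enum wc_status { WC_SUCCESS, WC_WR_FLUSH_ERR, WC_OTHER_ERR };
enum transport { TRANSPORT_IB, TRANSPORT_IWARP };

/* What the progress loop should do with one polled completion. */
enum wc_action {
    WC_PROCESS,      /* normal completion: handle the message */
    WC_RECLAIM_ONLY, /* expected flush: just return the buffer */
    WC_FATAL         /* genuine error: abort as before */
};

static enum wc_action classify_completion(enum wc_status status,
                                          enum transport transport)
{
    if (status == WC_SUCCESS)
        return WC_PROCESS;

    /* On iWARP, a peer disconnect moves the QP out of RTS and flushes
     * all posted WRs, so a flush here is not an error condition. */
    if (status == WC_WR_FLUSH_ERR && transport == TRANSPORT_IWARP)
        return WC_RECLAIM_ONLY;

    /* Everything else, including a flush on IB, stays fatal
     * (an assumption of this sketch). */
    return WC_FATAL;
}
```

Treating the flush as "reclaim only" rather than dropping it keeps
the buffer accounting correct for long-lived processes.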

On May 6, 2008, at 11:45 AM, Steve Wise wrote:

> Jeff Squyres wrote:
>> On May 5, 2008, at 6:27 PM, Steve Wise wrote:
>>>> I am seeing some unusual behavior during the shutdown phase of ompi
>>>> at the end of my testcase. While running an IMB pingpong test over
>>>> the rdmacm on openib, I get cq flush errors on my iWARP adapters.
>>>> This error is happening because the remote node is still polling
>>>> the endpoint after the other one has shut down. This occurs because
>>>> iWARP puts the qps in error state when the channel is disconnected
>>>> (IB does not do this). Since the cq is still being polled when the
>>>> event is received on the remote node, ompi thinks it hit an error
>>>> and kills the run. Since this is expected behavior on iWARP, this
>>>> is not really an error case.
>>> The key here, I think, is that when an iWARP QP moves out of RTS, all
>>> the RECVs and any pending SQ WRs get flushed. Further, disconnecting
>>> the iWARP connection forces the QP out of RTS. This is probably
>>> different from the way IB works; i.e., "disconnecting" in IB is an
>>> out-of-band exchange done by the IBCM. For iWARP, "disconnecting" is
>>> an in-band operation (a TCP close or abort), so the QP cannot remain
>>> in RTS during this process.
>> Let me make sure I understand:
>> - proc A calls del_procs on proc B
>> - proc A calls ibv_destroy_qp() on QP to proc B
> Actually proc A calls rdma_disconnect() on QP to proc B
>> - this causes a local (proc A) flush on all pending receives and
>>   SQ WRs
>> - this then causes a FLUSH event to show up *in proc B*
>> --> I'm not clear on this point from Jon's/Steve's text
> Yes. Once the connection is torn down, the iWARP QPs will be flushed
> on both ends.
>> - OMPI [currently] treats the FLUSH in proc B as an error
>> Is that right?
>> What is the purpose of the FLUSH event?
> In general, I think it is to allow the application to recover any
> resources that are allocated and cannot be touched until the WRs
> complete, for example the buffers that were described in all the
> WRs. If the app is going to exit, this isn't very interesting, since
> everything will get cleaned up in the exit path. But if the process
> is long-lived and is setting up and tearing down connections, then
> these pending RECV buffers need to be reclaimed and put back into
> the buffer pool, as an example...
>>>> There is a larger question regarding why the remote node is still
>>>> polling the hca and not shutting down, but my immediate question is
>>>> if it is an acceptable fix to simply disregard this "error" if it
>>>> is an iWARP adapter.
>> If proc B is still polling the hca, it is likely because it simply
>> has not yet stopped doing it. I.e., a big problem in MPI
>> implementations is that not all actions are exactly synchronous. MPI
>> disconnects are *effectively* synchronous, but we probably didn't
>> *guarantee* synchronicity in this case because we didn't need it
>> (perhaps until now).
> Yes.
> Steve.

Jeff Squyres
Cisco Systems