Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-06 10:39:23


On May 5, 2008, at 6:27 PM, Steve Wise wrote:

>> I am seeing some unusual behavior during the shutdown phase of ompi
>> at the end of my testcase. While running an IMB pingpong test over
>> the rdmacm on openib, I get cq flush errors on my iWARP adapters.
>>
>> This error is happening because the remote node is still polling
>> the endpoint while the other one has shut down. This occurs because
>> iWARP puts the qps in error state when the channel is disconnected
>> (IB does not do this). Since the cq is still being polled when the
>> event is received on the remote node, ompi thinks it hit an error
>> and kills the run. Since this is expected behavior on iWARP, this
>> is not really an error case.
>
> The key here, I think, is that when an iWARP QP moves out of RTS, all
> the RECVs and any pending SQ WRs get flushed. Further, disconnecting
> the iWARP connection forces the QP out of RTS. This is probably
> different than the way IB works, i.e., "disconnecting" in IB is an
> out-of-band exchange done by the IBCM. For iWARP, "disconnecting" is
> an in-band operation (a TCP close or abort), so the QP cannot remain
> in RTS during this process.

Let me make sure I understand:

- proc A calls del_procs on proc B
- proc A calls ibv_destroy_qp() on QP to proc B
- this causes a local (proc A) flush on all pending receives and SQ WRs
- this then causes a FLUSH event to show up *in proc B*
   --> I'm not clear on this point from Jon's/Steve's text
- OMPI [currently] treats the FLUSH in proc B as an error

Is that right?

What is the purpose of the FLUSH event?

>> There is a larger question regarding why the remote node is still
>> polling the hca and not shutting down, but my immediate question is
>> if it is an acceptable fix to simply disregard this "error" if it
>> is an iWARP adapter.

If proc B is still polling the hca, it is likely because it simply has
not yet stopped doing it. I.e., a big problem in MPI implementations
is that not all actions are exactly synchronous. MPI disconnects are
*effectively* synchronous, but we probably didn't *guarantee*
synchronicity in this case because we didn't need it (perhaps until
now).
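
For concreteness, here is a minimal sketch of what "disregarding" the
flush on iWARP could look like when a completion is polled. It is only
an illustration under assumptions: the enums are local stand-ins that
mirror IBV_WC_SUCCESS / IBV_WC_WR_FLUSH_ERR and IBV_TRANSPORT_IB /
IBV_TRANSPORT_IWARP from libibverbs, and classify_completion() is a
hypothetical name, not a function in the openib BTL.

```c
/* Local stand-ins so the sketch compiles without libibverbs. In real
 * code the status would come from struct ibv_wc and the transport type
 * from the device. */
enum wc_status { WC_SUCCESS, WC_WR_FLUSH_ERR, WC_OTHER_ERR };
enum transport { TRANSPORT_IB, TRANSPORT_IWARP };
enum wc_action { WC_PROCESS, WC_IGNORE, WC_FATAL };

/* Hypothetical helper: decide what to do with one polled completion. */
static enum wc_action classify_completion(enum wc_status status,
                                          enum transport transport)
{
    if (status == WC_SUCCESS) {
        return WC_PROCESS;
    }
    /* Proposed special case: a flushed WR on iWARP is expected when the
     * peer disconnects, so do not abort the run for it. */
    if (status == WC_WR_FLUSH_ERR && transport == TRANSPORT_IWARP) {
        return WC_IGNORE;
    }
    /* Everything else, including a flush on IB, stays an error. */
    return WC_FATAL;
}
```

Note that a check like this would also hide a flush on iWARP that is
not part of a shutdown, so a real fix would probably also want to know
whether the endpoint is being torn down.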

>> Opinions?
>>
>
> If the openib btl (or the layers above) assume the "disconnect" will
> notify the remote rank that the connection should be finalized, then
> we must deal with FLUSHED WRs for the iwarp case. If some sort of
> "finalizing" is done by OMPI and then the connections disconnected,
> then that "finalizing" should include not polling the CQ anymore. But
> that's not what we observe.

I'd have to check the exact shutdown sequence...
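
As a rough sketch of the alternative ordering ("finalizing" includes
not polling the CQ anymore, and only then disconnecting), assuming a
hypothetical per-peer struct and helper names that are not the openib
BTL's real ones:

```c
#include <stdbool.h>

/* Hypothetical per-peer state; not the openib BTL's real endpoint struct. */
struct endpoint {
    bool polling;    /* progress loop may still poll this peer's CQ */
    bool connected;  /* connection has not been torn down yet */
};

/* Finalize in the order discussed: stop polling first, then disconnect.
 * The second assignment stands in for the real disconnect call. */
static void endpoint_finalize(struct endpoint *ep)
{
    ep->polling = false;
    ep->connected = false;
}

/* Would a flushed completion for this peer reach the error path?
 * Only if the progress loop is still polling its CQ. */
static bool flush_reaches_error_path(const struct endpoint *ep)
{
    return ep->polling;
}
```

This only helps if both sides have stopped polling before either one
disconnects, which would need some kind of handshake; whether the
current shutdown sequence provides one is what would have to be
checked.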

-- 
Jeff Squyres
Cisco Systems