Open MPI Development Mailing List Archives

From: Steve Wise (swise_at_[hidden])
Date: 2007-05-09 15:54:58


On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> I agree OMPI trac ticket #890 should cover this. I will test the
> suggested fix, just removing that one line from btl_udapl.c, on Solaris.
> I am still not set up on Linux so hopefully Steve can confirm there.
>

All,

First, I haven't tested Arlin's dat_ep_query() fix yet as we have
determined it's not needed. The OMPI udapl btl never calls
dat_ep_query()...

So running OMPI with the suggested fix (removing the overwriting of the
hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp
rnic still doesn't work.

There are two new issues so far:

1) This has uncovered a connection migration issue in the Chelsio
driver/firmware. We are developing and testing a fix for this now.
It should hopefully be ready tomorrow.

2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
in this case OMPI, initiating the iwarp connection (the side issuing the
dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
message. So if an OMPI process _accepts_ an rdma connection, then it
cannot send on that connection until it receives some sort of rdma
operation from the client process. It appears the current OMPI
connection setup model doesn't enforce this.

This combined with the bug above causes an immediate connection failure
on chelsio's rnic. After I fix #1 above, things might get slightly
better but my guess is we will still have connection setup problems if
the server side sends before the client side finishes streaming->rdma
mode transition.

There have been a series of discussions on the ofa general list about
this issue, and the conclusion to date is that it cannot be resolved in
the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because
sending an RDMA message involves the ULP's work queue and completion
queue, so the CM cannot do this under the covers in a manner that
doesn't affect the application. Thus, the applications must deal with
this.

Here is a possible solution:

I assume in OMPI that connections are only initiated when the mpi
application does a send operation. Given that, the udapl btl must
ensure that if a given rank accepts a connection, it does not send
anything until the rank at the other end of the connection sends first.
Since the other side initiated the connection, it will have pending data
to send...
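
To make the idea concrete, here is a rough sketch in C of the gating
described above. It is not taken from btl_udapl.c: every name in it
(struct endpoint, ep_send, ep_on_recv, post_send, MAX_DEFERRED) is
hypothetical, and post_send() only counts messages where real code would
post to the work queue. The accepting side queues its sends until the
first message from the initiator arrives, then flushes them in order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types and names; this is not the real udapl btl code. */
enum conn_role { ROLE_INITIATOR, ROLE_ACCEPTOR };

#define MAX_DEFERRED 8

struct endpoint {
    enum conn_role role;
    bool peer_has_sent;                  /* true once the peer's first message arrived */
    const char *deferred[MAX_DEFERRED];  /* sends held back on the accepting side */
    size_t n_deferred;
    size_t n_posted;                     /* sends actually handed to the "wire" */
};

/* Stand-in for posting a send to the work queue; here it only counts. */
static void post_send(struct endpoint *ep, const char *msg)
{
    assert(ep != NULL && msg != NULL);
    ep->n_posted++;
}

/* The initiator may always send; the acceptor only after the peer has sent. */
static bool may_send(const struct endpoint *ep)
{
    return ep->role == ROLE_INITIATOR || ep->peer_has_sent;
}

/* Returns 0 if the message was posted or queued, -1 if the queue is full. */
static int ep_send(struct endpoint *ep, const char *msg)
{
    if (may_send(ep)) {
        post_send(ep, msg);
        return 0;
    }
    if (ep->n_deferred == MAX_DEFERRED)
        return -1;
    ep->deferred[ep->n_deferred++] = msg;
    return 0;
}

/* Call when any message arrives; the first arrival releases queued sends. */
static void ep_on_recv(struct endpoint *ep)
{
    if (ep->peer_has_sent)
        return;
    ep->peer_has_sent = true;
    for (size_t i = 0; i < ep->n_deferred; i++)
        post_send(ep, ep->deferred[i]);
    ep->n_deferred = 0;
}
```

With this shape, an endpoint created with role ROLE_ACCEPTOR holds
anything passed to ep_send() until ep_on_recv() runs once, while a
ROLE_INITIATOR endpoint posts immediately. The fixed-size queue is only
to keep the sketch short; a real btl would presumably reuse whatever
pending-fragment list it already has.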

I haven't looked into how painful this will be to implement.

Thoughts?

FYI:

IETF Draft requiring this behavior:

http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt

See section 7 for specifics.

Steve.