Open MPI Development Mailing List Archives

From: Brad Penoff (penoff_at_[hidden])
Date: 2007-10-22 15:06:12


We have some questions about the best way to use Open MPI's features
for a new BTL. The general theme is when to rely on opal_events and
when to implement a btl_progress function. When is one preferable to
the other?

We are working on several designs for an SCTP BTL for Open MPI. The
familiar one uses "TCP-style" one-to-one sockets, with one socket per
endpoint pair, just as the TCP BTL does now. A less familiar one uses
a single "UDP-style" one-to-many socket per BTL. To illustrate,
suppose you have 3 processes: each process has only one socket, over
which connections are established and messages are sent to and
received from the other two processes. It is this one-to-many design
that we currently have questions about.

So far, we have not implemented our own btl_progress function. This
means that within opal_progress(), poll() is called based on the opal
events registered within the BTL. As in the TCP BTL, for example, when
an MPI_Send happens, the endpoint_send_event is added and POLLOUT is
registered on the socket for the given endpoint. Since MPI_Send is
blocking, it doesn't really matter that this socket is shared with
other btl_endpoints, because only one endpoint has a send event
registered at a time. However, this is not the case with non-blocking
sends.
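
To make the POLLOUT check above concrete, here is a minimal sketch
using plain POSIX poll(). The helper name writable_now is hypothetical
and is not part of Open MPI or its event library; the real event loop
dispatches registered callbacks rather than polling one descriptor
directly.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd can be written without
 * blocking right now, 0 if not, -1 if poll() itself failed. */
static int writable_now(int fd)
{
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };

    /* Timeout 0: report the current state and return immediately. */
    int rc = poll(&pfd, 1, 0);
    if (rc < 0) {
        return -1;
    }
    return (rc == 1 && (pfd.revents & POLLOUT)) ? 1 : 0;
}
```

With a one-to-many socket there is only ever one descriptor to ask
about, so the check says nothing about which endpoint should be
allowed to write next.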

When we have multiple outstanding non-blocking requests to different
endpoints, we have to queue them since the endpoints share the same
one-to-many socket and events are associated with a single socket.

From proc C, say we have this pseudo code running:
iSend(proc A)
iSend(proc B)

Within Waitall, our current design using opal events has the iSend to
proc A eventually complete, but the iSend to proc B can't start until
proc A's is done. We currently queue the endpoints waiting for the
poll() POLLOUT event. When the event for proc A's endpoint is deleted,
we dequeue the next endpoint (proc B's) and add it to the POLLOUT
event.
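
The queueing described above can be sketched roughly as follows. This
is a simplified, self-contained model and not our actual BTL code: the
types endpoint_t and send_queue_t and the function on_pollout are
hypothetical names, and the "writable" argument stands in for however
many bytes the socket accepts on one POLLOUT event.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for a BTL endpoint; not the Open MPI type. */
typedef struct {
    int    peer;        /* rank of the remote process */
    size_t bytes_left;  /* bytes of the current send still unsent */
} endpoint_t;

/* FIFO of endpoints waiting for the socket's single POLLOUT event. */
typedef struct {
    endpoint_t *slots[MAX_PENDING];
    size_t      head;
    size_t      count;
} send_queue_t;

/* Queue an endpoint; returns 0 on success, -1 if the queue is full. */
static int queue_push(send_queue_t *q, endpoint_t *ep)
{
    assert(q != NULL && ep != NULL);
    if (q->count == MAX_PENDING) {
        return -1;
    }
    q->slots[(q->head + q->count) % MAX_PENDING] = ep;
    q->count++;
    return 0;
}

/* The endpoint that currently owns the POLLOUT event, or NULL. */
static endpoint_t *queue_owner(const send_queue_t *q)
{
    return q->count ? q->slots[q->head] : NULL;
}

/* Socket became writable: make progress for the owner only. When the
 * owner's send finishes, the next queued endpoint becomes the owner.
 * Returns the peer whose send completed, or -1 if none did. */
static int on_pollout(send_queue_t *q, size_t writable)
{
    endpoint_t *ep = queue_owner(q);
    if (ep == NULL) {
        return -1;
    }
    size_t n = writable < ep->bytes_left ? writable : ep->bytes_left;
    ep->bytes_left -= n;
    if (ep->bytes_left > 0) {
        return -1;              /* owner keeps the event */
    }
    q->head = (q->head + 1) % MAX_PENDING;
    q->count--;
    return ep->peer;
}
```

In this model a short send queued behind a long one makes no progress
at all until the long one has drained, which is exactly the
serialization we would like to avoid.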

Can you think of a way, using the existing framework, to eliminate the
restriction that the send to proc A has to complete before the send to
proc B can start? We were trying to use the existing framework, but
for our case it may make more sense to implement our own btl_progress
function, since poll() doesn't really make sense for a single socket
anyway. Do you think that would be best?
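
For comparison, here is one way a custom progress function might
interleave sends on the shared socket. This is only a sketch under
assumptions: one_to_many_t, pending_send_t and progress_once are
hypothetical names, the fragment size is arbitrary, and the actual
socket write is replaced by bookkeeping so the example stands alone.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PEERS 3

/* Hypothetical per-peer send state; not an Open MPI structure. */
typedef struct {
    size_t bytes_left;   /* 0 means nothing pending for this peer */
} pending_send_t;

typedef struct {
    pending_send_t peers[NUM_PEERS];
    int            next; /* peer to try first on the next call */
} one_to_many_t;

/* Send at most `frag` bytes for ONE peer per call, rotating through
 * the peers so no pending send waits for another to finish.
 * Returns the number of sends completed by this call (0 or 1). */
static int progress_once(one_to_many_t *s, size_t frag)
{
    assert(s != NULL && frag > 0);

    for (int i = 0; i < NUM_PEERS; i++) {
        int p = (s->next + i) % NUM_PEERS;
        pending_send_t *ps = &s->peers[p];

        if (ps->bytes_left == 0) {
            continue;
        }
        size_t n = frag < ps->bytes_left ? frag : ps->bytes_left;
        /* A real BTL would write n bytes to the socket here. */
        ps->bytes_left -= n;
        s->next = (p + 1) % NUM_PEERS;
        return ps->bytes_left == 0 ? 1 : 0;
    }
    return 0;
}
```

With a long send to one peer and a short send to another both pending,
the short one completes on the second call instead of waiting for the
long one to drain.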

We noticed that mca_bml_r2_progress calls btl_progress[i](), which is
set in mca_bml_r2_add_procs if NULL !=
btl->btl_component->btl_progress. Is there an example of a BTL that
implements its own btl_progress function? I just want to make sure
this is even a possibility before traveling down this path, and
maybe learn from what others have done.
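
As we understand it, the registration pattern looks roughly like the
sketch below. Everything here is a simplified stand-in: component_t,
progress_table, register_components and progress_all are hypothetical
names for illustration and do not reproduce the real r2 BML code or
its signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical progress callback: returns the number of events handled. */
typedef int (*progress_fn_t)(void);

/* Hypothetical component descriptor; not the Open MPI structure. */
typedef struct {
    const char   *name;
    progress_fn_t progress;  /* NULL if the component uses events only */
} component_t;

#define MAX_PROGRESS 4

static progress_fn_t progress_table[MAX_PROGRESS];
static size_t        progress_count;

/* Collect the progress functions of components that provide one. */
static void register_components(const component_t *c, size_t n)
{
    assert(c != NULL || n == 0);

    progress_count = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].progress != NULL && progress_count < MAX_PROGRESS) {
            progress_table[progress_count++] = c[i].progress;
        }
    }
}

/* Call every registered progress function once and sum the results. */
static int progress_all(void)
{
    int events = 0;
    for (size_t i = 0; i < progress_count; i++) {
        events += progress_table[i]();
    }
    return events;
}

/* Example callback that pretends to have handled one event. */
static int example_progress(void)
{
    return 1;
}
```

A component that leaves its progress pointer NULL is simply skipped,
which matches how we have been running so far.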

Thanks ahead of time for any help!