Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-22 20:19:20

On Oct 22, 2007, at 3:06 PM, Brad Penoff wrote:

> We had some questions about the best way to make use of Open MPI's
> features for a new BTL... the general theme is making use of the
> opal_event's versus a btl_progress function. When is it best to do
> one versus the other?

In our Paris engineering meeting, we had a lengthy discussion about a
related topic. That conversation will lead to a few things:

- We'll be updating libevent in the not-distant future (see previous
mail today about that)
- After updating libevent, we'll be updating to use the more modern
epoll (and friends) interfaces. They're currently disabled by hand in
our copy of libevent, for good reasons that are too boring to describe
(but I can if you care).
- BTLs with a device under them are free to use libevent for fd-based
progress and/or a progress function. Software layers without
underlying devices should not use progress functions.
- We'll eventually be adding a blocking interface to the BTLs. More
info TBD on that.

> We are working on several designs for an SCTP BTL for Open MPI. The
> familiar one is to use "TCP-style" one-to-one sockets, which have a
> socket per endpoint pair, just like the TCP BTL does now. However, a
> more unfamiliar one is to use a single "UDP-style" one-to-many socket
> per BTL. To illustrate, pretend you have 3 processes... each process
> only has one socket upon which connections are established, messages
> are sent, and messages are received to/from the other two processes.
> It is this design that currently we have some questions about....
> So far, we have not been implementing our own btl_progress function.
> This means that within opal_progress(), poll() is called based on the
> opal events registered within the BTL. Like TCP, for example, when an
> MPI_Send happens, the endpoint_send_event is added and POLLOUT is
> added for this socket for a given endpoint. Since MPI_Send is
> blocking, it doesn't really matter that this socket is used for other
> btl_endpoints because it is the only endpoint with an opal event for
> sending added. However, this is not the case with non-blocking...
> When we have multiple outstanding non-blocking requests to different
> endpoints, we have to queue them since the endpoints share the same
> one-to-many socket and events are associated with a single
> btl_endpoint.
> From proc C, say we have this pseudo code running:
> iSend(proc A)
> iSend(proc B)
> Waitall()
> Within Waitall, our current design using opal events has the iSend to
> proc A eventually complete but prior to this, the iSend to proc B
> can't start until proc A's is done. We currently queue the endpoints
> waiting for the poll() POLLOUT event and dequeue from this queue when
> the event from proc A's endpoint is deleted (and add proc B's endpoint
> to the POLLOUT event).
> Can you think of a way using the existing framework to eliminate the
> restriction of the send to proc B having to complete prior to the send
> to proc B starting?

I assume you meant "send to proc *A* having to complete..."
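
The serialization described in the quoted text can be modeled in a few
lines. This is only a minimal sketch, not Open MPI code: endpoint_t,
shared_sock_t, post_send() and on_writable() are hypothetical names, and
the real socket write is replaced by simple byte accounting. It assumes
one endpoint at a time owns the POLLOUT event, while the others wait in
a FIFO.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITING 8

/* Hypothetical endpoint: a peer id and the bytes still to be sent. */
typedef struct {
    int    peer;
    size_t bytes_left;
} endpoint_t;

/* One shared socket: a single endpoint owns POLLOUT, the rest queue up. */
typedef struct {
    endpoint_t *active;                /* endpoint currently owning POLLOUT */
    endpoint_t *waiting[MAX_WAITING];  /* FIFO ring buffer of waiters */
    size_t      head;
    size_t      count;
} shared_sock_t;

/* Post a send: take POLLOUT if it is free, otherwise wait in the FIFO. */
static void post_send(shared_sock_t *s, endpoint_t *ep)
{
    if (NULL == s->active) {
        s->active = ep;
    } else {
        assert(s->count < MAX_WAITING);
        s->waiting[(s->head + s->count) % MAX_WAITING] = ep;
        s->count++;
    }
}

/* Handle one "socket is writable" event: send up to `chunk` bytes for the
 * active endpoint only. Returns the peer served, or -1 if nothing is
 * pending. */
static int on_writable(shared_sock_t *s, size_t chunk)
{
    endpoint_t *ep = s->active;
    if (NULL == ep) {
        return -1;
    }
    size_t n = ep->bytes_left < chunk ? ep->bytes_left : chunk;
    ep->bytes_left -= n;
    int peer = ep->peer;
    if (0 == ep->bytes_left) {
        /* Send finished: hand POLLOUT to the next waiter, if any. */
        if (s->count > 0) {
            s->active = s->waiting[s->head];
            s->head = (s->head + 1) % MAX_WAITING;
            s->count--;
        } else {
            s->active = NULL;
        }
    }
    return peer;
}
```

With a 300-byte send to A posted before a 100-byte send to B, and 100
bytes accepted per writable event, the first three events all serve A.
B's send does not start until the fourth event, even though it is the
shorter message.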

> We were trying to use the existing framework but for our case, it
> may make more sense to implement our own btl_progress function
> since poll() doesn't really make sense for a single socket
> anyway... Do you think that would be best?

I guess I don't quite understand -- are you saying that you can have
2 concurrent writes occurring on the same socket to 2 different
endpoints?

If so, and if libevent doesn't match the SCTP paradigm, then I say:
sure, write your own progress function.
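
As a rough illustration of what such a progress function might do for a
single one-to-many socket, here is a minimal sketch. It is not taken
from any existing BTL: one_to_many_t, fake_write() and sketch_progress()
are hypothetical names, fake_write() stands in for a non-blocking send
that names its destination, and the return convention (number of sends
completed in this pass) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PEERS 4

/* Hypothetical per-peer state: bytes still pending (0 = nothing to send). */
typedef struct {
    size_t bytes_left;
} pending_send_t;

/* Hypothetical state for one shared one-to-many socket. */
typedef struct {
    pending_send_t pending[NUM_PEERS];
    size_t         next;   /* peer at which the next sweep starts */
} one_to_many_t;

/* Stand-in for a non-blocking send: accepts at most `chunk` bytes. */
static size_t fake_write(size_t want, size_t chunk)
{
    return want < chunk ? want : chunk;
}

/* One progress pass: every peer with pending data gets one chunk, so no
 * send has to wait for another to finish. Returns the number of sends
 * that completed during this pass. */
static int sketch_progress(one_to_many_t *s, size_t chunk)
{
    int completed = 0;
    assert(chunk > 0);
    for (size_t i = 0; i < NUM_PEERS; i++) {
        size_t peer = (s->next + i) % NUM_PEERS;
        pending_send_t *p = &s->pending[peer];
        if (0 == p->bytes_left) {
            continue;
        }
        p->bytes_left -= fake_write(p->bytes_left, chunk);
        if (0 == p->bytes_left) {
            completed++;
        }
    }
    /* Rotate the starting peer so no peer is always served first. */
    s->next = (s->next + 1) % NUM_PEERS;
    return completed;
}
```

With 300 bytes pending for peer 0 and 100 bytes for peer 1, the first
pass already completes the send to peer 1 while peer 0 is still in
flight, which is the interleaving the queued POLLOUT design cannot give.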

George: can you confirm / deny?

> We noticed that mca_bml_r2_progress calls btl_progress[i]() which is
> set in mca_bml_r2_add_procs if NULL !=
> btl->btl_component->btl_progress. Is there an example of a btl that
> implements its own btl_progress function? I just want to make sure
> this is even a possibility before traveling down this path... and
> maybe learn from others prior.

The openib btl has its own progress function.
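
The registration pattern described in the quoted question (keep a
progress callback only when the component's pointer is non-NULL, then
call every kept callback from the progress loop) can be sketched as
follows. This is a simplified model, not the actual bml r2 code:
component_t, progress_table_t, register_component() and progress_all()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMPONENTS 8

/* A progress callback returns the number of events it handled. */
typedef int (*progress_fn_t)(void);

/* Hypothetical component descriptor. */
typedef struct {
    const char   *name;
    progress_fn_t progress;   /* NULL if the component is purely event-driven */
} component_t;

/* Table of the progress callbacks that were actually provided. */
typedef struct {
    progress_fn_t fns[MAX_COMPONENTS];
    size_t        count;
} progress_table_t;

/* Record the component's progress function only if it has one. */
static void register_component(progress_table_t *t, const component_t *c)
{
    if (NULL != c->progress) {
        assert(t->count < MAX_COMPONENTS);
        t->fns[t->count++] = c->progress;
    }
}

/* Call every registered progress function; return the total event count. */
static int progress_all(const progress_table_t *t)
{
    int events = 0;
    for (size_t i = 0; i < t->count; i++) {
        events += t->fns[i]();
    }
    return events;
}

/* Hypothetical progress function that reports one handled event. */
static int polling_progress(void)
{
    return 1;
}
```

A component that leaves its pointer NULL is simply skipped at
registration time, so it costs nothing in the progress loop.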

Jeff Squyres
Cisco Systems