Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-22 20:19:20


On Oct 22, 2007, at 3:06 PM, Brad Penoff wrote:

> We had some questions about the best way to make use of Open MPI's
> features for a new BTL... the general theme is making use of the
> opal_event's versus a btl_progress function. When is it best to do
> one versus the other?

In our Paris engineering meeting, we had a lengthy discussion about a
related topic. That conversation will result in a few things:

- We'll be updating libevent in the not-distant future (see previous
mail today about that)
- After updating libevent, we'll be updating to use the more modern
epoll (and friends) interfaces. They're manually disabled in our
libevent, with good reason, though the details are too boring to
describe (but I can if you care).
- BTLs with a device under them are free to use libevent for fd-based
progress and/or a progress function. Software layers without
underlying devices should not use progress functions.
- We'll eventually be adding a blocking interface to the BTLs. More
info TBD on that.

> We are working on several designs for an SCTP BTL for Open MPI. The
> familiar one is to use "TCP-style" one-to-one sockets, which have a
> socket per endpoint pair, just like the TCP BTL does now. However, a
> more unfamiliar one is to use a single "UDP-style" one-to-many socket
> per BTL. To illustrate, pretend you have 3 processes... each process
> only has one socket upon which connections are established, messages
> are sent, and messages are received to/from the other two processes.
> It is this design that currently we have some questions about....
>
> So far, we have not been implementing our own btl_progress function.
> This means that within opal_progress(), poll() is called based on the
> opal events registered within the BTL. Like TCP, for example, when an
> MPI_Send happens, the endpoint_send_event is added and POLLOUT is
> added for this socket for a given endpoint. Since MPI_Send is
> blocking, it doesn't really matter that this socket is used for other
> btl_endpoints because it is the only endpoint with an opal event for
> sending added. However, this is not the case with non-blocking...
>
> When we have multiple outstanding non-blocking requests to different
> endpoints, we have to queue them since the endpoints share the same
> one-to-many socket and events are associated with a single
> btl_endpoint.
>
> From proc C, say we have this pseudo code running:
> iSend(proc A)
> iSend(proc B)
> Waitall()
>
> Within Waitall, our current design using opal events has the iSend to
> proc A eventually complete but prior to this, the iSend to proc B
> can't start until proc A's is done. We currently queue the endpoints
> waiting for the poll() POLLOUT event and dequeue from this queue when
> the event from proc A's endpoint is deleted (and add proc B's endpoint
> to the POLLOUT event).
>
> Can you think of a way using the existing framework to eliminate the
> restriction of the send to proc B having to complete prior to the send
> to proc B starting?

I assume you meant "send to proc *A* having to complete..."

> We were trying to use the existing framework but for our case, it
> may make more sense to implement our own btl_progress function
> since poll() doesn't really make sense for a single socket
> anyway... Do you think that would be best?

I guess I don't quite understand -- are you saying that you can have
2 concurrent writes occurring on the same socket to 2 different
destinations?

If so, and if libevent doesn't match the SCTP paradigm, then I say:
sure, write your own progress function.
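For illustration only, here is a minimal sketch of what such a progress function might look like if the one-to-many socket does accept interleaved writes to different peers. Every name in it (one_to_many_btl, pending_send, post_send, sketch_progress, chunked_send) is hypothetical and not part of Open MPI, and the send hook stands in for whatever non-blocking SCTP send call the BTL would really use. Each peer gets its own pending slot, so a partially sent message to proc A does not hold up the send to proc B:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Hypothetical: one in-flight message to one peer. */
struct pending_send {
    const char *buf;
    size_t len;
    size_t sent;
    int active;
};

/* Hypothetical send hook: returns the number of bytes accepted,
 * or 0 if the socket would block. */
typedef size_t (*send_fn)(int peer, const char *data, size_t len);

struct one_to_many_btl {
    struct pending_send pending[MAX_PEERS]; /* one slot per peer */
    send_fn send;                           /* shared socket, abstracted */
};

/* Queue a send to a peer; returns -1 if that peer's slot is busy. */
int post_send(struct one_to_many_btl *btl, int peer,
              const char *buf, size_t len)
{
    assert(peer >= 0 && peer < MAX_PEERS);
    struct pending_send *p = &btl->pending[peer];
    if (p->active) {
        return -1;
    }
    p->buf = buf;
    p->len = len;
    p->sent = 0;
    p->active = 1;
    return 0;
}

/* One progress pass: give every peer with pending data a chance to
 * write. Returns the number of sends completed in this pass. */
int sketch_progress(struct one_to_many_btl *btl)
{
    int completed = 0;
    for (int peer = 0; peer < MAX_PEERS; ++peer) {
        struct pending_send *p = &btl->pending[peer];
        if (!p->active) {
            continue;
        }
        p->sent += btl->send(peer, p->buf + p->sent, p->len - p->sent);
        if (p->sent == p->len) {
            p->active = 0;
            ++completed;
        }
    }
    return completed;
}

/* Stand-in for the real socket: accepts at most 4 bytes per call. */
size_t chunked_send(int peer, const char *data, size_t len)
{
    (void)peer;
    (void)data;
    return len < 4 ? len : 4;
}
```

With an 8-byte send posted to peer 0 and a 4-byte send posted to peer 1, the first pass completes the send to peer 1 while the send to peer 0 is only half done, which is the interleaving the event-per-endpoint design above cannot express.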

George: can you confirm / deny?

> We noticed that mca_bml_r2_progress calls btl_progress[i]() which is
> set in mca_bml_r2_add_procs if NULL !=
> btl->btl_component->btl_progress. Is there an example of a btl that
> implements its own btl_progress function? I just want to make sure
> this is even a possibility before traveling down this path... and
> maybe learn from others prior.

The openib btl has its own progress function.
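
The registration pattern described in the quoted question (record a component's progress function only if it is non-NULL, then call each recorded function from the progress loop) can be sketched as follows. This is not the r2 BML code: sketch_component, sketch_register, sketch_progress_all, and polling_progress are hypothetical names, and the duplicate check is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROGRESS 16

/* A progress function returns the number of events it handled. */
typedef int (*progress_fn)(void);

/* Hypothetical stand-in for a BTL component descriptor. */
struct sketch_component {
    const char *name;
    progress_fn progress; /* NULL if the component is purely event-driven */
};

static progress_fn registered[MAX_PROGRESS];
static size_t num_registered;

/* Record the component's progress function if it has one and it is
 * not already in the table. */
void sketch_register(const struct sketch_component *c)
{
    assert(c != NULL);
    if (NULL == c->progress) {
        return;
    }
    for (size_t i = 0; i < num_registered; ++i) {
        if (registered[i] == c->progress) {
            return;
        }
    }
    if (num_registered < MAX_PROGRESS) {
        registered[num_registered++] = c->progress;
    }
}

/* Call every registered progress function once and sum the results. */
int sketch_progress_all(void)
{
    int count = 0;
    for (size_t i = 0; i < num_registered; ++i) {
        count += registered[i]();
    }
    return count;
}

/* Example progress function that pretends it handled two events. */
int polling_progress(void)
{
    return 2;
}
```

A component that leaves its progress pointer NULL is simply never called from this loop and relies on the event library instead.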

-- 
Jeff Squyres
Cisco Systems