Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-04-16 17:12:47

On Apr 15, 2007, at 10:25 PM, chaitali dherange wrote:

> To make things simple, we are making this scheduling static to some
> extent... by static I mean.. we know that our clusters use
> Infiniband for MPI ( from our study of the openmpi source code this
> precisely uses the 'mca_btl_openib_send()' from the ompi/mca/btl/
> openib/btl_openib.c file) ... so all the non MPI communication can
> be assumed to be TCP communication using the 'mca_btl_tcp_send()'
> from the ompi/mca/btl/tcp/btl_tcp.c file.
> To implement this, we plan to use the following simple algorithm:
> - before calling the 'mca_btl_openib_send()' lock0(X);
> - before calling the 'mca_btl_tcp_send()' lock1(X);
> Algo:
> 1. Allow Lock0(x) -> Lock0(x);.. meaning Lock0(x) is followed by
> Lock0(x).
> 2. Allow Lock1(x) -> Lock1(x);
> 3. Do not allow Lock0(x) -> Lock1(x);
> 4. If Lock1(x) -> Lock0(x).... since MPI calls are to be higher
> priority over the non MPI ones.. in this case the non MPI
> communication should be paused and all the related data of course
> needs to be put into a queue(meaning the status of this should be
> saved in a queue). All other non MPI communications newer than this
> should also be added to this same queue. Now the MPI process trying
> to perform Lock0(x) should be allowed to complete and only when all
> the MPI communications are complete should the non MPI
> communication be allowed.
> Currently we are working on a simple scheduling algorithm without
> giving any priorities to the 'MPI_send' calls.
> However to implement the project fully, we have the following
> queries :(
> -Can we abort or pause the non-MPI/TCP communication in any way???

Not really; the BTL interface was not designed for that.
Indeed, even if you wrote your own socket code to use TCP sockets
outside of MPI / BTL / etc., you don't have full control of exactly
what is sent (or when). For example, if you write(fd, ...) and then
decide you want to pause it, how would you do so? You can stop
calling write(), but that's not enough. The kernel may have copied
your buffer to a lower level and may be progressing the actual send
behind the scenes. So you haven't *guaranteed* that only one network
interface is utilizing the host's resources (RAM, kernel, memory
busses, etc.) at one time.
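
A minimal sketch of the write() point, assuming a POSIX system and using
a local socketpair() as a stand-in for a TCP connection.  The function
name demo_kernel_buffering() is hypothetical and is not part of Open MPI.
It shows that write() returns as soon as the kernel has accepted the
bytes, and that the sender walking away afterwards does not take them
back:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo helper (not an Open MPI function).
 * Writes msg into one end of a local socket pair, then abandons the
 * sending side before the peer has read anything.
 * *accepted receives the return value of write().
 * Returns the number of bytes the peer reads afterwards, or -1 on error. */
ssize_t demo_kernel_buffering(const char *msg, ssize_t *accepted)
{
    int fds[2];
    char buf[256];
    size_t len = strlen(msg);

    if (len > sizeof(buf) || socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* write() returns once the kernel has copied the bytes; nobody has
     * called read() on fds[1] yet. */
    *accepted = write(fds[0], msg, len);

    /* "Pausing" by never touching the sending side again does not recall
     * data the kernel already holds. */
    close(fds[0]);

    ssize_t got = read(fds[1], buf, sizeof(buf));
    close(fds[1]);
    return got;
}
```

Calling demo_kernel_buffering("hello", &accepted) should leave both
accepted and the return value equal to the message length.  With a real
TCP socket the kernel is additionally free to put those bytes on the
wire whenever it likes, which is the part the application cannot
schedule.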

Indeed, the BTL interface is designed to acknowledge this
asynchronicity -- it *assumes* that all network actions are non-
blocking such that a "Send" action only *begins* the send; completion
occurs later.
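
The "send only begins the send" model can be sketched roughly as follows.
This is a toy illustration, not the actual BTL API: every name here
(toy_endpoint, toy_send, toy_progress, toy_count_done) is made up for the
example, and the "wire" is just a byte budget handed to the progress
function.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

typedef void (*toy_completion_fn)(void *ctx);

struct toy_request {
    size_t remaining;        /* bytes not yet moved */
    toy_completion_fn done;  /* invoked once remaining reaches 0 */
    void *ctx;
};

struct toy_endpoint {
    struct toy_request pending[TOY_MAX_PENDING];  /* FIFO of started sends */
    int count;
};

/* Begin a send: only records the request, moves no data.
 * Returns 0 on success, -1 if the pending queue is full. */
int toy_send(struct toy_endpoint *ep, size_t len,
             toy_completion_fn done, void *ctx)
{
    if (ep->count == TOY_MAX_PENDING)
        return -1;
    ep->pending[ep->count++] = (struct toy_request){ len, done, ctx };
    return 0;
}

/* Progress: moves up to `budget` bytes, oldest request first, and fires
 * the completion callback of every request that finishes.
 * Returns the number of requests completed by this call. */
int toy_progress(struct toy_endpoint *ep, size_t budget)
{
    int completed = 0;
    while (ep->count > 0 && budget > 0) {
        struct toy_request *r = &ep->pending[0];
        size_t step = r->remaining < budget ? r->remaining : budget;
        r->remaining -= step;
        budget -= step;
        if (r->remaining == 0) {
            struct toy_request fin = *r;
            for (int i = 1; i < ep->count; i++)
                ep->pending[i - 1] = ep->pending[i];
            ep->count--;
            if (fin.done)
                fin.done(fin.ctx);
            completed++;
        }
    }
    return completed;
}

/* Example callback: counts completions through an int pointer. */
void toy_count_done(void *ctx) { ++*(int *)ctx; }
```

In this sketch toy_send() returning 0 says nothing about delivery; the
caller only learns that a message is finished when a later
toy_progress() call invokes its callback.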

So even if you use the TCP BTL to queue up a bunch of writes, if you
then get an IB BTL send request, there isn't a good way to tell the
TCP BTL "stop doing anything until I tell you otherwise" (i.e., don't
process incoming reads and don't progress any further writes). :-\

> -Given the assumption that the non-MPI communication is TCP, can we
> make use of the built-in structures (I mean the buffer already
> used) in mca_btl_tcp_send() for the implementation of pt.4 in the
> above mentioned algorithm??? and more importantly how?

Not really :-(. The BTLs, by design, are mutually unaware of each
other. In fact, the BTLs are quite dumb (as intended). The design
is for the caller to perform any higher-level coordination, while the
BTLs act as simple bit-movers between processes.

Using the BTLs directly, the best you might be able to do is to stop
queuing up new messages to a secondary BTL until you have completions
from all pending traffic on a primary BTL. That might still be
interesting, but it may not give you everything that you want --
especially since a) I'm guessing that your ultimate goal may be to
multi-schedule multiple communication libraries across the *same*
interconnect, and b) given the asynchronous nature of parallel
computing, you might be able to do a half-decent job of *sending*
scheduling, but you may not be able to predict the behavior of
*receive* scheduling (e.g., how can you predict/schedule that a low
priority receive would not be occurring at the same time on the same
node as a high priority send?).
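
The "stop queuing to the secondary until the primary has drained" idea
could look roughly like the sketch below.  It is caller-side bookkeeping
only and assumes the caller sees a completion event for every primary
send; the names (send_gate, gate_primary_start, gate_secondary_submit,
gate_primary_complete) are hypothetical, not Open MPI functions, and the
actual hand-off to a transport is reduced to a counter.

```c
#include <assert.h>

#define GATE_MAX_DEFERRED 16

struct send_gate {
    int primary_pending;              /* primary sends started, not yet completed */
    int deferred[GATE_MAX_DEFERRED];  /* ids of held-back secondary messages, FIFO */
    int deferred_count;
    int secondary_started;            /* secondary sends actually issued so far */
};

/* Call when a primary (high priority) send is started. */
void gate_primary_start(struct send_gate *g) { g->primary_pending++; }

/* Submit a secondary (low priority) message.
 * Returns 1 if it was issued immediately, 0 if it was deferred,
 * -1 if the deferred queue is full. */
int gate_secondary_submit(struct send_gate *g, int msg_id)
{
    if (g->primary_pending == 0 && g->deferred_count == 0) {
        /* A real caller would hand msg_id to the secondary transport here. */
        g->secondary_started++;
        return 1;
    }
    if (g->deferred_count == GATE_MAX_DEFERRED)
        return -1;
    g->deferred[g->deferred_count++] = msg_id;
    return 0;
}

/* Call from the completion callback of a primary send.
 * Returns how many deferred secondary messages were released. */
int gate_primary_complete(struct send_gate *g)
{
    if (g->primary_pending > 0)
        g->primary_pending--;
    if (g->primary_pending > 0)
        return 0;
    /* Primary traffic has drained: issue everything held back, in order. */
    int released = g->deferred_count;
    g->secondary_started += released;
    g->deferred_count = 0;
    return released;
}
```

Note what this does not do: a secondary send that was already issued
before gate_primary_start() is not paused, and nothing here controls
incoming traffic, so it only addresses the *sending* half of the problem.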

> Regards,
> Chaitali

Jeff Squyres
Cisco Systems