Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-04-04 10:23:44

On Apr 3, 2007, at 4:57 PM, pooja_at_[hidden] wrote:

> I need to find when the underlying network is free. Means I dont
> need to
> go into the details of how MPi_send is implemented.

Ah, ok. That explains a lot.

> What I want to know is when the MPI_Send is started .Or rather when
> MPi
> does not use the underlying network.
> I need to find timing for
> 1) When the application issue send command

This (and #5) can be implemented with a PMPI-based intercept library
(I assume that by "command", you mean "API function call").

> 2) When Mpi actually issues send command
> 3) When does BTl perform atual transfer(send)

What are you looking to distinguish here? I.e., what is the
difference between 1 and 2 vs. 3?

Open MPI has an MPI_Send() function in C that does some error
checking and then invokes an underlying "send" function (via function
pointer) to a plugin that starts doing the setup for the MPI
semantics for the send. Eventually, another function pointer is used
to invoke the "send" function in the BTL to actually send the
message. More setup is performed down in the BTL (usually dealing
with setting up data structures to invoke the underlying network/OS/
driver "send" function that starts the network send), and then we
invoke some underlying OS/kernel-bypass function to start the network
transfer. Note that all we can guarantee is that the transfer starts
sometime after that -- there's no way to know *exactly* when it
starts because the underlying kernel driver may choose to defer it
for a while based on flow control, available resources, etc.

Specifically, similar to one of my prior e-mails, the calling
structure is something like this:

MPI_Send()
   --> PML plugin (usually the "ob1" plugin)
      --> BTL plugin (one of the components in ompi/mca/btl/)
         --> underlying OS/kernel-bypass function
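
To make the layering concrete, here is a toy model of that call chain in C. It is only a sketch: every name in it (toy_mpi_send, struct toy_pml, struct toy_btl, net_post_send) is a hypothetical stand-in, not a real Open MPI symbol, and the real plugin interfaces carry far more state. It shows nothing more than the idea of each layer reaching the next one through a function pointer.

```c
#include <string.h>

/* Toy model only: none of these names are real Open MPI symbols.
 * Each layer appends its name to a trace so the call order is visible. */
static char trace[128];

static void trace_add(const char *layer) {
    if (trace[0] != '\0')
        strncat(trace, " -> ", sizeof(trace) - strlen(trace) - 1);
    strncat(trace, layer, sizeof(trace) - strlen(trace) - 1);
}

/* Lowest layer: stand-in for the OS/kernel-bypass call that hands the
 * message to the driver. The driver may still defer the actual transfer. */
static int net_post_send(const void *buf, size_t len) {
    (void)buf; (void)len;
    trace_add("net");
    return 0;
}

/* A toy "BTL" exposes its send routine through a function pointer. */
struct toy_btl {
    const char *name;
    int (*send)(struct toy_btl *btl, const void *buf, size_t len);
};

static int toy_btl_send(struct toy_btl *btl, const void *buf, size_t len) {
    trace_add(btl->name);
    return net_post_send(buf, len);
}

/* A toy "PML" holds the BTL it selected and its own send function pointer. */
struct toy_pml {
    const char *name;
    struct toy_btl *btl;
    int (*send)(struct toy_pml *pml, const void *buf, size_t len, int dest);
};

static int toy_pml_send(struct toy_pml *pml, const void *buf, size_t len,
                        int dest) {
    (void)dest;
    trace_add(pml->name);
    return pml->btl->send(pml->btl, buf, len);
}

static struct toy_btl the_btl = { "btl", toy_btl_send };
static struct toy_pml the_pml = { "pml", &the_btl, toy_pml_send };

/* Top-level entry point: error checking, then dispatch into the PML. */
int toy_mpi_send(const void *buf, size_t len, int dest) {
    if (buf == NULL || dest < 0)
        return -1;
    trace[0] = '\0';
    trace_add("api");
    return the_pml.send(&the_pml, buf, len, dest);
}

/* Returns the layers visited by the most recent successful send. */
const char *toy_last_trace(void) {
    return trace;
}
```

Calling toy_mpi_send() once and then reading toy_last_trace() shows the order in which the layers were entered. A timestamp taken in each function would give the per-layer timings asked about in #1 through #3, subject to the caveat above that the driver decides when the transfer really starts.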

> 4) When doe send complete

By "complete", what exactly are you looking for? There are several
possible definitions here:

- when any of the "send" functions listed above returns
- when the underlying network driver tells us that it is complete
(a.k.a. "local completion" -- it *DOES NOT* imply that the receiver
has even started to receive the message, nor that the message has
even left the host yet)
- when the receiver ACKs receiving the message
- when MPI_Send() returns

FWIW, we usually measure local completion time because that's all
that we can know (because the underlying network driver makes its own
decisions about when messages are put out on the network, etc., and
we [i.e., any user-level networking software] don't have visibility
of that information).

> 5) Who was thr receiver.
> etc. this was an example of MPi_send.
> like this I need to know MPI_Isend,broadcast etc.
> I guess this can be done using PMPI.

Some of this can, yes.

> But PMPI can do it during profile stages while I want all this data
> during
> runtime.

I don't quite understand this statement -- PMPI is a run-time
profiling system. All it does is insert your shim PMPI layer between
the user's application and the "real" MPI layer.
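
A minimal sketch of such a shim for MPI_Send, in C. The MPI_Send/PMPI_Send pairing is the standard profiling interface; everything else here is an assumption made for illustration: the send_record bookkeeping, the now_seconds() helper, and the USE_REAL_MPI switch are hypothetical names, and the stand-in typedefs exist only so the sketch builds without an MPI installation. Note that older mpi.h headers declare the buffer as plain "void *" rather than "const void *", so the signature may need adjusting to match your header.

```c
#include <time.h>

#ifdef USE_REAL_MPI
#include <mpi.h>
#else
/* Stand-ins so the sketch builds without MPI installed. With a real MPI,
 * compile with -DUSE_REAL_MPI and these definitions are not used. */
typedef int MPI_Datatype;
typedef int MPI_Comm;
#define MPI_SUCCESS 0
int PMPI_Send(const void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm) {
    (void)buf; (void)count; (void)type; (void)dest; (void)tag; (void)comm;
    return MPI_SUCCESS;
}
#endif

/* Hypothetical bookkeeping for the most recent intercepted send. */
struct send_record {
    int    dest;      /* receiver rank */
    int    count;     /* number of elements sent */
    double t_enter;   /* when the application called MPI_Send */
    double t_return;  /* when MPI_Send returned to the application */
};

struct send_record last_send;
long num_sends;

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* The shim has the same name as the MPI function, so the application's
 * calls land here; the real work is done by the PMPI_ entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    last_send.dest = dest;
    last_send.count = count;
    last_send.t_enter = now_seconds();

    int rc = PMPI_Send(buf, count, type, dest, tag, comm);

    last_send.t_return = now_seconds();
    num_sends++;
    return rc;
}
```

Linked into the application ahead of the MPI library, a shim like this runs on every MPI_Send call at run time. It can only see when the API call was entered and when it returned, plus the arguments (such as the destination rank); it has no view of what the PML, BTL, or network driver do in between.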

> So that I can improve the performance of the system while using
> that ideal
> time.

What I'm piecing together from your e-mails is that you want to use
MPI in conjunction with using the network directly, either through
the BTLs or some other communication library (i.e., both MPI and your
other communication library will share the same network resources)
and you're trying to find out when MPI is not using a particular
network resource so that you can use it with your other communication
library in order to maximize utilization and minimize contention /
congestion. Is that correct?

> Well I/o used is Lustre (its ROMIO).

Note that ROMIO uses a fair bit of MPI sending and receiving before
using the underlying file system. So you'll have at least 2 layers
of software to explore to find out when the network is free/busy.

> What I mean by I/O node is nodes that does input and ouput
> processing i.e
> they write to lustre and compute node just transfer data to i/o
> node to
> write it in Lustre.Compute node does not have memory at all.So when
> ever
> they have something to write it gets transfered to I/o node.
> and then I/o node does read and write.

Ok. I'm guessing/assuming that this is multiplexing that is either
done in ROMIO or in Lustre itself.

> So when MPi_send is not issued the the network(Infiniband
> interconnect)
> can be used for some other transfer.

Makes sense.

> Can anyone help me wih how to go abt tracing this at run time?

The BTL plugin that you will be concerned with is the "openib" BTL
(in the Open MPI source tree: ompi/mca/btl/openib/*) -- assuming that
you are using an OpenFabrics/OFED-based network driver on your nodes
(if you're using an older mvapi-based network driver, you'll use the
mvapi BTL: ompi/mca/btl/mvapi/* -- but I would not recommend this
because all current and future effort for InfiniBand in OMPI is being
done with OFED/the openib BTL).

Be warned: IB networks are highly flexible and therefore the API for
it is fairly complex. The native API for OFED-based IB verbs is in a
library called "libibverbs" -- man pages for the particular verbs
function calls will be included in the next OFED release (OFED v1.2),
so you probably don't have them loaded on your cluster. I've
attached a tarball of the man pages for you. You'll need these man
pages to understand what the openib BTL is doing.

If what you want to do is figure out when OMPI's openib BTL is not
using the network, you need to a) understand the BTL interface (and
to some extent, how the "ob1" PML uses it), b) understand at least
generally how the InfiniBand verbs API (functions that begin with
ibv_*()) works, and c) understand how the openib BTL works.

To that end, I'd suggest the following:

a) Understand the BTL interface: see ompi/mca/btl/btl.h for the BTL
plugin interface and at least some comments about how it is used.
Also see the slides from the OMPI Developer's Workshop (especially
Wednesday, the day where point-to-point communications were covered):

b) Read up on the IBV function call man pages. Understand a few
major concepts before starting:

- Using the IB network requires the use of "registered" memory,
meaning that any messages sent or received across the IB network must
use the special "registered" memory. OMPI dedicates a *lot* of code
to managing registered memory (you'll see references to the rdma
mpool [memory pool] component in the OMPI trunk -- it's slightly
different on the v1.2 series branch)

- Most IB actions are asynchronous. So when we send a message, we
simply create a work queue entry (WQE) and put it on a work queue
(WQ). The IB NIC takes over from there and progresses the send.
When the send has completed (local completion only; does not imply
that the receiver has even started to receive), an entry will appear
on the completion queue (CQ) telling OMPI that it is done and the
message buffer can now be re-used/deallocated/whatever.

- Note that registering and de-registering memory is synchronous and
can be fairly expensive (i.e., slow).

- All IB communication is done through queue pairs (QPs): a send
queue and a receive queue. QPs are analogous to sockets -- you open
a QP between two processes. You then send to that peer by creating a
WQE for the send queue and putting it on the WQ. The NIC will then
progress the send buffer and when local completion occurs, will put
an entry on the CQ.

- There is no such thing as an unexpected message in IB -- you *must*
pre-post buffers to receive messages. Hence, OMPI posts a bunch of
buffers during init to receive messages via receive queues in QPs.
These buffers are used when you use "send / receive" semantics for IB
message passing.

- Additionally, IB networks can utilize RDMA for message passing --
meaning that you don't send messages into pre-posted receive
buffers, but rather give an address to send the message directly to
in the peer process. This is called "RDMA semantics" for IB message
passing (as opposed to "send / receive semantics"). There is some
additional cost to this form of message passing because you have to
exchange the target address from the receiver to the sender.

- OMPI makes QPs lazily. That is, there is a bunch of code dedicated
to coordinating and creating QPs when the first MPI_SEND is exchanged
between a pair of MPI peer processes. Specifically, if you have an
MPI process that calls MPI_INIT and MPI_FINALIZE (and no MPI_SEND/
MPI_RECV functions), we won't make any QPs between MPI processes.

- The openib BTL generally does the following:
   - For short messages, RDMA is used for a limited set of peer
processes (because RDMA consumes registered memory). Specifically,
the first N processes that connect to a given process will be allowed
to use RDMA for short messages.
   - For the N+1st (and beyond) peer that connects, send/receive
semantics are used.
   - A complex protocol is used for long messages. It is described
in this paper:

- Open MPI also employs the PERUSE statistics-gathering profiler,
which may be helpful to you. See this paper for details:

- This paper also describes some scalability issues with IB
(particularly with consuming registered memory and something called
shared receive queues [SRQ]):
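
To illustrate the asynchronous WQE / WQ / CQ flow described in the bullets above, here is a toy model in C. It does not use libibverbs at all: the real calls are ibv_post_send() and ibv_poll_cq(), and every name below (struct toy_endpoint, toy_post_send, toy_nic_progress, toy_poll_cq, toy_is_idle) is a hypothetical stand-in. For simplicity the completion queue is kept inside the same struct as the send queue. The in_flight counter is one possible way, under these assumptions, to track whether any locally posted sends are still outstanding.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8

/* Toy work-queue entry: an id plus a buffer. Real WQEs carry far more. */
struct toy_wqe { int id; const void *buf; size_t len; };

/* Fixed-size FIFO, used for both the send work queue and the CQ. */
struct toy_fifo { struct toy_wqe slot[QUEUE_DEPTH]; size_t head, count; };

static int fifo_push(struct toy_fifo *q, struct toy_wqe e) {
    if (q->count == QUEUE_DEPTH)
        return -1;
    q->slot[(q->head + q->count) % QUEUE_DEPTH] = e;
    q->count++;
    return 0;
}

static int fifo_pop(struct toy_fifo *q, struct toy_wqe *out) {
    if (q->count == 0)
        return -1;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return 0;
}

struct toy_endpoint {
    struct toy_fifo send_wq;  /* posted, not yet handled by the "NIC" */
    struct toy_fifo cq;       /* locally completed, not yet polled */
    int in_flight;            /* posted but completion not yet polled */
};

/* Post a send: returns immediately, nothing has been transmitted yet. */
int toy_post_send(struct toy_endpoint *ep, int id, const void *buf,
                  size_t len) {
    struct toy_wqe e = { id, buf, len };
    if (fifo_push(&ep->send_wq, e) != 0)
        return -1;                       /* work queue full */
    ep->in_flight++;
    return 0;
}

/* Stand-in for the NIC: completes one WQE and reports it on the CQ.
 * Returns 1 if a WQE was processed, 0 otherwise. */
int toy_nic_progress(struct toy_endpoint *ep) {
    struct toy_wqe e;
    if (ep->cq.count == QUEUE_DEPTH)
        return 0;                        /* CQ full, leave WQE queued */
    if (fifo_pop(&ep->send_wq, &e) != 0)
        return 0;                        /* nothing posted */
    fifo_push(&ep->cq, e);
    return 1;
}

/* Poll the CQ: returns the id of a locally completed send, or -1 if the
 * CQ is empty. Only after this may the buffer be reused. */
int toy_poll_cq(struct toy_endpoint *ep) {
    struct toy_wqe e;
    if (fifo_pop(&ep->cq, &e) != 0)
        return -1;
    ep->in_flight--;
    return e.id;
}

/* True when every posted send has had its local completion polled. */
int toy_is_idle(const struct toy_endpoint *ep) {
    return ep->in_flight == 0;
}
```

In this model toy_is_idle() only reflects local completion as seen by the posting process. As noted earlier, that says nothing about whether the receiver has the data, and the real NIC and driver make their own decisions about when bytes actually go out on the wire.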

Hope this is helpful to you.

Jeff Squyres
Cisco Systems