George Bosilca wrote:
> MPI does not impose any global order on the messages. The only
> requirement is that between two peers on the same communicator the
> messages (or at least the part required for the matching) are
> delivered in order. This makes both execution traces you sent with
> your original email (shared memory and TCP) valid from the MPI
> standard's perspective.
> Moreover, MPI doesn't impose any order in the matching when
> ANY_SOURCE is used. In Open MPI we do the matching _ALWAYS_ starting
> from rank 0 to n in the specified communicator. BEWARE: The remainder
> of this paragraph is deep black magic of an MPI implementation's
> internals. The main difference between the behavior of SM and TCP
> here directly reflects their eager size, 4K for SM and 64K for TCP.
> Therefore, for your example, for TCP all your messages are eager
> messages (i.e. are completely transferred to the destination process
> in just one go), while for SM they all require a rendezvous. This
> directly impacts the ordering of the messages on the receiver, and
> therefore the order of the matching. However, I have to insist on
> this: this behavior is correct based on the MPI standard specifications.
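To put George's eager-size point in concrete terms, here is a minimal sketch of the decision. The helper name choose_protocol is made up for illustration, the limits are just the 4K and 64K figures quoted above (real values depend on the Open MPI version and configuration), and treating the limit as inclusive is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Eager limits as quoted above; illustrative values only. */
enum { SM_EAGER_LIMIT = 4 * 1024, TCP_EAGER_LIMIT = 64 * 1024 };

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Hypothetical helper, not an Open MPI function: a message that fits
 * within the eager limit is sent in one go, a larger one must first
 * exchange a header and an acknowledgement (rendezvous). */
protocol_t choose_protocol(size_t msg_bytes, size_t eager_limit)
{
    assert(eager_limit > 0);
    return msg_bytes <= eager_limit ? PROTO_EAGER : PROTO_RENDEZVOUS;
}
```

Any message size between the two limits, say a hypothetical 16K, comes out as rendezvous over SM and eager over TCP, which is the situation George describes.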
I'm going to try a technical explanation of what's going on inside OMPI
and then offer some words of advice to Mark.
First, the technical explanation. As George says, what's going on is
legal. The "servers" all queue up sends to the "compositor". These are
long, rendezvous sends (at least when they're on-node). So, none of
these sends completes. The compositor looks for an incoming message.
It gets the header of the message and sends back an acknowledgement
that the rest of the message can be sent. The "server" gets the
acknowledgement and starts sending more of the message. The compositor,
in order to get to the remainder of the message, keeps draining all the
other stuff servers are sending it. Once the first message is
completely received, the compositor looks for the next message to
process and happens to pick up the first server again. It won't go to
anyone else until server 1 is exhausted. Legal, but from Mark's point
of view not desirable. The compositor is busy all the time. Mark just
wants it to employ a different order.
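The effect can be shown with a toy model in plain C. This is not Open MPI code; the function names and the pending[] array are invented for the illustration, and the only thing borrowed from the discussion above is that the search for the next message always restarts at the lowest rank.

```c
#include <assert.h>

/* Toy model: pending[i] is the number of messages server i still has
 * queued. The scan always restarts at rank 0, so the lowest-ranked
 * server with pending work is picked every time. */
int pick_lowest_rank(const int *pending, int nservers)
{
    assert(pending != 0 && nservers > 0);
    for (int i = 0; i < nservers; i++)
        if (pending[i] > 0)
            return i;
    return -1; /* nothing left to receive */
}

/* Drain all queues and record the order in which servers are serviced.
 * Returns the number of entries written to order[]. */
int drain_in_rank_order(int *pending, int nservers, int *order, int max_order)
{
    int n = 0, s;
    while (n < max_order && (s = pick_lowest_rank(pending, nservers)) >= 0) {
        pending[s]--;
        order[n++] = s;
    }
    return n;
}
```

With three servers holding two messages each, the recorded order is 0, 0, 1, 1, 2, 2: every receive is legal, but the higher-ranked servers wait until the lower ones are exhausted.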
The receives are "serialized". Of course they must be since the
receiver is a single process. But Mark's performance issue is that the
servers aren't being serviced equally. So, they back up while one server
unfairly gets all the attention.
Mark, your test code has a set of buffers it cycles through on each
server. Could you do something similar on the compositor side? Have a
set of resources for each server. If you want the compositor to service
all servers equally/fairly, you're going to have to prescribe this
behavior in your MPI code. The MPI implementation can't be relied on to
do this for you.
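One way to prescribe the order yourself is a round-robin pick over per-server receives. The sketch below shows only the selection logic in plain C; the helper name and the ready[] array are hypothetical. I'm assuming the compositor keeps one pre-posted nonblocking receive (MPI_Irecv) per server and fills ready[] from something like MPI_Test or MPI_Testsome, which is left out here.

```c
#include <assert.h>

/* Hypothetical helper: ready[i] is nonzero when the receive posted for
 * server i has completed. The search starts just after the server that
 * was serviced last, so no server is revisited before the others have
 * had a turn. Pass last = -1 on the first call. */
int next_server_round_robin(const int *ready, int nservers, int last)
{
    assert(ready != 0 && nservers > 0);
    assert(last >= -1 && last < nservers);
    for (int step = 1; step <= nservers; step++) {
        int i = (last + step) % nservers;
        if (ready[i])
            return i;
    }
    return -1; /* no receive has completed yet */
}
```

After servicing the server this returns, the compositor would repost that server's receive into its own buffer and pass the returned index as last on the next call.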
If this doesn't make sense, let me know and I'll try to sketch it out