Sorry about the premature send...
The basic mechanics of this are similar to the problem with the portals
BTL that I fixed. However, in my case, the problem manifested itself
with the Intel test MPI_Send_Fairness_c (and MPI_Isend_Fairness_c) at
60 processes (the limit that MTT imposes on the Intel tests).
The original code followed the portals design document for MPI pretty
well. When the receiver is overwhelmed, a "reject" entry is used to
handle the excess messages. One of the features of this "reject" entry
is that the receiver (at the BTL level) never interacts with the actual
message. The problem was that the sender did not recognize the return
ACK from portals [in mca_btl_portals_component_progress()] as a failure.
So, the sender did not resend a message that the receiver was expecting.
While I fixed it in the trunk, I had to disable mca_btl_portals_sendi()
because there is a potential for this function to be used with a 0-byte
portals message payload. For this particular test,
https://svn.open-mpi.org/trac/ompi/ticket/1791, we would not have
seen a failure because the root process would not know that it had
missed a message and the non-root processes would not have diagnosed
a need to resend. As corrected, the root process still is FD&H and
the non-root processes will keep re-transmitting until success.
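To make the resend behavior concrete, here is a minimal sketch in Python of a sender that treats a rejected ACK as a failure and retransmits until the receiver accepts. It is a toy model only: Receiver, deliver(), drain(), and send_with_retry() are hypothetical names, not functions from the portals BTL, and the "reject" entry is reduced to a simple capacity check.

```python
from collections import deque

# Toy model only: none of these names exist in the portals BTL.
class Receiver:
    def __init__(self, capacity):
        self.capacity = capacity   # slots available for incoming messages
        self.accepted = deque()    # messages the receiver actually holds

    def deliver(self, msg):
        """Return True (successful ACK) if accepted, False if rejected."""
        if len(self.accepted) < self.capacity:
            self.accepted.append(msg)
            return True
        return False               # stands in for the "reject" entry

    def drain(self, n=1):
        """Consume up to n messages, freeing slots."""
        for _ in range(min(n, len(self.accepted))):
            self.accepted.popleft()


def send_with_retry(receiver, msg, max_attempts=100):
    """Resend until the ACK reports success; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if receiver.deliver(msg):
            return attempt
        # Assumption: the receiver makes some progress between attempts.
        receiver.drain()
    raise RuntimeError("receiver never accepted the message")
```

The bug described above corresponds to ignoring the False return from deliver(): the sender would believe the message had arrived while the receiver still waited for it.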
Sorry for boring you about portals. In the sm case, the non-root
processes continually are appending to FIFOs. Obviously, these blasters
can append to the FIFOs much more quickly than the processes can remove
S7 --> S0
S6 --> S1
S5 --> S2
S4 --> S3
In the first cycle, everyone is busy. In the second cycle, S7, S6, S5,
and S4 are ready for the next reduction, but S3, S2, S1, and S0 still
are on the hook, meaning that the latter FIFOs are going to grow at
a faster rate:
S3 --> S0
S2 --> S1
Now, S3 and S2 are ready for the next reduction, but S0 and S1 still have
work left in the current reduction:
S1 --> S0
Since S0 (the root process) takes a little time to finish processing the
reduction, it is going to be a little behind S1. So we end up with the
If sm used a system of ACKs as in portals, we would know when we
are overloading the root process. However, since it does not, and the
reduction itself is non-blocking, we have the potential to exhaust memory.
I guess that the real question is whether the reduction should be blocking
or whether we expect the user to protect himself.
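The pairing above and the resulting imbalance can be sketched in Python. This is a rough model under loud assumptions: the process count is a power of two, every sender issues its sends for each reduction without waiting, and each receiver removes at most drain_per_reduction messages per reduction. The function names are made up for illustration and are not taken from the sm BTL or the collectives code.

```python
from collections import Counter

def reduction_rounds(nprocs):
    """Rounds of (sender, receiver) pairs; nprocs assumed a power of two."""
    rounds = []
    active = nprocs
    while active > 1:
        half = active // 2
        # The upper half sends to the mirrored rank in the lower half.
        rounds.append([(active - 1 - i, i) for i in range(half)])
        active = half
    return rounds


def backlog_after(nprocs, reductions, drain_per_reduction):
    """FIFO depth per receiving rank after back-to-back reductions."""
    # Messages each rank receives during one reduction.
    inbound = Counter(dst for rnd in reduction_rounds(nprocs) for _, dst in rnd)
    backlog = Counter()
    for _ in range(reductions):
        for rank, count in inbound.items():
            # No ACKs: senders never learn that the receiver is behind.
            backlog[rank] = max(0, backlog[rank] + count - drain_per_reduction)
    return dict(backlog)
```

For 8 processes, reduction_rounds(8) reproduces the three rounds listed above, and the root receives more messages per reduction than any other rank, so its FIFO is the first to grow whenever the drain rate falls short.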
From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Eugene Loh
Sent: Friday, February 13, 2009 11:42 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Eliminate ompi/class/ompi_[circular_buffer_]fifo.h
George Bosilca wrote:
> I can't confirm or deny. The only thing I can tell is that the same
> test works fine over other BTLs, so this tends either to pinpoint a
> problem in the sm BTL or in a particular path in the PML (the one
> used by the sm BTL). I'll have to dig a little bit more into it, but
> I was hoping to do it in the context of the new sm BTL (just to avoid
> having to do it twice).
Okay. I'll try to get "single queue" put back soon and might look at
1791 along the way.
But here is what I wonder. Let's say you have one-way traffic -- either
rank A sending rank B messages without ever any traffic in the other
direction, or repeated MPI_Reduce operations always with the same root
-- and the senders somehow get well ahead of the receiver. Say, A wants
to pump 1,000,000 messages over and B is busy doing something else.
What should happen? What should the PML and BTL do? The conditions
could range from B not being in MPI at all to B listening to the BTL
without yet having the posted receives to match. Should the connection
become congested and force the sender to wait -- and if so, is this in
the BTL or PML? Or, should B keep on queueing up the unexpected messages?
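The two alternatives can be compared with a small Python sketch. It is only an illustration: BoundedFifo and pump() are hypothetical names, the sender attempts one push per step, and the receiver pops one message every receiver_every steps. A small capacity models a congested connection that makes the sender wait; a very large capacity models B queueing unexpected messages without limit.

```python
from collections import deque

# Hypothetical model; not the sm BTL FIFO implementation.
class BoundedFifo:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_push(self, msg):
        """Return False instead of growing once the FIFO is full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(msg)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None


def pump(fifo, nmessages, receiver_every):
    """A pushes nmessages; B pops one message every receiver_every steps.

    Returns (stalls, peak): how often A had to wait, and the deepest
    the FIFO ever got.
    """
    sent = stalls = steps = peak = 0
    while sent < nmessages:
        steps += 1
        if fifo.try_push(sent):
            sent += 1
        else:
            stalls += 1            # congestion: the sender waits and retries
        peak = max(peak, len(fifo.items))
        if steps % receiver_every == 0:
            fifo.pop()
    return stalls, peak
```

With a small capacity the peak depth stays bounded and the cost shows up as sender stalls; with an effectively unlimited capacity there are no stalls and the depth grows with the number of messages sent.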
After some basic "single queue" putbacks, I'll try to look at the code
and understand what the PML is doing in cases like this.