It really does not matter what one does with the sm sends that can't be
posted to the FIFO, as long as they are posted at some later time. The
current implementation does not rely on the ordering that memory
provides, but generates a sequence number and uses this in the matching,
just like any other btl. So one does not need to preserve the sending order,
as one would if one avoided sequence numbers and had to rely on
memory ordering to satisfy MPI matching rules.
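
To illustrate why the posting order need not be preserved, here is a
minimal sketch of sequence-number matching on the receive side. It is not
the actual PML or btl code: peer_rx_t, rx_arrive() and WINDOW are
hypothetical names made up for this example, and the sketch is
single-threaded with a small fixed reordering window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8  /* hypothetical: how far ahead of order a fragment may arrive */

/* Hypothetical per-peer receive state (not an Open MPI structure). */
typedef struct {
    uint16_t next_expected;  /* next sequence number eligible for matching */
    bool     held[WINDOW];   /* held[s % WINDOW]: fragment s arrived early */
} peer_rx_t;

/* Record the arrival of fragment `seq`. Returns how many fragments became
 * eligible for matching, in sequence order, as a result. Duplicates and
 * fragments that were already matched are dropped and return 0. */
static int rx_arrive(peer_rx_t *rx, uint16_t seq)
{
    /* Distance from the next expected number, with 16-bit wraparound. */
    uint16_t ahead = (uint16_t)(seq - rx->next_expected);

    if (ahead >= 0x8000u) {
        return 0;            /* behind next_expected: already matched */
    }
    assert(ahead < WINDOW);  /* sketch only handles a bounded window */
    if (rx->held[seq % WINDOW]) {
        return 0;            /* same fragment is already being held */
    }
    rx->held[seq % WINDOW] = true;

    /* Release every fragment that is now contiguous with next_expected. */
    int released = 0;
    while (rx->held[rx->next_expected % WINDOW]) {
        rx->held[rx->next_expected % WINDOW] = false;
        rx->next_expected++;
        released++;
    }
    return released;
}
```

With this kind of scheme, fragments 2, 1, 0 arriving in that order are
still matched as 0, 1, 2, so a send that sits in a pending list for a
while does not violate the MPI matching rules.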
On 2/25/09 10:36 AM, "Eugene Loh" <Eugene.Loh_at_[hidden]> wrote:
> George Bosilca wrote:
>> On Feb 24, 2009, at 18:08 , Eugene Loh wrote:
>>> (Probably this message only for George, but I'll toss it out to the
> Actually, maybe Rich should weigh in here, too. This relates to the
> overflow mechanism in MCA_BTL_SM_FIFO_WRITE.
>>> I have a question about the sm sendi() function. What should happen
>>> if the sendi() function attempts to write to the FIFO, but the FIFO
>>> is full?
>> The write should not be queued except in the case where the whole
>> data referred to by the convertor is copied out of the user memory.
> And this is indeed the case. The data-convertor copy completed
>> If the FIFO is full, the best will be to allocate the descriptor and
>> give it back to the PML.
> Why? The data has been copied out of the user's buffer. The pointer to
> that data has been queued for sending. (It hasn't been queued in the
> FIFO, which is full, but it has been queued in the pending-send list.)
> The FIFO has an overflow mechanism. Actually, prior to my recent
> putbacks, it had two overflow mechanisms. One was to grow the FIFO, and
> the other was to use the pending-send queue. While adding support for
> multiple senders per FIFO and at Rich's suggestion, I pulled out the
> ability to grow the FIFO. (Some number of folks didn't even believe
> that the FIFO-grow stuff even existed or was enabled or worked
> properly.) That still leaves the pending sends. So, the "out of
> resource" return code from the FIFO write is kind of spurious. The FIFO
> write is returning that code even though it has accepted the write and
> queued it up.
>>> Currently, it appears that the sendi() function returns an error
>>> code to the PML, which assumes that the sendi() tried to send the
>>> message but failed, and so just tries to allocate a descriptor.
>> Yes, this is the expected behavior.
>>> But is that what should happen? The condition of the FIFO being
>>> full is a little misleading since the write is still queued for
>>> further progress -- not in the FIFO itself but in the pending-send
>>> queue. This distinction should perhaps not matter to the upper
>>> layers. The upper layers should still view the send as "completed"
>>> (buffered by the MPI implementation to be progressed later). I
>>> would think that the sendi() function should return a SUCCESS code.
>> If the write is queued then this is more or less a bug. We will
>> nicely cope with this case, because we have this sequence number and
>> we will drop the duplicate message, but we will end up sending the same
>> message twice. The problem is that I don't know which of the copies
>> will be used on the receiver side, I guess the first one reaching the
> Arrgh! When the primary mechanism (FIFO) starts getting congested, we
> start pumping duplicate messages into the system?
> The proper fix (IMHO) is to have the sendi function return a SUCCESS
> code once it's written the message and the pointer to the message. And,
> once it's written those two things, it seems to me to be a bug to return
> any other code.
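
A rough sketch of the behavior proposed above: the send returns success
as soon as the fragment is owned by the transport, whether it landed in
the FIFO or in the pending-send list, and a later progress call drains
the pending list. This is not MCA_BTL_SM_FIFO_WRITE or the real sendi();
every name below (sketch_fifo_t, sketch_send(), sketch_progress(), the
SKETCH_* codes, the sizes) is hypothetical, and the real sm FIFO lives in
shared memory with the synchronization that implies, which is omitted
here.

```c
#include <stddef.h>

#define FIFO_SLOTS  4   /* hypothetical FIFO capacity */
#define PENDING_MAX 16  /* hypothetical pending-send capacity */

enum { SKETCH_SUCCESS = 0, SKETCH_ERR_OUT_OF_RESOURCE = -1 };

/* Hypothetical single-threaded stand-in for a FIFO plus pending list. */
typedef struct {
    void  *slots[FIFO_SLOTS];
    size_t head, count;
    void  *pending[PENDING_MAX];
    size_t pending_head, pending_count;
} sketch_fifo_t;

/* Try to post one fragment pointer to the FIFO; fails if the FIFO is full. */
static int fifo_try_write(sketch_fifo_t *f, void *frag)
{
    if (f->count == FIFO_SLOTS) {
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    f->slots[(f->head + f->count) % FIFO_SLOTS] = frag;
    f->count++;
    return SKETCH_SUCCESS;
}

/* Receiver side: pop the oldest fragment pointer, or NULL if empty. */
static void *fifo_read(sketch_fifo_t *f)
{
    if (f->count == 0) {
        return NULL;
    }
    void *frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return frag;
}

/* Send: success means the fragment was accepted, either by the FIFO or by
 * the pending list. An error is returned only if neither accepted it, so
 * the caller never has a reason to send an accepted fragment again. */
static int sketch_send(sketch_fifo_t *f, void *frag)
{
    if (fifo_try_write(f, frag) == SKETCH_SUCCESS) {
        return SKETCH_SUCCESS;
    }
    if (f->pending_count == PENDING_MAX) {
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    f->pending[(f->pending_head + f->pending_count) % PENDING_MAX] = frag;
    f->pending_count++;
    return SKETCH_SUCCESS;
}

/* Progress: move pending fragments into the FIFO while there is room.
 * Returns the number of fragments posted. */
static size_t sketch_progress(sketch_fifo_t *f)
{
    size_t posted = 0;
    while (f->pending_count > 0 &&
           fifo_try_write(f, f->pending[f->pending_head]) == SKETCH_SUCCESS) {
        f->pending_head = (f->pending_head + 1) % PENDING_MAX;
        f->pending_count--;
        posted++;
    }
    return posted;
}
```

The point of the sketch is only the return-code contract: a full FIFO is
not reported upward as long as the pending list took the fragment.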
>>> Relevant source code is
>>> PML, line 496
>>> BTL, line 785
>>> FIFO write, line 18