On Feb 25, 2009, at 10:36 , Eugene Loh wrote:
> George Bosilca wrote:
>> On Feb 24, 2009, at 18:08 , Eugene Loh wrote:
>>> (Probably this message only for George, but I'll toss it out to
>>> the alias/archive.)
> Actually, maybe Rich should weigh in here, too. This relates to the
> overflow mechanism in MCA_BTL_SM_FIFO_WRITE.
>>> I have a question about the sm sendi() function. What should
>>> happen if the sendi() function attempts to write to the FIFO, but
>>> the FIFO is full?
>> The write should not be queued except in the case where all of the
>> data referred to by the convertor has been copied out of user memory.
> And this is indeed the case. The data-convertor copy completed
>> If the FIFO is full, the best will be to allocate the descriptor
>> and give it back to the PML.
> Why? The data has been copied out of the user's buffer. The
> pointer to that data has been queued for sending. (It hasn't been
> queued in the FIFO, which is full, but it has been queued in the
> pending-send list.)
As I previously stated, if the data is copied out of the user buffer,
the sendi should always return success. However, having a queue in the
BTL only duplicates the queue from the PML.
> The FIFO has an overflow mechanism. Actually, prior to my recent
> putbacks, it had two overflow mechanisms. One was to grow the FIFO,
> and the other was to use the pending-send queue. While adding
> support for multiple senders per FIFO and at Rich's suggestion, I
> pulled out the ability to grow the FIFO. (Some number of folks
> didn't even believe that the FIFO-grow stuff even existed or was
> enabled or worked properly.) That still leaves the pending sends.
> So, the "out of resource" return code from the FIFO write is kind of
> spurious. The FIFO write is returning that code even though it has
> accepted the write and queued it up.
>>> Currently, it appears that the sendi() function returns an error
>>> code to the PML, which assumes that the sendi() tried to send the
>>> message but failed and so just tried to allocate a descriptor.
>> Yes, this is the expected behavior.
>>> But is that what should happen? The condition of the FIFO being
>>> full is a little misleading since the write is still queued for
>>> further progress -- not in the FIFO itself but in the pending-
>>> send queue. This distinction should perhaps not matter to the
>>> upper layers. The upper layers should still view the send as
>>> "completed" (buffered by the MPI implementation to be progressed
>>> later). I would think that the sendi() function should return a
>>> SUCCESS code.
>> If the write is queued then this is more or less a bug. We will
>> cope with this case, because we have the sequence number and we
>> will drop the duplicate message, but we will end up sending the
>> same message twice. The problem is that I don't know which of the
>> copies will be used on the receiver side; I guess the first one
>> reaching the receiver.
> Arrgh! When the primary mechanism (FIFO) starts getting congested,
> we start pumping duplicate messages into the system?
If the BTL queues the send internally and returns an error, then the
PML will:
- go back in the mca_pml_ob1_send_request_start with the error set to
- continue over the list of available eager BTLs and try to send the
same message again.
- if no more BTLs are available, add the request to the pending
queue and reschedule it later.
So the answer is yes, if a BTL returns an error while adding the data
in its own queues, then we will duplicate the send operation.
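
To make the duplication concrete, here is a minimal toy model of the
sequence above. It is only a sketch: every name in it (toy_btl_sendi,
toy_pml_start, the TOY_* constants) is hypothetical and none of it is
taken from the actual ob1 or sm sources. It assumes a single BTL, so
the PML falls straight through to its own pending queue on error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes and sizes, for illustration only. */
enum { TOY_SUCCESS = 0, TOY_ERR_OUT_OF_RESOURCE = -1 };
#define TOY_FIFO_SLOTS 2
#define TOY_MAX_MSGS   16

/* Toy BTL: a bounded FIFO plus an overflow (pending-send) list. */
typedef struct {
    int fifo[TOY_FIFO_SLOTS];
    size_t fifo_len;
    int pending[TOY_MAX_MSGS];
    size_t pending_len;
} toy_btl_t;

/* Toy PML: its own queue of requests it believes were not sent. */
typedef struct {
    int pending[TOY_MAX_MSGS];
    size_t pending_len;
} toy_pml_t;

/* Models the behaviour under discussion: when the FIFO is full the
 * message is still queued on the BTL's pending list, yet an error
 * code is returned to the caller. */
static int toy_btl_sendi(toy_btl_t *btl, int seq)
{
    if (btl->fifo_len < TOY_FIFO_SLOTS) {
        btl->fifo[btl->fifo_len++] = seq;
        return TOY_SUCCESS;
    }
    assert(btl->pending_len < TOY_MAX_MSGS);
    btl->pending[btl->pending_len++] = seq;  /* accepted ...           */
    return TOY_ERR_OUT_OF_RESOURCE;          /* ... but reported failed */
}

/* Models the PML start path with a single BTL: on error there is no
 * other BTL to try, so the request goes on the PML pending queue. */
static void toy_pml_start(toy_pml_t *pml, toy_btl_t *btl, int seq)
{
    if (toy_btl_sendi(btl, seq) != TOY_SUCCESS) {
        assert(pml->pending_len < TOY_MAX_MSGS);
        pml->pending[pml->pending_len++] = seq;
    }
}

/* Counts the queued copies of message `seq` across both layers. */
static size_t toy_copies(const toy_pml_t *pml, const toy_btl_t *btl,
                         int seq)
{
    size_t n = 0, i;
    for (i = 0; i < btl->fifo_len; i++)    n += (btl->fifo[i] == seq);
    for (i = 0; i < btl->pending_len; i++) n += (btl->pending[i] == seq);
    for (i = 0; i < pml->pending_len; i++) n += (pml->pending[i] == seq);
    return n;
}
```

With a two-slot FIFO, starting three sends leaves the third message
queued in both the BTL pending list and the PML pending queue, so
toy_copies() reports two copies of it and one copy of each earlier
message.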
> The proper fix (IMHO) is to have the sendi function return a SUCCESS
> code once it's written the message and the pointer to the message.
> And, once it's written those two things, it seems to me to be a bug
> to return any other code.
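
The proposal quoted above could look roughly like the sketch below.
It is a toy model with hypothetical names (sk_btl_sendi, the SK_*
constants), not the sm BTL code, and it assumes a bounded pending
list purely so that a genuine failure case exists. The point it
illustrates is the return-code contract: success whenever the BTL has
taken ownership of the message, an error only when nothing was queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes and sizes, for illustration only. */
enum { SK_SUCCESS = 0, SK_ERR_OUT_OF_RESOURCE = -1 };
#define SK_FIFO_SLOTS    2
#define SK_PENDING_SLOTS 4

/* Toy BTL: a bounded FIFO plus a bounded pending-send list. */
typedef struct {
    int fifo[SK_FIFO_SLOTS];
    size_t fifo_len;
    int pending[SK_PENDING_SLOTS];
    size_t pending_len;
} sk_btl_t;

/* Returns SK_SUCCESS whenever the message has been queued, whether it
 * landed in the FIFO or on the pending-send list. An error is returned
 * only when nothing was queued, so a retry by the caller cannot
 * produce a duplicate. */
static int sk_btl_sendi(sk_btl_t *btl, int seq)
{
    if (btl->fifo_len < SK_FIFO_SLOTS) {
        btl->fifo[btl->fifo_len++] = seq;
        return SK_SUCCESS;
    }
    if (btl->pending_len < SK_PENDING_SLOTS) {
        btl->pending[btl->pending_len++] = seq;
        return SK_SUCCESS;
    }
    return SK_ERR_OUT_OF_RESOURCE;
}

/* Total number of messages the BTL currently owns. */
static size_t sk_btl_queued(const sk_btl_t *btl)
{
    return btl->fifo_len + btl->pending_len;
}
```

Under this contract the caller requeues a message only when the BTL
holds no copy of it, which is the property the error path in the PML
relies on.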
>>> Relevant source code is
>>> PML, line 496
>>> BTL, line 785
>>> FIFO write, line 18
> devel mailing list