I can't confirm or deny. All I can tell is that the same
test works fine over the other BTLs, so this tends to point to a
problem either in the sm BTL or in a particular path in the PML (the one used
by the sm BTL). I'll have to dig a little bit more into it, but I was
hoping to do it in the context of the new sm BTL (just to avoid having
to do it twice).
On Feb 13, 2009, at 08:05 , Jeff Squyres wrote:
> George -- can you confirm/deny? Is this something we need to fix
> for v1.3.1?
> On Feb 12, 2009, at 10:15 PM, Eugene Loh wrote:
>> Got it, thanks.
>> Is anyone else looking at that ticket? I'm still a newbie and I
>> suspect someone else could figure this problem out a lot faster
>> than I could. So, I'm curious how much I should be looking at this.
>> If amateurs are allowed to speculate, however, my guess is that
>> this isn't really a BTL thing. It reminds me of trac ticket 1468
>> (aka 1516). In that case, there was a lot of one-way traffic. We
>> needed a way to return frags to the sender. I guess that was solved.
>> So, the present problem is something different. My guess is that
>> senders are overrunning receivers. Could it be that some receiver
>> (like the root in the MPI_Reduce) ends up with too many incoming
>> messages? It has to queue up unexpected messages, which slows it
>> down further, which means it has to deal with even more unexpected
>> messages, etc. Those messages have to be placed somewhere, which
>> means memory is allocated, etc.?
>> Just a theory. I don't know the PML well enough to judge its
>> But if this is the case, it's a PML issue rather than a BTL issue.
>> Maybe there should be some flow control -- particularly in our
>> implementation of collectives!
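
To make the overrun theory above concrete, here is a toy sketch in Python. It is not Open MPI code and does not model the PML. The function `simulate`, its parameters, and the per-sender credit window are all hypothetical, chosen only for illustration. It compares how deep the root's unexpected-message queue gets when senders push freely versus when each sender is limited to a fixed number of outstanding messages. It deliberately ignores the feedback effect where a long queue slows the receiver down even more.

```python
from collections import deque


def simulate(num_senders, steps, drain_per_step, credits_per_sender=None):
    """Return the peak length of the root's unexpected-message queue.

    credits_per_sender=None models eager sends with no flow control.
    An integer models a hypothetical credit window per sender.
    """
    queue = deque()  # messages waiting at the root, tagged by sender rank
    credits = [credits_per_sender] * num_senders
    peak = 0
    for _ in range(steps):
        # Every sender tries to send one message per step.
        for rank in range(num_senders):
            if credits_per_sender is None:
                queue.append(rank)
            elif credits[rank] > 0:
                credits[rank] -= 1
                queue.append(rank)
        peak = max(peak, len(queue))
        # The root drains a fixed number of messages per step and
        # returns one credit to the sender of each drained message.
        for _ in range(min(drain_per_step, len(queue))):
            rank = queue.popleft()
            if credits_per_sender is not None:
                credits[rank] += 1
    return peak


if __name__ == "__main__":
    # Assumed toy numbers: 8 senders, root drains 4 messages per step.
    print("no flow control:", simulate(8, 100, 4))
    print("2 credits each: ", simulate(8, 100, 4, credits_per_sender=2))
```

Without credits, the queue grows for as long as senders outpace the root. With credits, it can never exceed num_senders * credits_per_sender, since no sender has more than its window outstanding.
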
>> Ralph Castain wrote:
>>> The connection is only that, if you are going to modify the sm BTL
>>> as you say, you might at least want to be aware that we have a
>>> problem in it so you (a) don't make it worse than it already is,
>>> and (b) might keep an eye open for the problem as you are
>>> changing things.
>>> On Feb 12, 2009, at 3:58 PM, Eugene Loh wrote:
>>>> Sorry, what's the connection? Are we talking about https://svn.open-mpi.org/trac/ompi/ticket/1791
>>>> ? Are you simply saying that if I'm doing some sm BTL work, I
>>>> should also look at 1791? I'm trying to figure out if there's
>>>> some more specific connection I'm missing.
>>>> Ralph Castain wrote:
>>>>> You might want to look at ticket #1791 while you are doing this
>>>>> - Brad added some valuable data earlier today.
>> devel mailing list
> Jeff Squyres
> Cisco Systems