
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] matching code rewrite in OB1
From: Richard Graham (rlgraham_at_[hidden])
Date: 2007-12-11 16:14:13

I will reiterate my concern. The code that is there now is mostly nine
years old (with some mods made when it was brought over to Open MPI). It
took about two months of testing on systems with 5-13-way network parallelism
to track down all KNOWN race conditions. This code is at the center of MPI
correctness, so I am VERY concerned about changing it without some very strong
reasons. Not opposed, just very cautious.


On 12/11/07 11:47 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:

> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> that the reordering happens at the sender. I believe it was due to the
>> QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
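The sender-side effect Andrew describes can be illustrated with a toy simulation. This is hypothetical code, not the UD BTL itself; `num_qps` stands in for the define he mentions. Each QP delivers its own frags in order, but nothing orders frags across QPs, so a receiver draining the QPs can see an interleaved sequence:

```python
# Toy model of QP striping (hypothetical code, not the UD BTL).
# Frags are striped round-robin across several QPs; each QP keeps
# its own frags in order, but nothing orders frags across QPs.

def stripe(frags, num_qps):
    """Assign frags to QPs round-robin, like the striping trick."""
    queues = [[] for _ in range(num_qps)]
    for i, frag in enumerate(frags):
        queues[i % num_qps].append(frag)
    return queues

def arrivals(queues):
    """One legal network schedule: the QPs drain one after another."""
    order = []
    for q in queues:
        order.extend(q)
    return order

frags = list(range(8))             # frag sequence numbers 0..7
print(arrivals(stripe(frags, 1)))  # one QP: [0, 1, 2, 3, 4, 5, 6, 7]
print(arrivals(stripe(frags, 4)))  # four QPs: [0, 4, 1, 5, 2, 6, 3, 7]
```

With a single QP the arrival order matches the send order, which is consistent with the observation that backing down to one QP lowers the reordering rate.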
> Ah, OK. My assumption was just from looking at the code, so I may be
> wrong.
>> Also I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers. The reason I
>> do many completions at once is that it is a big help in turning around
>> receive buffers, making it harder to run out of buffers and drop frags.
>> I want to say there was some performance benefit as well but I can't
>> say for sure.
> Currently upper layers of Open MPI may call BTL progress function
> recursively. I hope this will change some day.
>> Andrew
>> Gleb Natapov wrote:
>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>> Try UD, frags are reordered at a very high rate so should be a good test.
>>> Good idea, I'll try this. BTW I think the reason for such a high rate of
>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>> (500) and processes them one by one, and if the progress function is
>>> called recursively, the next 500 completions will be reordered relative
>>> to the previous completions (reordering happens on the receiver, not
>>> the sender).
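The interaction Gleb describes can be sketched with a hypothetical progress loop (the real constant is MCA_BTL_UD_NUM_WC = 500; a small stand-in keeps the trace short). If handling one completion re-enters progress, the next batch is delivered before the remainder of the current batch:

```python
# Hypothetical progress loop, not the UD BTL code.  NUM_WC stands in
# for MCA_BTL_UD_NUM_WC (500 in the BTL).
NUM_WC = 3

def progress(pending, delivered, state):
    """Poll one batch of completions and handle them one by one."""
    batch = pending[:NUM_WC]
    del pending[:NUM_WC]
    for frag in batch:
        # An upper layer re-enters progress while we are mid-batch:
        if state["reenter"] and pending:
            state["reenter"] = False
            progress(pending, delivered, state)
        delivered.append(frag)

def run(nfrags):
    pending, delivered = list(range(nfrags)), []
    state = {"reenter": True}   # re-enter exactly once, for illustration
    while pending:
        progress(pending, delivered, state)
    return delivered

print(run(6))   # -> [3, 4, 5, 0, 1, 2]: the second batch overtakes the first
```

Every frag is still delivered exactly once; only the order changes, which is exactly the receiver-side reordering the matching code has to tolerate.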
>>>> Andrew
>>>> Richard Graham wrote:
>>>>> Gleb,
>>>>> I would suggest that before this is checked in, it be tested on a
>>>>> system that has N-way network parallelism, where N is as large as you
>>>>> can find. This is a key bit of code for MPI correctness, and
>>>>> out-of-order operations will break it, so you want to maximize the
>>>>> chance of seeing such operations.
>>>>> Rich
>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>>> Hi,
>>>>>> I did a rewrite of the matching code in OB1. I made it much simpler and
>>>>>> half the size (which is good: less code, fewer bugs). I also got rid
>>>>>> of the huge macros - very helpful if you need to debug something. There
>>>>>> is no performance degradation; actually I even see a very small
>>>>>> performance improvement. I ran MTT with this patch and the result is
>>>>>> the same as on the trunk. I would like to commit this to the trunk.
>>>>>> The patch is attached for everybody to try.
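For readers following the thread, the invariant the matching code must preserve can be sketched generically. This is not Gleb's patch or the OB1 data structures, just the textbook technique: a per-peer expected sequence number plus a cache of frags that arrived early, so matching always proceeds in sequence order regardless of arrival order.

```python
# Generic sequence-number matching, NOT the OB1 patch itself: a per-peer
# expected sequence number plus a cache of frags that arrived early.
class Matcher:
    def __init__(self):
        self.next_seq = {}   # peer -> next expected sequence number
        self.early = {}      # peer -> {seq: frag} that arrived out of order

    def arrive(self, peer, seq, frag):
        """Return the frags that become matchable, in sequence order."""
        expected = self.next_seq.setdefault(peer, 0)
        cache = self.early.setdefault(peer, {})
        if seq != expected:
            cache[seq] = frag        # too early: park it until the gap fills
            return []
        matched = [frag]
        expected += 1
        while expected in cache:     # drain frags that are now in order
            matched.append(cache.pop(expected))
            expected += 1
        self.next_seq[peer] = expected
        return matched

m = Matcher()
print(m.arrive("peer0", 1, "b"))   # out of order: held back -> []
print(m.arrive("peer0", 0, "a"))   # fills the gap -> ['a', 'b']
```

The race conditions Rich alludes to live around this structure: concurrent arrivals and recursive progress must never match a frag twice or out of sequence, which is why heavy testing on highly parallel networks matters.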
>>>>>> --
>>>>>> Gleb.
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>> --
>>> Gleb.
> --
> Gleb.