
Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] matching code rewrite in OB1
From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-12-12 15:51:55


On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
> This is better than nothing, but really not very helpful for looking at
> the specific issues that can arise here, unless these systems have
> several parallel networks, with tests that generate a lot of parallel
> network traffic and can self-check for out-of-order receives - i.e.,
> the sequence needs to be encoded into the payload for verification
> purposes. There are some out-of-order scenarios that need to be
> generated and checked. I think that George may have a system that will
> be good for this sort of testing.
>
I am running various tests with multiple networks right now. I use
several IB BTLs and the TCP BTL simultaneously. I see many reordered
messages and all tests have passed so far, but as far as I know they
don't encode a message sequence number in the payload. I'll change one
of them to do so.
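
For concreteness, the self-check I have in mind is roughly the following
(a sketch only, not the actual test code; PAYLOAD_BYTES and NUM_MSGS are
arbitrary values I made up). Rank 0 stamps each payload with a running
sequence number and rank 1 verifies it on receipt; since MPI guarantees
point-to-point ordering for a given source, tag and communicator, any
mismatch points at a matching bug:

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAYLOAD_BYTES 1024   /* arbitrary message size for the sketch */
#define NUM_MSGS      10000  /* arbitrary number of messages */

int main(int argc, char **argv)
{
    char buf[PAYLOAD_BYTES];
    uint32_t seq, expected = 0;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < NUM_MSGS; i++) {
        if (rank == 0) {
            /* stamp the first 4 bytes with a running sequence number */
            seq = (uint32_t)i;
            memcpy(buf, &seq, sizeof(seq));
            MPI_Send(buf, PAYLOAD_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, PAYLOAD_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            memcpy(&seq, buf, sizeof(seq));
            /* MPI orders messages per (source, tag, communicator), so a
             * mismatch here means matching failed to restore the order
             * the network scrambled */
            if (seq != expected)
                fprintf(stderr, "out of order: got %u, expected %u\n",
                        seq, expected);
            expected = seq + 1;
        }
    }

    MPI_Finalize();
    return 0;
}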

> Rich
>
>
> On 12/12/07 3:20 PM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>
> > On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> >> Gleb --
> >>
> >> How about making a tarball with this patch in it that can be thrown at
> >> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> > I don't have access to www.open-mpi.org, but I can send you the patch.
> > I can send you a tarball too, but I prefer not to abuse email.
> >
> >>
> >>
> >> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> >>
> >>> I will re-iterate my concern. The code that is there now is mostly
> >>> nine years old (with some mods made when it was brought over to
> >>> Open MPI). It took about 2 months of testing on systems with 5-13
> >>> way network parallelism to track down all KNOWN race conditions.
> >>> This code is at the center of MPI correctness, so I am VERY
> >>> concerned about changing it w/o some very strong reasons. Not
> >>> opposed, just very cautious.
> >>>
> >>> Rich
> >>>
> >>>
> >>> On 12/11/07 11:47 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
> >>>
> >>>> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> >>>>> Possibly, though I have results from a benchmark I've written
> >>>>> indicating the reordering happens at the sender. I believe it was
> >>>>> due to the QP striping trick I use to get more bandwidth -- if you
> >>>>> back down to one QP (there's a define in the code you can change),
> >>>>> the reordering rate drops.
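
For readers who don't know the striping trick: sends are rotated across
several QPs, and nothing orders one QP against another on the wire. A
minimal sketch of the idea with invented names -- this is not the actual
ofud code, and NUM_QPS merely stands in for the define Andrew mentions:

#include <infiniband/verbs.h>

#define NUM_QPS 4   /* stand-in for the define Andrew mentions */

struct stripe_ctx {
    struct ibv_qp *qps[NUM_QPS];  /* several QPs to the same peer */
    int next;                     /* next QP to use */
};

/* Rotate sends round-robin across the QPs.  Each QP preserves its own
 * posting order, but nothing orders one QP against another, so
 * fragments can arrive at the receiver out of order. */
static int post_striped(struct stripe_ctx *ctx, struct ibv_send_wr *wr)
{
    struct ibv_send_wr *bad_wr;
    struct ibv_qp *qp = ctx->qps[ctx->next];

    ctx->next = (ctx->next + 1) % NUM_QPS;
    return ibv_post_send(qp, wr, &bad_wr);
}
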
> >>>> Ah, OK. My assumption was just from looking at the code, so I may
> >>>> be wrong.
> >>>>
> >>>>>
> >>>>> Also I do not make any recursive calls to progress -- at least
> >>>>> not directly in the BTL; I can't speak for the upper layers. The
> >>>>> reason I do many completions at once is that it is a big help in
> >>>>> turning around receive buffers, making it harder to run out of
> >>>>> buffers and drop frags. I want to say there was some performance
> >>>>> benefit as well but I can't say for sure.
> >>>> Currently the upper layers of Open MPI may call the BTL progress
> >>>> function recursively. I hope this will change some day.
> >>>>
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>>> Gleb Natapov wrote:
> >>>>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >>>>>>> Try UD; frags are reordered at a very high rate, so it should
> >>>>>>> be a good test.
> >>>>>> Good idea, I'll try this. BTW I think the reason for such a high
> >>>>>> rate of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC
> >>>>>> completions (500) and processes them one by one; if the progress
> >>>>>> function is called recursively, the next 500 completions will be
> >>>>>> reordered relative to the previous ones (reordering happens on
> >>>>>> the receiver, not the sender).
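
In code, the pattern I mean looks roughly like this -- an illustration
with invented names, not the actual UD BTL source; NUM_WC and
handle_completion() stand in for the real define and frag handler:

#include <infiniband/verbs.h>

#define NUM_WC 500   /* stand-in for MCA_BTL_UD_NUM_WC */

extern void handle_completion(struct ibv_wc *wc);  /* real frag handler */

void btl_progress(struct ibv_cq *cq)
{
    struct ibv_wc wc[NUM_WC];
    int i, n = ibv_poll_cq(cq, NUM_WC, wc);  /* drain up to 500 at once */

    for (i = 0; i < n; i++) {
        /* handle_completion() can call back up into the PML, which can
         * re-enter btl_progress(); the recursive call then drains and
         * processes a newer batch before wc[i+1 .. n-1], so later
         * arrivals overtake earlier ones on the receive side */
        handle_completion(&wc[i]);
    }
}
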
> >>>>>>
> >>>>>>> Andrew
> >>>>>>>
> >>>>>>> Richard Graham wrote:
> >>>>>>>> Gleb,
> >>>>>>>>   I would suggest that before this is checked in, it be tested
> >>>>>>>> on a system that has N-way network parallelism, where N is as
> >>>>>>>> large as you can find. This is a key bit of code for MPI
> >>>>>>>> correctness, and out-of-order operations will break it, so you
> >>>>>>>> want to maximize the chance for such operations.
> >>>>>>>>
> >>>>>>>> Rich
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I did a rewrite of the matching code in OB1. I made it much
> >>>>>>>>> simpler and half the size (which is good: less code, fewer
> >>>>>>>>> bugs). I also got rid of the huge macros - very helpful if
> >>>>>>>>> you need to debug something. There is no performance
> >>>>>>>>> degradation; actually I even see a very small performance
> >>>>>>>>> improvement. I ran MTT with this patch and the result is the
> >>>>>>>>> same as on the trunk. I would like to commit this to the
> >>>>>>>>> trunk. The patch is attached for everybody to try.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Gleb.
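
For anyone who has not looked at this part of OB1: the heart of the
matching path is per-peer sequence matching, with a list to hold
fragments that arrive early. A condensed sketch of the idea -- my own
illustration, not the patch itself; the names are invented and it
ignores sequence-number wraparound, which the real code must handle:

#include <stdint.h>

struct frag {
    uint16_t seq;               /* sender-assigned sequence number */
    struct frag *next;
};

struct peer {
    uint16_t next_seq;          /* sequence number we expect next */
    struct frag *deferred;      /* early arrivals, sorted by seq */
};

extern void match_one(struct peer *p, struct frag *f);  /* real matching */

/* insert an early fragment into the sorted deferred list */
static void park(struct peer *p, struct frag *f)
{
    struct frag **pp = &p->deferred;
    while (*pp != NULL && (*pp)->seq < f->seq)
        pp = &(*pp)->next;
    f->next = *pp;
    *pp = f;
}

/* pop the head of the deferred list if it is the one we expect */
static struct frag *take_if_ready(struct peer *p)
{
    struct frag *f = p->deferred;
    if (f == NULL || f->seq != p->next_seq)
        return NULL;
    p->deferred = f->next;
    return f;
}

void handle_incoming(struct peer *p, struct frag *f)
{
    if (f->seq != p->next_seq) {    /* arrived early: defer it */
        park(p, f);
        return;
    }
    match_one(p, f);                /* in order: match right away */
    p->next_seq++;
    /* deferred fragments may now be in order; drain them */
    while ((f = take_if_ready(p)) != NULL) {
        match_one(p, f);
        p->next_seq++;
    }
}
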
> >>>>>>
> >>>>>> --
> >>>>>> Gleb.
> >>>>
> >>>> --
> >>>> Gleb.
> >>
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
> >
> > --
> > Gleb.

--
			Gleb.