Open MPI Development Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-08-29 10:47:27


On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
> Gleb,
> Are you looking at this?
Not today. And I need the code to reproduce the bug. Is this possible?

>
> Rich
>
>
> On 8/29/07 9:56 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>
> > On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
> >> Is this trunk or 1.2?
> > Oops. I should read more carefully :) This is trunk.
> >
> >>
> >> On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
> >>> I have a program that does a simple bucket brigade of sends and receives
> >>> where rank 0 is the start and repeatedly sends to rank 1 until a certain
> >>> amount of time has passed and then it sends an all-done packet.
> >>>
> >>> Running this under np=2 always works. However, when I run with np greater
> >>> than 2 using only the SM BTL, the program usually hangs, and one of the
> >>> processes has a long stack that repeats the following 3 calls many times:
> >>>
> >>> [25] opal_progress(), line 187 in "opal_progress.c"
> >>> [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
> >>> [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
> >>>
> >>> When stepping through the ompi_fifo_write_to_head routine it looks like
> >>> the fifo has overflowed.
> >>>
> >>> I am wondering if what is happening is that rank 0 has sent a bunch of
> >>> messages that have exhausted the resources, such that one of the middle
> >>> ranks, which is in the process of sending, cannot send and therefore
> >>> never gets to the point of trying to receive the messages from rank 0?
> >>>
> >>> Is the above a possible scenario or are messages periodically bled off
> >>> the SM BTL's fifos?
> >>>
> >>> Note, I have seen np=3 pass sometimes and I can get it to pass reliably
> >>> if I raise the shared memory space used by the BTL. This is using the
> >>> trunk.
> >>>
> >>>
> >>> --td
> >>>
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> devel_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >> --
> >> Gleb.
> >
> > --
> > Gleb.

--
			Gleb.
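
The scenario Terry asks about (rank 0 exhausting shared resources so that a middle rank stalls in its send and never gets back to receiving) can be illustrated with a toy model. The sketch below is not Open MPI code and does not reproduce the actual SM BTL fifo logic. It assumes a single shared pool of fragments, that a send takes one fragment and a receive returns one, and that a rank waiting to send does not drain its inbox. The name `simulate` and all of its parameters are hypothetical, and the scheduling is a fixed round-robin that favors rank 0, so it cannot show the intermittent np=3 passes mentioned in the report.

```python
from collections import deque


def simulate(nranks, pool_size, n_messages, max_steps=10_000):
    """Toy bucket brigade: rank 0 -> rank 1 -> ... -> last rank.

    Returns True if every message reaches the last rank, False if the
    model stops making progress (a deadlock in this simplified setting).
    """
    free = pool_size                          # fragments left in the shared pool
    inbox = [deque() for _ in range(nranks)]  # per-rank receive queue
    to_send = n_messages                      # messages rank 0 still has to send
    pending = [False] * nranks                # rank holds a message it must forward
    delivered = 0

    for _ in range(max_steps):
        progressed = False

        # Rank 0 sends eagerly whenever a fragment is available.
        if to_send > 0 and free > 0:
            free -= 1
            inbox[1].append(1)
            to_send -= 1
            progressed = True

        for r in range(1, nranks):
            is_last = (r == nranks - 1)
            if pending[r]:
                # Assumption: a rank waiting to send does not receive.
                if free > 0:
                    free -= 1
                    inbox[r + 1].append(1)
                    pending[r] = False
                    progressed = True
            elif inbox[r]:
                inbox[r].popleft()            # receiving frees the fragment
                free += 1
                progressed = True
                if is_last:
                    delivered += 1
                else:
                    pending[r] = True

        if delivered == n_messages:
            return True
        if not progressed:
            return False
    return False


if __name__ == "__main__":
    print("np=2, small pool:", simulate(2, 2, 10))
    print("np=3, small pool:", simulate(3, 2, 10))
    print("np=3, large pool:", simulate(3, 16, 10))
```

Under these assumptions the model matches the pattern in the report: two ranks always finish, three ranks stall once rank 0 has refilled the pool while rank 1 waits to forward, and a larger pool lets the same run complete.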