
Open MPI Development Mailing List Archives


From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-10-23 10:19:07


On Tue, Oct 23, 2007 at 09:40:45AM -0400, Shipman, Galen M. wrote:
> So this problem goes WAY back..
>
> The problem here is that the PML marks MPI completion just prior to calling
> btl_send and then returns to the user. This wouldn't be a problem if the BTL
> then did something, but in the case of OpenIB this fragment may not actually
> be on the wire (the joys of user level flow control).
>
> One solution that we proposed was to allow btl_send to return either
> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to
> not mark MPI completion of the fragment and then MPI_WAITALL and others will
> do their job properly.
I even implemented this once, but there is a problem. Currently we mark the
request as completed at the MPI level and then call btl_send(). Whenever the
IB completion happens, the request is marked as complete at the PML level and
freed. The fix requires changing the order: call btl_send(), check the return
value from the BTL, and mark the request complete as necessary. The problem is
that, because we allow the BTL to call opal_progress() internally, the request
may already be completed at both the MPI and PML levels and freed before the
call to btl_send() returns.
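
To make the ordering concrete, here is a minimal sketch of what the send path
could look like after such a change. OMPI_NOT_ON_WIRE is the return code Galen
proposed above and does not exist today; mark_mpi_complete() and the variable
names are placeholders rather than the real ob1 symbols, and the btl_send()
call follows the usual (btl, endpoint, descriptor, tag) interface:

    /* Sketch only: OMPI_NOT_ON_WIRE and mark_mpi_complete() are placeholders. */
    int rc = btl->btl_send(btl, endpoint, descriptor, tag);

    if (OMPI_SUCCESS == rc) {
        /* The fragment really went out: it is now safe to signal
         * MPI-level completion to the user. */
        mark_mpi_complete(sendreq);
    } else if (OMPI_NOT_ON_WIRE == rc) {
        /* Queued by user-level flow control: defer MPI completion until
         * the BTL's send-completion callback fires.  This is only safe if
         * btl_send() is guaranteed not to re-enter opal_progress(),
         * otherwise sendreq may already be completed and freed here. */
    } else {
        /* Genuine error: hand off to the normal PML error path. */
    }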

I did a code review to see how hard it would be to get rid of recursion
in Open MPI, and I think it is doable. We have to disallow calling
progress() (or any other function that may call progress() internally) from
the BTL and from ULP callbacks that are called by the BTL. There are not
many places that break this rule. The main offenders are calls to
FREE_LIST_WAIT(), but a free list that may grow without limit never actually
calls progress, and that is the most common use of FREE_LIST_WAIT(), so those
calls can safely be changed to FREE_LIST_GET(). Once the recursion problem is
solved, the fix for the original problem will be a couple of lines of code.
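
As an illustration, the change at a typical offending call site would look
roughly like this; the OMPI_FREE_LIST_WAIT()/OMPI_FREE_LIST_GET() macro names
and their (list, item, rc) signature are assumed from the free list code of
that era, and frag_list stands in for whichever unbounded list is involved:

    ompi_free_list_item_t *item;
    int rc;

    /* Before: WAIT may loop in opal_progress() until an item is returned,
     * which is exactly the re-entrancy we want to forbid inside the BTL. */
    OMPI_FREE_LIST_WAIT(&frag_list, item, rc);

    /* After: for a list with no upper bound, GET simply grows the list and
     * returns without ever calling progress.  The caller now has to handle
     * rc != OMPI_SUCCESS instead of assuming the wait always succeeds. */
    OMPI_FREE_LIST_GET(&frag_list, item, rc);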

>
> - Galen
>
>
>
> On 10/11/07 11:26 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>
> > On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
> >> David --
> >>
> >> Gleb and I just actively re-looked at this problem yesterday; we
> >> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/1015.
> >> We previously thought this ticket was a different problem, but
> >> our analysis yesterday shows that it could be a real problem in the
> >> openib BTL or ob1 PML (kinda think it's the openib btl because it
> >> doesn't seem to happen on other networks, but who knows...).
> >>
> >> Gleb is investigating.
> > Here is the result of the investigation. The problem is different from
> > ticket #1015. What we have here is one rank calling isend() of a small
> > message and wait_all() in a loop while another one calls irecv(). The
> > problem is that isend() usually doesn't call opal_progress() anywhere,
> > and wait_all() doesn't call progress if all requests are already completed,
> > so messages are never progressed. We can force opal_progress() to be called
> > by setting btl_openib_free_list_max to 1000; then wait_all() will call
> > progress because not every request will be immediately completed by OB1. Or
> > we can limit the number of uncompleted requests that OB1 can allocate by
> > setting pml_ob1_free_list_max to 1000; then opal_progress() will be called
> > from free_list_wait() when the max is reached. The second option works much
> > faster for me.
> >
> >>
> >>
> >>
> >> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
> >>
> >>> Hi Folks,
> >>>
> >>> I have been seeing some nasty behaviour in collectives,
> >>> particularly bcast and reduce. Attached is a reproducer (for bcast).
> >>>
> >>> The code will rapidly slow to a crawl (usually interpreted as a
> >>> hang in real applications) and sometimes gets killed with sigbus or
> >>> sigterm.
> >>>
> >>> I see this with
> >>>
> >>> openmpi-1.2.3 or openmpi-1.2.4
> >>> ofed 1.2
> >>> linux 2.6.19 + patches
> >>> gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
> >>> 4 socket, dual core opterons
> >>>
> >>> run as
> >>>
> >>> mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
> >>>
> >>> To my now uneducated eye it looks as if the root process is rushing
> >>> ahead and not progressing earlier bcasts.
> >>>
> >>> Anyone else seeing similar? Any ideas for workarounds?
> >>>
> >>> As a point of reference, mvapich2 0.9.8 works fine.
> >>>
> >>> Thanks, David
> >>>
> >>>
> >>> <bcast-hang.c>
> >>
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
> >>
> >
> > --
> > Gleb.
>
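
In the meantime, the workaround from the investigation quoted above can be
applied straight from the command line. Something like the following should do
it (untested as written; it reuses David's invocation with the pml_ob1 free
list limit that worked faster for me):

    mpirun --mca btl self,openib --mca pml_ob1_free_list_max 1000 \
           --npernode 1 --np 4 bcast-hang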

--
			Gleb.