Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fwd: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-06-09 09:19:55


Tested -- seems to work for me. I say we now let MTT sort it out
(i.e., see if others hit this race condition) and apply to v1.3.

On Jun 9, 2009, at 4:46 AM, Ralph Castain wrote:

> I don't think it would be very hard - I would have to create a patch
> for it, but the fix is completely contained in one file and location.
>
> I would like to have someone else test it, though, before we move it
> across. It worked for me, but since it is a race condition, that isn't
> entirely convincing.
>
>
> On Jun 9, 2009, at 5:41 AM, Jeff Squyres wrote:
>
> > I'd be in favor of bringing this to v1.3. Are there other
> > dependencies / would it be difficult?
> >
> >
> > Begin forwarded message:
> >
> >> From: "Open MPI" <bugs_at_[hidden]>
> >> Date: June 8, 2009 11:31:20 AM PDT
> >> Cc: <bugs_at_[hidden]>
> >> Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails
> >> after ~120 spawns
> >>
> >> #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
> >> -----------------------+----------------------------------------------------
> >> Reporter: jsquyres     | Owner: rhc
> >> Type: defect           | Status: closed
> >> Priority: critical     | Milestone: Open MPI 1.3.4
> >> Version: 1.3 branch    | Resolution: fixed
> >> Keywords:              |
> >> -----------------------+----------------------------------------------------
> >> Changes (by rhc):
> >>
> >> * status: new => closed
> >> * resolution: => fixed
> >>
> >>
> >> Comment:
> >>
> >> This was due to a very tight loop on comm_spawn not giving enough
> >> time for the prior proc to completely terminate (and thus free its
> >> file descriptors) before the next proc was launched. Eventually, we
> >> built up a backlog of terminations to process and ran out of fd's.
> >>
> >> We introduced a check-and-delay in the code that detects we don't
> >> have enough fd's to launch another proc, and then waits a second to
> >> see if enough become free before aborting.
> >>
> >> Fixed in trunk - can see if we want to bring it to 1.3.
> >>
> >> --
> >> Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/1927#comment:3>
> >> Open MPI <http://www.open-mpi.org/>
> >>
> >>
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems
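
The check-and-delay described in the ticket comment quoted above could look roughly like the following. This is only a sketch under assumptions, not the code committed to trunk: the names fds_available, wait_for_fds, fd_probe_fn and FD_SCAN_CAP are hypothetical, and the way free descriptors are counted (scanning with fcntl up to the soft RLIMIT_NOFILE limit) is one plausible approach rather than what Open MPI actually does.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical cap so the scan stays cheap when the soft limit is huge. */
#define FD_SCAN_CAP 65536L

/* Hypothetical callback type: number of free descriptors, or -1 on error. */
typedef long (*fd_probe_fn)(void);

/* Count descriptors that are still free below the soft RLIMIT_NOFILE limit
 * (or below FD_SCAN_CAP, whichever is smaller). */
static long fds_available(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return -1;
    }
    long limit = (lim.rlim_cur > (rlim_t)FD_SCAN_CAP) ? FD_SCAN_CAP
                                                      : (long)lim.rlim_cur;
    long in_use = 0;
    for (long fd = 0; fd < limit; fd++) {
        /* F_GETFD fails (returns -1) when fd is not an open descriptor. */
        if (fcntl((int)fd, F_GETFD) != -1) {
            in_use++;
        }
    }
    return limit - in_use;
}

/* Check-and-delay: if fewer than `needed` descriptors are free, wait
 * `delay_sec` seconds once so pending terminations can release theirs,
 * then check again. Returns false if there are still too few. */
static bool wait_for_fds(long needed, unsigned delay_sec, fd_probe_fn probe)
{
    assert(needed >= 0 && probe != NULL);
    if (probe() >= needed) {
        return true;
    }
    sleep(delay_sec);
    return probe() >= needed;
}
```

A launcher would call something like wait_for_fds(needed, 1, fds_available) before starting the next proc and abort the launch only if it returns false. Passing the probe as a function pointer is a choice made for this sketch so the retry logic can be exercised separately from the descriptor counting.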