Open MPI Development Mailing List Archives

Subject: [OMPI devel] Fwd: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-06-09 07:41:29


I'd be in favor of bringing this to v1.3. Are there other
dependencies / would it be difficult?

Begin forwarded message:

> From: "Open MPI" <bugs_at_[hidden]>
> Date: June 8, 2009 11:31:20 AM PDT
> Cc: <bugs_at_[hidden]>
> Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
>
> #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
> ----------------------+----------------------------------------------------
> Reporter: jsquyres    | Owner: rhc
> Type: defect          | Status: closed
> Priority: critical    | Milestone: Open MPI 1.3.4
> Version: 1.3 branch   | Resolution: fixed
> Keywords:             |
> ----------------------+----------------------------------------------------
> Changes (by rhc):
>
> * status: new => closed
> * resolution: => fixed
>
>
> Comment:
>
> This was due to a very tight loop on comm_spawn not giving enough time for
> the prior proc to completely terminate (and thus free its file
> descriptors) before the next proc was launched. Eventually, we built up a
> backlog of terminations to process and ran out of fd's.
>
> We introduced a check-and-delay in the code that detects we don't have
> enough fd's to launch another proc, and then waits a second to see if
> enough become free before aborting.
>
> Fixed in trunk - can see if we want to bring it to 1.3.
>
> --
> Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/1927#comment:3>
> Open MPI <http://www.open-mpi.org/>
>
>
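
For readers unfamiliar with the check-and-delay idea described in the ticket comment, here is a minimal sketch in plain POSIX C. It is not the code committed to the trunk: the function names have_free_fds and wait_for_fds, the retry count, and the approach of probing descriptors with fcntl(F_GETFD) are all assumptions made for illustration. It checks whether enough descriptors are free and, if not, pauses and checks again before giving up.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: true if at least `needed` descriptor slots are
 * unused below the soft RLIMIT_NOFILE limit. */
static bool have_free_fds(int needed)
{
    struct rlimit lim;

    if (needed <= 0) {
        return true;
    }
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return false;
    }

    rlim_t max = lim.rlim_cur;
    if (max == RLIM_INFINITY || max > (rlim_t)INT_MAX) {
        max = INT_MAX;
    }

    int free_count = 0;
    for (int fd = 0; (rlim_t)fd < max; fd++) {
        /* EBADF from F_GETFD means this slot is not an open descriptor. */
        if (fcntl(fd, F_GETFD) == -1 && errno == EBADF) {
            if (++free_count >= needed) {
                return true;    /* stop scanning as soon as we have enough */
            }
        }
    }
    return false;
}

/* Hypothetical check-and-delay: re-check up to `max_retries` times,
 * sleeping `delay_sec` seconds between checks. Returns false if the
 * caller should abort the launch. */
static bool wait_for_fds(int needed, int max_retries, unsigned delay_sec)
{
    assert(needed >= 0 && max_retries >= 0);

    for (int attempt = 0; ; attempt++) {
        if (have_free_fds(needed)) {
            return true;
        }
        if (attempt >= max_retries) {
            return false;
        }
        sleep(delay_sec);   /* give earlier procs time to release their fds */
    }
}
```

Under these assumptions, a launcher would call something like wait_for_fds(n, 1, 1) before each spawn, which matches the "wait a second, then abort" behavior in the comment; how many descriptors a single launch actually needs is not stated in the ticket.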

-- 
Jeff Squyres
Cisco Systems