
Open MPI Development Mailing List Archives


From: Matt Leininger (mlleinin_at_[hidden])
Date: 2006-11-27 19:17:48


On Mon, 2006-11-27 at 15:57 -0800, Mark A. Grondona wrote:
> > On Mon, 2006-11-27 at 16:29 -0700, Brian W Barrett wrote:
> > > On Nov 27, 2006, at 4:19 PM, Matt Leininger wrote:
> > >
> > > > I've been running more tests of OpenMPI v1.2b. I've run into several
> > > > cases where the app+MPI use too much memory and the OOM handler kills
> > > > off tasks. Sometimes the ompi mpirun shuts down gracefully, but other
> > > > times the OOM handler may kill off 1 to 4 MPI tasks per node (when I'm
> > > > using 8 MPI tasks per node). The remaining MPI tasks keep
> > > > running/polling and have to be killed off by hand. Has anyone seen
> > > > this
> > > > behavior before?
> > >
> > > Are the orteds also getting killed?
> >
> > Not sure. I'll check the next time I see this.
> >
>
> I haven't seen any evidence that orteds are being killed by the Out of Memory
> killer. Only MPI application processes seem to be the chosen victim(s).

  I can confirm this. I'm running a 2-node, 16-MPI-task job. On one
node all 8 MPI tasks were killed, and on the other node only 1 MPI task
was killed. The orteds are still running on each node, but they're not
cleaning up.

  - Matt
>
>
> > >
> > > I'm not really familiar with the OOM killer -- does it cause the
> > > parent of the killed process to get a SIGCHLD? If not, that could be
> > > a fairly serious problem for us, as we rely on SIGCHLDs being
> > > received by the orteds when things die...
> >
> > Mark Grondona could answer this. His reply to devel-core bounced so
> > I'm including devel_at_[hidden] on this thread.
>
>
> No, being killed by the OOM killer should be the same as being sent
> SIGKILL as far as userspace is concerned. SIGCHLD to the parent will still
> be sent (and wait(2) will return, etc.)
>
> mark
>
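
A minimal sketch of the semantics Mark describes, assuming an ordinary
POSIX/Linux system (this is not Open MPI code; the handler and names are
illustrative only): the parent installs a SIGCHLD handler, forks a child,
and sends the child SIGKILL, which is what an OOM kill looks like from
userspace. The handler fires and waitpid(2) still returns, reporting
WIFSIGNALED() with WTERMSIG() == SIGKILL.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld = 0;

/* Only note that the signal arrived; the child is reaped in main(). */
static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;          /* let waitpid() resume after the handler */
    sigaction(SIGCHLD, &sa, NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                    /* child: idle like a polling MPI task */
        for (;;)
            pause();
    }

    sleep(1);                          /* give the child time to start */
    kill(pid, SIGKILL);                /* stand-in for the OOM killer */

    int status;
    pid_t reaped = waitpid(pid, &status, 0);

    printf("waitpid() returned %d, got_sigchld=%d\n", (int)reaped, (int)got_sigchld);
    if (reaped == pid && WIFSIGNALED(status))
        printf("child terminated by signal %d (SIGKILL is %d)\n",
               WTERMSIG(status), SIGKILL);
    return 0;
}

Running this prints the waitpid() return value and shows
WTERMSIG(status) == SIGKILL, matching Mark's point that, as far as the
parent is concerned, an OOM kill is indistinguishable from a plain
"kill -9".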