On Mon, 2006-11-27 at 15:57 -0800, Mark A. Grondona wrote:
> > On Mon, 2006-11-27 at 16:29 -0700, Brian W Barrett wrote:
> > > On Nov 27, 2006, at 4:19 PM, Matt Leininger wrote:
> > >
> > > > I've been running more tests of OpenMPI v1.2b. I've run into several
> > > > cases where the app+MPI use too much memory and the OOM handler kills
> > > > off tasks. Sometimes the ompi mpirun shuts down gracefully, but other
> > > > times the OOM handler may kill off 1 to 4 MPI tasks per node (when I'm
> > > > using 8 MPI tasks per node). The remaining MPI tasks keep
> > > > running/polling and have to be killed off by hand. Has anyone seen
> > > > this
> > > > behavior before?
> > >
> > > Are the orteds also getting killed?
> > Not sure. I'll check the next time I see this.
> I haven't seen any evidence that orteds are being killed by the Out of Memory
> killer. Only MPI application processes seem to be the chosen victim(s).
I can confirm this. I'm running a 2-node, 16 MPI task job. On one
node all 8 MPI tasks were killed, and the other node only had 1 MPI task
killed. The orteds are still running on each node, but it's not
> > >
> > > I'm not really familiar with the OOM killer -- does it cause the
> > > parent of the killed process to get a SIGCHLD? If not, that could be
> > > a fairly serious problem for us, as we rely on SIGCHLDs being
> > > received by the orteds when things die...
> > Mark Grondona could answer this. His reply to devel-core bounced so
> > I'm including devel_at_[hidden] on this thread.
> No, being killed by the OOM killer should be the same as being sent
> SIGKILL as far as userspace is concerned. SIGCHLD to the parent will still
> be sent (and wait(2) will return, etc.)