On Mon, 2006-11-27 at 16:29 -0700, Brian W Barrett wrote:
> On Nov 27, 2006, at 4:19 PM, Matt Leininger wrote:
> > I've been running more tests of Open MPI v1.2b. I've run into several
> > cases where the app+MPI use too much memory and the OOM handler kills
> > off tasks. Sometimes the ompi mpirun shuts down gracefully, but other
> > times the OOM handler may kill off 1 to 4 MPI tasks per node (when I'm
> > using 8 MPI tasks per node). The remaining MPI tasks keep
> > running/polling and have to be killed off by hand. Has anyone seen
> > this behavior before?
> Are the orteds also getting killed?
Not sure. I'll check the next time I see this.
> It's a known problem that if the
> orted is killed by outside forces, everything kind of hangs. We're
> working on this one, and hope to have it fixed by the time 1.2 ships.
That could be the problem.
> I'm not really familiar with the OOM killer -- does it cause the
> parent of the killed process to get a SIGCHLD? If not, that could be
> a fairly serious problem for us, as we rely on SIGCHLDs being
> received by the orteds when things die...
Mark Grondona could answer this. His reply to devel-core bounced so
I'm including devel_at_[hidden] on this thread.
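
In the meantime, the SIGCHLD question can be checked locally with a
minimal sketch like the one below. It assumes the OOM killer ends a task
by sending it SIGKILL, and it only simulates that by killing a forked
child by hand; it does not trigger a real out-of-memory condition, and the
function name is made up for illustration. It is not how the orted is
implemented, just a way to see whether a parent is notified.

```python
import os
import signal
import time


def child_killed_triggers_sigchld():
    """Fork a child, SIGKILL it, and report what the parent observes.

    Returns (saw_sigchld, was_signaled, terminating_signal).
    """
    seen = []
    # Record every SIGCHLD delivered to this (parent) process.
    old_handler = signal.signal(
        signal.SIGCHLD, lambda signum, frame: seen.append(signum)
    )
    try:
        pid = os.fork()
        if pid == 0:
            # Child: idle until killed from outside.
            try:
                while True:
                    time.sleep(1)
            finally:
                os._exit(0)

        # Parent: stand in for the OOM killer (assumption: it uses SIGKILL).
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)

        # Python runs signal handlers between bytecodes, so allow a moment.
        deadline = time.time() + 5
        while not seen and time.time() < deadline:
            time.sleep(0.01)
    finally:
        signal.signal(signal.SIGCHLD, old_handler)

    return bool(seen), os.WIFSIGNALED(status), os.WTERMSIG(status)


if __name__ == "__main__":
    saw, signaled, sig = child_killed_triggers_sigchld()
    print("parent saw SIGCHLD:", saw)
    print("child ended by signal:", signaled, "signal number:", sig)
```

If the parent reports a SIGCHLD here, a missing notification on the
cluster would point at the orted itself being killed rather than at the
signal not being raised.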