Thanks to all who replied.
First, I'm running Open MPI 1.4.2.
Second, coredumpsize is unlimited, and indeed I DO get core dumps when
I'm running a single-processor version. Third, the problem isn't
stopping the program (MPI_Abort does that just fine); rather, it's getting
a core dump. According to the man page, MPI_Abort sends a SIGTERM, not a
SIGABRT, so perhaps that's what should happen.
Finally, my guess as to what's happening if I use the libc abort() is that
the other nodes get stuck in an MPI call (I do lots of MPI_Reduce and
MPI_Bcast calls in this code), but this doesn't explain why the node calling
abort() doesn't exit with a core dump.
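
One workaround that might be worth trying, offered only as a rough sketch:
fork a child that calls abort(), so the child dumps core with a copy of the
failing process's memory, and then let the parent shut the job down with
MPI_Abort as before. The helper names raise_core_limit and
dump_core_in_child are made up for illustration, and I have not verified
this under Open MPI 1.4.2. In particular, I believe fork() is not safe with
every interconnect, so treat it as an assumption to test.

```c
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: raise the soft core-size limit to the hard limit.
 * Returns 0 on success, -1 on failure. */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* the soft limit may always go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Hypothetical helper: fork a child that aborts, so that the child (a copy
 * of this process at the time of the fork) is the one that dumps core.
 * Returns 1 if the child was killed by SIGABRT, 0 if it ended some other
 * way, and -1 if fork() or waitpid() failed. */
int dump_core_in_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: restore the default SIGABRT action in case a handler
         * was installed, then abort. No MPI calls are made here. */
        signal(SIGABRT, SIG_DFL);
        abort();
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) ? 1 : 0;
}
```

In the exit routine, the idea would be to call raise_core_limit() and
dump_core_in_child() first, and then MPI_Abort(MPI_COMM_WORLD, 1) so that
mpirun still kills the rest of the job. Whether a core file actually appears
still depends on the system's core settings and on the working directory
being writable.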
David
On Thu, 2010-08-12 at 20:44 -0600, Ralph Castain wrote:
> Sounds very strange - what OMPI version, on what type of machine, and how was it configured?
>
>
> On Aug 12, 2010, at 7:49 PM, David Ronis wrote:
>
> > I've got an MPI program that is supposed to generate a core file if
> > problems arise on any of the nodes. I tried to do this by adding a
> > call to abort() to my exit routines, but this doesn't work; I get no core
> > file, and worse, mpirun doesn't detect that one of my nodes has
> > aborted(?) and doesn't kill off the entire job, except in the trivial
> > case where the number of processors I'm running on is 1. I've replaced
> > abort with MPI_Abort, which kills everything off, but leaves no core
> > file. Any suggestions how I can get one and still have MPI exit?
> >
> > Thanks in advance.
> >
> > David
>