
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Abort
From: David Ronis (David.Ronis_at_[hidden])
Date: 2010-08-16 13:23:50


Hi Jeff,

I've reproduced your test here, with the same results. Moreover, if I
put the nodes with rank>0 into a blocking MPI call (MPI_Bcast or
MPI_Barrier) I still get the same behavior; namely, rank 0's calling
abort() generates a core file and leads to termination, which is the
behavior I want. I'll look at my code a bit more, but the only
difference I see now is that in my code a floating point exception
triggers a signal-handler that calls abort(). I don't see why that
should be different from your test.

Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior. This is with Open MPI 1.4.2 on RHEL5:
>
> ----
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     if (0 == rank) {
>         abort();
>     }
>     printf("Rank %d sleeping...\n", rank);
>     sleep(600);
>     printf("Rank %d finalizing...\n", rank);
>     MPI_Finalize();
>     return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code: (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited on signal 6 (Aborted).
> --------------------------------------------------------------------------
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw------- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi %
> -----
>
> You can see that all processes die immediately, and I get a corefile from the process that called abort().
>
>
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
>
> > I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> > does kill off the entire MPI job. abort() drops core when I'm running
> > on 1 processor, but not in a multiprocessor run. In addition, a node
> > calling abort() doesn't lead to the entire run being killed off.
> >
> > David
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >>
> >>> I'm using mpirun and the nodes are all on the same machin (a 8 cpu box
> >>> with an intel i7). coresize is unlimited:
> >>>
> >>> ulimit -a
> >>> core file size (blocks, -c) unlimited
> >>
> >> That looks good.
> >>
> >> In reviewing the email thread, it's not entirely clear: are you calling abort() or MPI_Abort()? MPI_Abort() won't drop a core file. abort() should.
> >>
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>