Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Abort
From: David Ronis (David.Ronis_at_[hidden])
Date: 2010-08-16 13:23:50


Hi Jeff,

I've reproduced your test here, with the same results. Moreover, if I
put the processes with rank > 0 into a blocking MPI call (MPI_Bcast or
MPI_Barrier), I still get the same behavior: rank 0 calling abort()
generates a core file and terminates the job, which is the behavior I
want. I'll look at my code a bit more, but the only difference I can
see is that in my code a floating-point exception triggers a signal
handler that calls abort(). I don't see why that should behave
differently from your test.
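
For concreteness, here is a bare-bones sketch of that kind of setup
(illustrative only, not my actual code; the handler and the
feenableexcept() call, a glibc extension, are just one way to get from
the FPE to abort()):

#define _GNU_SOURCE
#include <fenv.h>      /* feenableexcept() -- glibc extension */
#include <signal.h>
#include <stdlib.h>

static void fpe_handler(int sig)
{
    (void) sig;
    /* abort() raises SIGABRT; that is what should drop the core. */
    abort();
}

int main(void)
{
    /* Unmask the FP traps so a divide-by-zero delivers SIGFPE. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
    signal(SIGFPE, fpe_handler);

    volatile double zero = 0.0;
    volatile double x = 1.0 / zero;  /* SIGFPE -> fpe_handler -> abort() */
    (void) x;
    return 0;
}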

Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior. This is with Open MPI 1.4.2 on RHEL5:
>
> ----
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     if (0 == rank) {
>         abort();
>     }
>     printf("Rank %d sleeping...\n", rank);
>     sleep(600);
>     printf("Rank %d finalizing...\n", rank);
>     MPI_Finalize();
>     return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code: (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited on signal 6 (Aborted).
> --------------------------------------------------------------------------
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw------- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi %
> -----
>
> You can see that all processes die immediately, and I get a corefile from the process that called abort().
>
>
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
>
> > I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> > does kill off the entire MPI job. abort() drops core when I'm running
> > on 1 processor, but not in a multiprocessor run. In addition, a node
> > calling abort() doesn't lead to the entire run being killed off.
> >
> > David
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >>
> >>> I'm using mpirun and the nodes are all on the same machine (an 8-CPU box
> >>> with an Intel i7). The core file size is unlimited:
> >>>
> >>> ulimit -a
> >>> core file size (blocks, -c) unlimited
> >>
> >> That looks good.
> >>
> >> In reviewing the email thread, it's not entirely clear: are you calling abort() or MPI_Abort()? MPI_Abort() won't drop a core file. abort() should.
> >>
> >
>
>
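
A bare-bones sketch of the distinction discussed above (illustrative
only; the USE_MPI_ABORT switch is just there to compare the two paths):
MPI_Abort() asks the runtime to kill the whole job and won't drop a
core file, while abort() raises SIGABRT in the calling process, which
is what produces the core.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
#ifdef USE_MPI_ABORT
        MPI_Abort(MPI_COMM_WORLD, 1);  /* runtime kills the whole job; no core */
#else
        abort();                       /* SIGABRT in this process; drops the core */
#endif
    }

    MPI_Barrier(MPI_COMM_WORLD);       /* other ranks block until the job is killed */
    MPI_Finalize();
    return 0;
}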