
Open MPI User's Mailing List Archives

From: Sven Stork (stork_at_[hidden])
Date: 2007-08-20 08:37:44


Instead of doing dirty tricks with the library, you could try to register a
cleanup function with atexit.

Thanks,
  Sven

On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
> Dear George,
>
> I think that the best way is to call MPI_Abort. However, this forces the
> user to modify the code, which I already have suggested. But their
> application is not calling exit directly, I merely wrote the simplest code
> that demonstrates the problem. Their application is a Fortran program and
> during file IO, when something bad happens, the fortran runtime (pgi)
> calls exit (and sometimes _exit for some reason). The file IO is only done
> in one process. I have told them to try to add ERR=lineno,END=lineno,
> where the code at lineno calls MPI_Abort. This has not happened yet.
> Nevertheless, Open MPI does not terminate the application when one of the
> processes exits without calling MPI_Finalize, contrary to what the mpirun
> man page says. I have currently "solved" the problem by writing a .so that is
> LD_PRELOAD:ed, checking whether MPI_Finalize is indeed called between
> MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
> If it is indeed so that the mpirun man-page is wrong and the code right,
> I'd rather push the proper error-handling solution.
>
> Best regards
> Daniel Spångberg
>
>
> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosilca_at_[hidden]>
> wrote:
>
> > The MPI standard states that the correct way to abort/kill an MPI
> > application is to call the MPI_Abort function. Unless you're doing
> > some kind of fault tolerance work, there is no reason to end one of
> > your MPI processes via exit.
> >
> > Thanks,
> > george.
> >
> > On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
> >
> >> Dear Open-MPI user list members,
> >>
> >> I currently have a user with an application where one of the
> >> MPI processes dies, but the Open MPI system does not kill the rest of
> >> the application.
> >>
> >> Since the mpirun man page states the following, I would expect it to
> >> take care of killing the application if a process exits without calling
> >> MPI_Finalize:
> >>
> >> Process Termination / Signal Handling
> >>   During the run of an MPI application, if any rank dies abnormally
> >>   (either exiting before invoking MPI_FINALIZE, or dying as the
> >>   result of a signal), mpirun will print out an error message and
> >>   kill the rest of the MPI application.
> >>
> >> The following test program demonstrates the behaviour (the program
> >> hangs until it is killed by the user or batch system):
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <unistd.h>
> >> #include <mpi.h>
> >>
> >> #define RANK_DEATH 1
> >>
> >> int main(int argc, char **argv)
> >> {
> >>     int rank;
> >>     MPI_Init(&argc, &argv);
> >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>
> >>     sleep(10);
> >>     if (rank == RANK_DEATH)
> >>         exit(1);
> >>     sleep(10);
> >>     MPI_Finalize();
> >>     return 0;
> >> }
> >>
> >> I have tested this on Open MPI 1.2.1 as well as the latest stable
> >> 1.2.3. I am on Linux x86_64.
> >>
> >> Is this a bug, or are there some flags I can use to force
> >> mpirun (or orted, or...) to kill the whole MPI program when this happens?
> >>
> >> If one of the application processes dies from a signal (I have
> >> tested SEGV and FPE) rather than just exiting, the whole application
> >> is indeed killed.
> >>
> >> Best regards
> >> Daniel Spångberg
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
>