Open MPI User's Mailing List Archives

From: Daniel Spångberg (daniels_at_[hidden])
Date: 2007-08-20 09:04:48


Dear Sven,

I thought about doing that and experimented a bit as well, but there are
some problems: I would need to relink the user's code, registering an
atexit function is tricky from Fortran code, and I would still need to know
whether MPI_Finalize has been called before my atexit routine runs (and, as
it turns out, MPI_Init as well; otherwise there are problems with things
like call system)...
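
A minimal sketch of the check described above, assuming the handler can be registered from C in the first place. MPI_Initialized and MPI_Finalized are standard MPI calls that may be made at any time; the helper names should_abort and abort_if_not_finalized and the USE_MPI build switch are made up for this illustration. Whether MPI_Abort may safely be called from inside an exit handler depends on the MPI implementation, and atexit handlers do not run at all when the runtime calls _exit.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef USE_MPI            /* hypothetical switch: build with mpicc -DUSE_MPI */
#include <mpi.h>
#endif

/* Abort only if MPI was started and never shut down. This also covers
 * exits that happen before MPI_Init (e.g. in children of "call system"). */
static int should_abort(int initialized, int finalized)
{
    return initialized && !finalized;
}

#ifdef USE_MPI
/* Exit handler: ask the MPI library for its state instead of tracking it. */
static void abort_if_not_finalized(void)
{
    int initialized = 0, finalized = 0;

    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (should_abort(initialized, finalized)) {
        fprintf(stderr, "exit without MPI_Finalize, calling MPI_Abort\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    int rank;

    atexit(abort_if_not_finalized);   /* registered before MPI_Init */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        exit(1);                      /* handler runs here, not on _exit */

    MPI_Finalize();
    return 0;                         /* handler runs, sees finalized state */
}
#endif
```

Without -DUSE_MPI only the decision helper is compiled, so it can be checked without an MPI installation.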

Best regards
Daniel

On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <stork_at_[hidden]> wrote:

> Instead of playing dirty tricks with the library, you could try to
> register a cleanup function with atexit.
>
> Thanks,
> Sven
>
> On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
>> Dear George,
>>
>> I think that the best way is to call MPI_Abort. However, this forces the
>> user to modify the code, which I have already suggested. Their
>> application is not calling exit directly; I merely wrote the simplest
>> code that demonstrates the problem. Their application is a Fortran
>> program, and during file IO, when something bad happens, the Fortran
>> runtime (pgi) calls exit (and sometimes _exit for some reason). The file
>> IO is only done in one process. I have told them to try adding
>> ERR=lineno,END=lineno, where the code at lineno calls MPI_Abort. This has
>> not happened yet. Nevertheless, openmpi does not terminate the
>> application when one of the processes exits without MPI_Finalize,
>> contrary to the content of the mpirun man-page. I have currently "solved"
>> the problem by writing a .so that is LD_PRELOAD:ed, checking whether
>> MPI_Finalize is indeed called between MPI_Init and exit/_exit. I'd rather
>> not keep this "solution" for too long. If it is indeed so that the mpirun
>> man-page is wrong and the code right, I'd rather push the proper
>> error-handling solution.
>>
>> Best regards
>> Daniel Spångberg
>>
>>
>> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosilca_at_[hidden]> wrote:
>>
>> > The MPI standard states that the correct way to abort/kill an MPI
>> > application is to use the MPI_Abort function. Unless you're doing
>> > some kind of fault tolerance work, there is no reason to end one of
>> > your MPI processes via exit.
>> >
>> > Thanks,
>> > george.
>> >
>> > On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
>> >
>> >> Dear Open-MPI user list members,
>> >>
>> >> I currently have a user with an application where one of the
>> >> MPI processes dies, but the openmpi system does not kill the rest of
>> >> the application.
>> >>
>> >> Since the mpirun man page states the following, I would expect it to
>> >> take care of killing the application if a process exits without
>> >> calling MPI_Finalize:
>> >>
>> >>     Process Termination / Signal Handling
>> >>         During the run of an MPI application, if any rank dies
>> >>         abnormally (either exiting before invoking MPI_FINALIZE, or
>> >>         dying as the result of a signal), mpirun will print out an
>> >>         error message and kill the rest of the MPI application.
>> >>
>> >> The following test program demonstrates the behaviour (the program
>> >> hangs until it is killed by the user or batch system):
>> >>
>> >> #include <stdio.h>
>> >> #include <stdlib.h>
>> >> #include <unistd.h>
>> >> #include <mpi.h>
>> >>
>> >> #define RANK_DEATH 1
>> >>
>> >> int main(int argc, char **argv)
>> >> {
>> >>   int rank;
>> >>   MPI_Init(&argc,&argv);
>> >>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>> >>
>> >>   sleep(10);
>> >>   if (rank==RANK_DEATH)
>> >>     exit(1);
>> >>   sleep(10);
>> >>   MPI_Finalize();
>> >>   return 0;
>> >> }
>> >>
>> >> I have tested this on openmpi 1.2.1 as well as the latest stable
>> >> 1.2.3. I am on Linux x86_64.
>> >>
>> >> Is this a bug, or are there some flags I can use to force mpirun (or
>> >> orted, or...) to kill the whole MPI program when this happens?
>> >>
>> >> If one of the application processes dies from a signal (I have
>> >> tested SEGV and FPE) rather than just exiting, the whole application
>> >> is indeed killed.
>> >>
>> >> Best regards
>> >> Daniel Spångberg
>> >> _______________________________________________
>> >> users mailing list
>> >> users_at_[hidden]
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users