
Open MPI User's Mailing List Archives



From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-09-05 21:01:11


I opened https://svn.open-mpi.org/trac/ompi/ticket/1144 to track this
issue.

On Aug 20, 2007, at 9:04 AM, Daniel Spångberg wrote:

> Dear Sven,
>
> I thought about doing that and experimented a bit as well, but there are
> some problems then: I need to relink the user's code, registering an atexit
> function is tricky from the Fortran code, and I still need to know whether
> MPI_Finalize (and, as it turns out, MPI_Init as well, otherwise there are
> problems with things like call system) has been called before my atexit
> routine is called...
>
> Best regards
> Daniel
>
> On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <stork_at_[hidden]> wrote:
>
>> Instead of doing dirty tricks with the library, you could try to register
>> a cleanup function with atexit.
>>
>> Thanks,
>> Sven
>>
>> On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
>>> Dear George,
>>>
>>> I think that the best way is to call MPI_Abort. However, this forces the
>>> user to modify the code, which I have already suggested. But their
>>> application is not calling exit directly; I merely wrote the simplest code
>>> that demonstrates the problem. Their application is a Fortran program, and
>>> during file IO, when something bad happens, the Fortran runtime (pgi)
>>> calls exit (and sometimes _exit for some reason). The file IO is only done
>>> in one process. I have told them to try to add ERR=lineno,END=lineno,
>>> where the code at lineno calls MPI_Abort. This has not happened yet.
>>> Nevertheless, openmpi does not terminate the application when one of the
>>> processes exits without MPI_Finalize, contrary to the content of the
>>> mpirun man-page. I have currently "solved" the problem by writing a .so
>>> that is LD_PRELOAD:ed, checking whether MPI_Finalize is indeed called
>>> between MPI_Init and exit/_exit. I'd rather not keep this "solution" for
>>> too long. If it is indeed so that the mpirun man-page is wrong and the
>>> code right, I'd rather push the proper error-handling solution.
>>>
>>> Best regards
>>> Daniel Spångberg
>>>
>>>
>>> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosilca_at_[hidden]> wrote:
>>>
>>>> The MPI standard states that the correct way to abort/kill an MPI
>>>> application is to use the MPI_Abort function. Unless you're doing
>>>> some kind of fault-tolerance work, there is no reason to end one of
>>>> your MPI processes via exit.
>>>>
>>>> Thanks,
>>>> george.
>>>>
>>>> On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
>>>>
>>>>> Dear Open-MPI user list members,
>>>>>
>>>>> I currently have a user with an application where one of the
>>>>> MPI processes dies, but the openmpi system does not kill the rest of
>>>>> the application.
>>>>>
>>>>> Since the mpirun man page states the following, I would expect it to
>>>>> take care of killing the application if a process exits without calling
>>>>> MPI_Finalize:
>>>>>
>>>>> Process Termination / Signal Handling
>>>>>     During the run of an MPI application, if any rank dies abnormally
>>>>>     (either exiting before invoking MPI_FINALIZE, or dying as the
>>>>>     result of a signal), mpirun will print out an error message and
>>>>>     kill the rest of the MPI application.
>>>>>
>>>>> The following test program demonstrates the behaviour (the program
>>>>> hangs until it is killed by the user or batch system):
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> #define RANK_DEATH 1
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>
>>>>>     sleep(10);
>>>>>     if (rank == RANK_DEATH)
>>>>>         exit(1);
>>>>>     sleep(10);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> I have tested this on openmpi 1.2.1 as well as the latest stable 1.2.3.
>>>>> I am on Linux x86_64.
>>>>>
>>>>> Is this a bug, or are there some flags I can use to force mpirun (or
>>>>> orted, or...) to kill the whole MPI program when this happens?
>>>>>
>>>>> If one of the application processes dies from a signal (I have tested
>>>>> SEGV and FPE) rather than just exiting, the whole application is indeed
>>>>> killed.
>>>>>
>>>>> Best regards
>>>>> Daniel Spångberg
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems
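
The atexit idea discussed in the quoted thread, together with Daniel's point that the handler must know whether MPI_Init and MPI_Finalize have already been called, could be sketched roughly as follows. This is only an illustration, not the LD_PRELOAD library mentioned above and not anything from Open MPI itself. The names exited_inside_mpi, abort_if_mpi_active, install_mpi_exit_guard and the USE_MPI macro are hypothetical. MPI_Initialized, MPI_Finalized and MPI_Abort are standard MPI calls. Whether MPI_Abort behaves well when called from an exit handler is an assumption that would need checking against the MPI implementation in use, and a handler registered with atexit does not run at all when the runtime calls _exit.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_MPI          /* hypothetical macro: define it when building against MPI */
#include <mpi.h>
#endif

/* Hypothetical helper: true when the process is leaving after MPI_Init
 * but before MPI_Finalize, i.e. the case described in the thread. */
int exited_inside_mpi(int initialized, int finalized)
{
    return initialized && !finalized;
}

#ifdef USE_MPI
/* Exit handler: query the MPI state and abort the job if MPI is still active.
 * Assumption: calling MPI_Abort here is acceptable for the MPI library used. */
static void abort_if_mpi_active(void)
{
    int initialized = 0;
    int finalized = 0;

    /* Both queries are valid before MPI_Init and after MPI_Finalize. */
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);

    if (exited_inside_mpi(initialized, finalized)) {
        fprintf(stderr, "exit() reached without MPI_Finalize; aborting job\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

/* Hypothetical registration hook: call once, for example right after MPI_Init.
 * Returns 0 on success, like atexit. */
int install_mpi_exit_guard(void)
{
    return atexit(abort_if_mpi_active);
}
#endif
```

In the test program from the original report, calling install_mpi_exit_guard() right after MPI_Init would be one way to try this. It does not remove the need to relink the user's code, and it does not cover the _exit case, which are the limitations Daniel raises.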