Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-09-05 21:01:11


I opened https://svn.open-mpi.org/trac/ompi/ticket/1144 to track this
issue.

On Aug 20, 2007, at 9:04 AM, Daniel Spångberg wrote:

> Dear Sven,
>
> I thought about doing that and experimented a bit as well, but there are
> some problems with that approach: I need to relink the user's code,
> registering an atexit function is tricky from the Fortran code, and I still
> need to know whether MPI_Finalize (and, as it turns out, MPI_Init as well,
> otherwise there are problems with things like call system) has been called
> before my atexit routine is called...
>
> Best regards
> Daniel
>
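As an illustration of the atexit approach discussed above, a minimal sketch
could look like the following. The shim name register_mpi_exit_guard is
hypothetical; MPI_Initialized and MPI_Finalized are the standard calls that
may be used at any time to query the library state.

#include <stdlib.h>
#include <mpi.h>

/* atexit handler: if the process reaches exit() between MPI_Init and
   MPI_Finalize, tear down the whole job instead of leaving the other
   ranks hanging. */
static void abort_if_mpi_still_running(void)
{
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (initialized && !finalized) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

/* Hypothetical shim to be called right after MPI_Init, e.g. from a small
   C routine linked with the Fortran application. */
void register_mpi_exit_guard(void)
{
    atexit(abort_if_mpi_still_running);
}

Note that this still does not catch _exit(), which bypasses atexit handlers,
nor termination by a signal.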
> On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <stork_at_[hidden]> wrote:
>
>> Instead of doing dirty tricks with the library, you could try to register
>> a cleanup function with atexit.
>>
>> Thanks,
>> Sven
>>
>> On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
>>> Dear George,
>>>
>>> I think the best way is to call MPI_Abort. However, this forces the user
>>> to modify the code, which I have already suggested. Their application is
>>> not calling exit directly; I merely wrote the simplest code that
>>> demonstrates the problem. Their application is a Fortran program, and
>>> during file I/O, when something bad happens, the Fortran runtime (PGI)
>>> calls exit (and sometimes _exit, for some reason). The file I/O is only
>>> done in one process. I have told them to try adding ERR=lineno,END=lineno,
>>> where the code at lineno calls MPI_Abort. This has not happened yet.
>>> Nevertheless, Open MPI does not terminate the application when one of the
>>> processes exits without calling MPI_Finalize, contrary to the content of
>>> the mpirun man page. I have currently "solved" the problem by writing a
>>> .so that is LD_PRELOAD'ed and checks whether MPI_Finalize is indeed called
>>> between MPI_Init and exit/_exit. I'd rather not keep this "solution" for
>>> too long. If the mpirun man page is indeed wrong and the code is right,
>>> I'd rather push for the proper error-handling solution.
>>>
>>> Best regards
>>> Daniel Spångberg
>>>
>>>
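As an illustration of the LD_PRELOAD workaround described above (this is only
a sketch, not Daniel's actual library; the file name mpi_exit_guard.c and the
build line are made up for the example), an interposed exit() could check the
MPI state and call MPI_Abort before handing control to the real exit():

/* Build and run, roughly:
     mpicc -shared -fPIC -o mpi_exit_guard.so mpi_exit_guard.c -ldl
     LD_PRELOAD=./mpi_exit_guard.so mpirun -np 4 ./a.out              */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <mpi.h>

void exit(int status)
{
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (initialized && !finalized) {
        /* Exiting mid-run: abort the whole job so no rank is left hanging. */
        MPI_Abort(MPI_COMM_WORLD, status ? status : 1);
    }
    /* Normal case: forward to the real exit() from libc. */
    void (*real_exit)(int) = (void (*)(int))dlsym(RTLD_NEXT, "exit");
    real_exit(status);
}

A similar wrapper can be written for _exit(), which skips atexit handlers.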
>>> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosilca_at_[hidden]> wrote:
>>>
>>>> The MPI standard states that the correct way to abort/kill an MPI
>>>> application is to use the MPI_Abort function. Unless you're doing some
>>>> kind of fault-tolerance work, there is no reason to end one of your MPI
>>>> processes via exit.
>>>>
>>>> Thanks,
>>>> george.
>>>>
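A minimal sketch of the pattern George describes above, using a failed file
open as the example error (the file name input.dat is made up), would be:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    FILE *f = fopen("input.dat", "r");
    if (f == NULL) {
        /* Local failure: abort the whole job instead of calling exit(),
           so mpirun kills every rank rather than leaving them hanging. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    fclose(f);

    MPI_Finalize();
    return 0;
}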
>>>> On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
>>>>
>>>>> Dear Open-MPI user list members,
>>>>>
>>>>> I currently have a user with an application where one of the MPI
>>>>> processes dies, but Open MPI does not kill the rest of the application.
>>>>>
>>>>> Since the mpirun man page states the following, I would expect it to
>>>>> take care of killing the application if a process exits without calling
>>>>> MPI_Finalize:
>>>>>
>>>>>    Process Termination / Signal Handling
>>>>>        During the run of an MPI application, if any rank dies abnormally
>>>>>        (either exiting before invoking MPI_FINALIZE, or dying as the
>>>>>        result of a signal), mpirun will print out an error message and
>>>>>        kill the rest of the MPI application.
>>>>>
>>>>> The following test program demonstrates the behaviour (the program hangs
>>>>> until it is killed by the user or the batch system):
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> #define RANK_DEATH 1
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>   int rank;
>>>>>   MPI_Init(&argc,&argv);
>>>>>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>>>
>>>>>   sleep(10);
>>>>>   if (rank==RANK_DEATH)
>>>>>     exit(1);
>>>>>   sleep(10);
>>>>>   MPI_Finalize();
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> I have tested this on Open MPI 1.2.1 as well as the latest stable 1.2.3,
>>>>> on Linux x86_64.
>>>>>
>>>>> Is this a bug, or are there some flags I can use to force mpirun (or
>>>>> orted, or ...) to kill the whole MPI program when this happens?
>>>>>
>>>>> If one of the application processes dies from a signal (I have tested
>>>>> SEGV and FPE) rather than just exiting, the whole application is indeed
>>>>> killed.
>>>>>
>>>>> Best regards
>>>>> Daniel Spångberg
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems