Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Bug or feature?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-17 15:23:08


Well, that stinks! I'll take another look at it - it was working for me, but... I've been there before!

On Dec 17, 2009, at 1:13 PM, George Bosilca wrote:

> Ralph,
>
> There seem to be some problems after this commit. The hello_world application (the MPI flavor) completes and I get all the output, but in addition I get a nice message stating that my MPI application didn't call MPI_Init.
>
> [bosilca_at_dancer c]$ mpirun -np 8 --mca pml ob1 ./hello
> Hello, world, I am 5 of 8 on node04
> Hello, world, I am 7 of 8 on node04
> Hello, world, I am 0 of 8 on node03
> Hello, world, I am 1 of 8 on node03
> Hello, world, I am 3 of 8 on node03
> Hello, world, I am 6 of 8 on node04
> Hello, world, I am 2 of 8 on node03
> Hello, world, I am 4 of 8 on node04
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 6 with PID 15398 on
> node node04 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> george.
>
> On Dec 17, 2009, at 14:42 , Ralph Castain wrote:
>
>> Okay, this "feature" has now been added to the devel trunk with r22329.
>>
>> Feel free to test it and let me know of problems. I have a test code for it in orte/test/mpi/early_abort.c
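>>
>> In case it helps to see the shape of it, here is a minimal sketch of such a test (hypothetical - the actual early_abort.c may differ, and I'm assuming mpirun sets OMPI_COMM_WORLD_RANK in each process' environment):
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     /* We cannot ask MPI for our rank before MPI_Init, so peek at the
>>        environment variable that mpirun sets for each process */
>>     char *rank_str = getenv("OMPI_COMM_WORLD_RANK");
>>     int rank = (NULL != rank_str) ? atoi(rank_str) : 0;
>>
>>     if (0 == rank) {
>>         /* terminate "cleanly" without ever calling MPI_Init */
>>         exit(0);
>>     }
>>
>>     MPI_Init(&argc, &argv);
>>     printf("Rank %d made it through MPI_Init\n", rank);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> Before this commit the remaining ranks would hang in MPI_Init; with it, mpirun should abort the job and report the rank that exited without calling "init".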
>>
>> On Dec 16, 2009, at 11:27 AM, Jeff Squyres wrote:
>>
>>> I think I understand what you're saying:
>>>
>>> - it's ok to abort during MPI_INIT (we can rationalize it as the default error handler)
>>> - we should only abort during MPI functions
>>>
>>> Is that right? If so, I agree with your interpretation. :-) ...with one addition: it's ok to abort before MPI_INIT, because the MPI spec makes no guarantees about what happens before MPI_INIT.
>>>
>>> Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 process calls MPI_INIT, then it is reasonable for OMPI to expect there to be N MPI_INIT's. If any process exits without calling MPI_INIT -- regardless of that process' exit status -- it should be treated as an error.
>>>
>>> Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting when ORTE detects that a) at least one process has called MPI_INIT, and b) at least one process has exited without calling MPI_INIT, is acceptable to me. It is also consistent with the first point above, because all the other processes are either stuck in MPI_INIT (at the barrier or getting there) or haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees about what happens before MPI_INIT.
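>>>
>>> In pseudo-C, the rule amounts to something like this (the names are illustrative, not actual ORTE code):
>>>
>>> enum proc_state {
>>>     NOT_YET_STARTED,     /* launched, has not reached MPI_INIT */
>>>     REACHED_INIT,        /* called MPI_INIT (in the barrier or past it) */
>>>     EXITED_WITHOUT_INIT  /* terminated without ever calling MPI_INIT */
>>> };
>>>
>>> /* Abort the job iff at least one proc reached MPI_INIT and at
>>>    least one proc exited without ever reaching it */
>>> static int job_should_abort(const enum proc_state *procs, int nprocs)
>>> {
>>>     int i, saw_init = 0, saw_early_exit = 0;
>>>     for (i = 0; i < nprocs; i++) {
>>>         if (REACHED_INIT == procs[i]) saw_init = 1;
>>>         if (EXITED_WITHOUT_INIT == procs[i]) saw_early_exit = 1;
>>>     }
>>>     return saw_init && saw_early_exit;
>>> }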
>>>
>>> Does that make sense?
>>>
>>>
>>>
>>> On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:
>>>
>>>> There are two citations from the MPI standard that I would like to highlight.
>>>>
>>>>> All MPI programs must contain exactly one call to an MPI initialization routine: MPI_INIT or MPI_INIT_THREAD.
>>>>
>>>>> One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.
>>>>
>>>> While these two statements do not necessarily clarify the original question, they suggest an acceptable solution. Before exiting the MPI_Init function (which we don't have to assume is collective), any "MPI-like" process can be killed without problems (we can even claim that we invoke the default error handler). For those that successfully exited MPI_Init, I guess the next MPI call will have to trigger the error handler, and those processes should be allowed to exit gracefully.
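>>>>
>>>> Concretely (just a sketch - how quickly a peer's death is noticed is implementation dependent): MPI_COMM_WORLD defaults to MPI_ERRORS_ARE_FATAL, so the next MPI call that observes the failure will abort the job. A process that wanted to exit gracefully instead could do something like:
>>>>
>>>> /* ask for errors to be returned instead of aborting the job */
>>>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>>
>>>> if (MPI_SUCCESS != MPI_Barrier(MPI_COMM_WORLD)) {
>>>>     /* a peer is gone (or some other error occurred); clean up
>>>>        and leave - note MPI_Finalize itself may fail if peers
>>>>        have already exited */
>>>>     exit(1);
>>>> }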
>>>>
>>>> So, while it is clear that the best approach is to allow even bad applications to terminate, it is better if we follow what MPI describes as a "high quality implementation".
>>>>
>>>> george.
>>>>
>>>>
>>>> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>>>>
>>>>> Understandable - and we can count on your patch in the near future, then? :-)
>>>>>
>>>>> On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
>>>>>
>>>>>> My 0.02USD says that for pragmatic reasons one should attempt to terminate the job in this case, regardless of one's opinion of this unusual application behavior.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Hi folks
>>>>>>>
>>>>>>> In case you didn't follow this on the user list, we had a question come up about proper OMPI behavior. Basically, the user has an application where one process decides it should cleanly terminate prior to calling MPI_Init, but all the others go ahead and enter MPI_Init. The application hangs since we don't detect the one proc's exit as an abnormal termination (no segfault, and it didn't call MPI_Init so it isn't required to call MPI_Finalize prior to termination).
>>>>>>>
>>>>>>> I can probably come up with a way to detect this scenario and abort it. But before I spend the effort chasing this down, my question to you MPI folks is:
>>>>>>>
>>>>>>> What -should- OMPI do in this situation? We have never previously detected such behavior - was this an oversight, or is this simply a "bad" application?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>>>> Future Technologies Group Tel: +1-510-495-2352
>>>>>> HPC Research Department Fax: +1-510-486-6900
>>>>>> Lawrence Berkeley National Laboratory
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]