
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Bug or feature?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-12-16 13:27:15


I think I understand what you're saying:

- it's ok to abort during MPI_INIT (we can rationalize it as the default error handler)
- we should only abort during MPI functions

Is that right? If so, I agree with your interpretation. :-) ...with one addition: it's ok to abort before MPI_INIT, because the MPI spec makes no guarantees about what happens before MPI_INIT.

Specifically, I'd argue that if you "mpirun -np N a.out" and at least 1 process calls MPI_INIT, then it is reasonable for OMPI to expect there to be N MPI_INIT's. If any process exits without calling MPI_INIT -- regardless of that process' exit status -- it should be treated as an error.
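
To make the scenario concrete, here's a minimal reproducer of the kind of application in question. This is just a sketch: the pre-MPI_INIT rank check via the OMPI_COMM_WORLD_RANK environment variable is an assumption (the process can't call MPI_Comm_rank before MPI_INIT, so it has to guess its rank some other way):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      /* One process decides to quit before MPI_Init.  We guess the rank
       * from the environment (OMPI_COMM_WORLD_RANK -- an assumption here)
       * because MPI_Comm_rank cannot be called before MPI_Init. */
      const char *rank = getenv("OMPI_COMM_WORLD_RANK");
      if (rank != NULL && atoi(rank) == 0) {
          printf("rank 0: exiting cleanly before MPI_Init\n");
          return 0;   /* clean exit status, but MPI_Init was never called */
      }

      /* Everyone else enters MPI_Init and waits for rank 0: the hang. */
      MPI_Init(&argc, &argv);
      MPI_Finalize();
      return 0;
  }

Run it with "mpirun -np 4 a.out": rank 0 exits with status 0, the other three block in MPI_INIT, and nothing currently aborts the job.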

Don't forget that we have a barrier in MPI_INIT (in most cases), so aborting when ORTE detects that a) at least one process has called MPI_INIT, and b) at least one process has exited without calling MPI_INIT, is acceptable to me. It's also consistent with the first point above, because all the other processes are either stuck in MPI_INIT (at the barrier or getting there) or haven't yet entered MPI_INIT -- and the MPI spec makes no guarantees about what happens before MPI_INIT.
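
In pseudocode, the abort rule boils down to something like the sketch below. This is purely illustrative -- proc_state_t and the field names are invented for this example, not actual ORTE data structures:

  /* Hypothetical per-process bookkeeping; not real ORTE code. */
  typedef struct {
      int called_mpi_init;   /* process has reached MPI_INIT */
      int exited;            /* process has terminated */
  } proc_state_t;

  /* Abort the job iff at least one process entered MPI_INIT while
   * another exited without ever calling it.  The exit status of the
   * early-exiting process is deliberately ignored. */
  int should_abort_job(const proc_state_t *procs, int nprocs)
  {
      int any_init = 0, any_exit_no_init = 0;
      for (int i = 0; i < nprocs; i++) {
          if (procs[i].called_mpi_init) any_init = 1;
          if (procs[i].exited && !procs[i].called_mpi_init) any_exit_no_init = 1;
      }
      return any_init && any_exit_no_init;
  }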

Does that make sense?

On Dec 16, 2009, at 10:06 AM, George Bosilca wrote:

> There are two citations from the MPI standard that I would like to highlight.
>
> > All MPI programs must contain exactly one call to an MPI initialization routine: MPI_INIT or MPI_INIT_THREAD.
>
> > One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.
>
> While these two statements do not necessarily clarify the original question, they highlight an acceptable solution. Before exiting the MPI_Init function (which we don't have to assume is collective), any "MPI-like" process can be killed without problems (we can even claim that we called the default error handler). For those that successfully exited MPI_Init, I guess the next MPI call will have to trigger the error handler, and these processes should be allowed to gracefully exit.
>
> So, while it is clear that the best approach is to allow even a bad application to terminate, it is better if we follow what MPI describes as a "high quality implementation".
>
> george.
>
>
> On Dec 15, 2009, at 23:17 , Ralph Castain wrote:
>
> > Understandable - and we can count on your patch in the near future, then? :-)
> >
> > On Dec 15, 2009, at 9:12 PM, Paul H. Hargrove wrote:
> >
> >> My 0.02USD says that for pragmatic reasons one should attempt to terminate the job in this case, regardless of one's opinion of this unusual application behavior.
> >>
> >> -Paul
> >>
> >> Ralph Castain wrote:
> >>> Hi folks
> >>>
> >>> In case you didn't follow this on the user list, we had a question come up about proper OMPI behavior. Basically, the user has an application where one process decides it should cleanly terminate prior to calling MPI_Init, but all the others go ahead and enter MPI_Init. The application hangs since we don't detect the one proc's exit as an abnormal termination (no segfault, and it didn't call MPI_Init so it isn't required to call MPI_Finalize prior to termination).
> >>>
> >>> I can probably come up with a way to detect this scenario and abort it. But before I spend the effort chasing this down, my question to you MPI folks is:
> >>>
> >>> What -should- OMPI do in this situation? We have never previously detected such behavior - was this an oversight, or is this simply a "bad" application?
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>
> >> --
> >> Paul H. Hargrove PHHargrove_at_[hidden]
> >> Future Technologies Group Tel: +1-510-495-2352
> >> HPC Research Department Fax: +1-510-486-6900
> >> Lawrence Berkeley National Laboratory

-- 
Jeff Squyres
jsquyres_at_[hidden]