
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-06-10 08:53:24


On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:

> Why would this patch result in zombied processes and poor cleanup?
> When ORTE receives notification of a process terminating/aborting, it
> triggers the termination of the job (without UTK's RFC), which
> should ensure a clean shutdown. This patch just tells ORTE that a few
> other processes should be the first to die, which will trigger the
> same response in ORTE.
>
> I guess I'm unclear about this concern, since it would apply to the
> current ORTE as well. I agree that it will be a concern once
> we have the OMPI layer handling error management triggered off of a
> callback, but that is a different RFC.

My comment was about "the future" - i.e., looking ahead to the point where we get layered, rolling aborts.

I agree that this specific RFC won't change the current behavior, and as I said, I have no issue with it.

>
>
> Something that might help those listening to this thread. The current
> behavior of MPI_Abort in OMPI results in the semantics of:
> --------------
> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
> --------------
> regardless of the communicator actually passed to the MPI_Abort at the
> application level. It should be:
> --------------
> internal_MPI_Abort(comm_provided, exit_code)
> --------------
>
> Semantically, this patch just makes the group actually being aborted
> match the communicator provided. In practice, the job will
> terminate when any process in the job calls abort - so the result (in
> today's codebase) will be the same.
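>
> For those skimming the thread, a minimal (and purely illustrative)
> example of the case in question; with the patch, the abort request
> covers the group of "half" rather than just the calling process:
> --------------
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     MPI_Comm half;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* split MPI_COMM_WORLD into two halves */
>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);
>
>     if (0 == rank) {
>         /* per Section 8.7, this should make a best attempt to
>          * abort only the group of "half" */
>         MPI_Abort(half, 1);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
> --------------
> (Today, of course, the whole job dies either way.)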
>
> -- Josh
>
>
> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again....
>>
>>
>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>
>>> Ah, I see what you are getting at now.
>>>
>>> The construction of the list of connected processes is something I intentionally did not modify from the current Open MPI code. The list is calculated from the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator, as known to the calling process at the OMPI level.
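>>>
>>> Roughly, the computation looks like this (a simplified sketch with
>>> approximate names, not the literal code):
>>> --------------
>>> /* sketch: count the procs in the communicator's groups */
>>> int nprocs = ompi_comm_size(comm);            /* local group  */
>>> if (OMPI_COMM_IS_INTER(comm)) {
>>>     nprocs += ompi_comm_remote_size(comm);    /* remote group */
>>> }
>>> /* walk both groups, recording the name (jobid/vpid) of each
>>>  * member; that list is the abort request handed to ORTE */
>>> --------------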
>>>
>>> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the affected communicator.
>>>
>>> So this has the possibility of a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. Depending upon the error handlers at the user level, the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>>>
>>> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, each can then determine if it is transitively affected by the failure, based upon its local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime. Since ORTE (today) will kill the job for OMPI, this is not a big deal for end users - the job will terminate anyway. Once we have the callback, we can finish tightening up the OMPI layer code.
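>>>
>>> To make the decision step concrete, here is a self-contained sketch
>>> of the check each surviving process would perform on notification
>>> (all types and names here are invented for illustration; this is
>>> not the OMPI internals):
>>> --------------
>>> #include <stdbool.h>
>>>
>>> typedef enum { ERRORS_ARE_FATAL, ERRORS_RETURN } errhandler_t;
>>>
>>> typedef struct {
>>>     errhandler_t errhandler;  /* handler attached to this comm    */
>>>     int          nmembers;
>>>     int         *members;     /* ids of the processes in the comm */
>>> } comm_t;
>>>
>>> /* Called when the runtime reports that process `failed` aborted.
>>>  * Returns true if the local process must abort as well -- which in
>>>  * turn generates the next notification: the rolling effect. */
>>> bool failure_is_fatal_locally(comm_t *comms, int ncomms, int failed)
>>> {
>>>     for (int c = 0; c < ncomms; c++) {
>>>         for (int m = 0; m < comms[c].nmembers; m++) {
>>>             if (comms[c].members[m] == failed &&
>>>                 ERRORS_ARE_FATAL == comms[c].errhandler) {
>>>                 return true;   /* shared a fatal-errors comm */
>>>             }
>>>         }
>>>     }
>>>     return false;              /* survivable; keep running */
>>> }
>>> --------------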
>>>
>>> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>>>
>>>
>>> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected groups code paths.
>>>
>>>
>>> -- Josh
>>>
>>>
>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>>
>>>> What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes.
>>>>
>>>> With this RFC, the processes in the communicator passed to MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact, jobid) will be notified (if we suppose that ORTE will not distinguish between aborted and faulty). As a result, the entire MPI_COMM_WORLD will be aborted, if we consider a sane application where everyone uses the same type of error handler. However, this is not enough. We have to distribute the abort signal to every other "connected" process, and I don't see how we can compute this list of connected processes in Open MPI today. It is not that I don't see it in your patch; it is that the definition of connectivity in the MPI standard is transitive and relies heavily on a correct implementation of MPI_Comm_disconnect.
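>>>>
>>>> As a concrete example of why disconnect matters (client side only;
>>>> assume a server job has published the name "server"):
>>>> --------------
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     char port[MPI_MAX_PORT_NAME];
>>>>     MPI_Comm inter;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>
>>>>     MPI_Lookup_name("server", MPI_INFO_NULL, port);
>>>>     MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>
>>>>     /* the two jobs are now "connected" (Section 10.5.4), so an
>>>>      * MPI_Abort in either must make a best attempt to reach both */
>>>>
>>>>     MPI_Comm_disconnect(&inter);
>>>>     /* only now are the jobs no longer connected */
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>> --------------
>>>> Computing the transitive closure of such connections correctly is
>>>> exactly what I do not see how we can do today.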
>>>>
>>>> george.
>>>>
>>>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>>>
>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>> If this changes the behavior of MPI_Abort to abort only the processes in the specified communicator, how does this not affect the default user experience (when today it aborts everything)?
>>>>>
>>>>> Open MPI does abort everything by default - decided by the runtime at
>>>>> the moment (but addressed in your RFC). So it does not matter if one
>>>>> process aborts or if many do. The behavior of MPI_Abort experienced
>>>>> by the user will therefore not change. Effectively, the only change is an extra
>>>>> message in the runtime before the process actually calls
>>>>> errmgr.abort().
>>>>>
>>>>> This branch just makes the implementation complete by first telling
>>>>> ORTE that a group of processes, defined by the communicator, should be
>>>>> terminated along with the calling process. Currently ORTE notices that
>>>>> there was an abort, and terminates the job. Once your RFC goes through,
>>>>> this may no longer be the case, and OMPI can determine what to do
>>>>> when it receives a process failure notification.
>>>>>
>>>>>>
>>>>>> If we accept the fact that MPI_Abort will only abort the processes in the current communicator what happens with the other processes in the same MPI_COMM_WORLD (but not on the communicator that has been used by MPI_Abort)?
>>>>>
>>>>> Currently, ORTE will abort them as well. When your RFC goes through,
>>>>> the OMPI layer will be notified of the error and can take the
>>>>> appropriate action, as determined by the MPI standard.
>>>>>
>>>>>> What about all the other connected processes (based on the connectivity as defined in the MPI standard in Section 10.5.4)? Do they see this as a fault?
>>>>>
>>>>> They are informed of the fault via the ORTE errmgr callback routine
>>>>> (that we have an RFC for), and then can take the appropriate action
>>>>> based on MPI semantics. So we are pushing the decision of the
>>>>> implication of the fault to the OMPI layer - where it should be.
>>>>>
>>>>>
>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other
>>>>> connected error management scenarios is not included in this patch
>>>>> since that depends on there being a callback to the OMPI layer - which
>>>>> does not exist just yet. So this is a small patch that wires in the
>>>>> ORTE piece, allowing the OMPI layer to request that a set of processes
>>>>> be terminated - to more accurately support MPI_Abort semantics.
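>>>>>
>>>>> For context, this is the sort of user code that the deferred piece
>>>>> will eventually have to honor (illustrative only):
>>>>> --------------
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rc, buf = 42;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     /* user opts out of the default abort-on-error behavior */
>>>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>>>
>>>>>     rc = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>>>>     if (MPI_SUCCESS != rc) {
>>>>>         /* with the future callback to the OMPI layer, the error
>>>>>          * can be handled here instead of the runtime killing
>>>>>          * the whole job */
>>>>>         fprintf(stderr, "send failed: %d\n", rc);
>>>>>     }
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>> --------------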
>>>>>
>>>>> Does that answer your questions?
>>>>>
>>>>> -- Josh
>>>>>
>>>>>
>>>>>>
>>>>>> george.
>>>>>>
>>>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>>>>
>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>>
>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process
>>>>>>> group defined by the communicator
>>>>>>>
>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>>
>>>>>>> WHEN: Open MPI trunk
>>>>>>>
>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>>
>>>>>>> Details:
>>>>>>> -------------------------------------------
>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>>
>>>>>>> In the MPI Standard (v2.2) Section 8.7 after the introduction of
>>>>>>> MPI_Abort, it states:
>>>>>>> "This routine makes a best attempt to abort all tasks in the group of comm."
>>>>>>>
>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling
>>>>>>> process itself. The code to ask for the abort of the other processes
>>>>>>> in the group defined by the communicator is commented out. Since one
>>>>>>> process calling abort currently causes all processes in the job to
>>>>>>> abort, it has not been a big deal. However, as the group starts
>>>>>>> exploring better resilience in the OMPI layer (with further support
>>>>>>> from the ORTE layer) this aspect of MPI_Abort will become more
>>>>>>> necessary to get right.
>>>>>>>
>>>>>>> This branch adds back the logic necessary for a single process calling
>>>>>>> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
>>>>>>> processes be aborted. Once the request is sent to the HNP, the local
>>>>>>> process then calls abort on itself. The HNP requests that the defined
>>>>>>> subgroup of processes be terminated using the existing plm mechanisms
>>>>>>> for doing so.
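>>>>>>>
>>>>>>> In outline, the added control flow is (function and type names
>>>>>>> invented here for illustration; see the branch for the real code):
>>>>>>> --------------
>>>>>>> /* invented stand-ins for the ORTE RML/PLM pieces */
>>>>>>> typedef struct { unsigned jobid, vpid; } proc_name_t;
>>>>>>> void send_abort_request_to_hnp(proc_name_t *g, int n);
>>>>>>> void local_errmgr_abort(int exit_code);
>>>>>>> void plm_terminate_procs(proc_name_t *g, int n);
>>>>>>>
>>>>>>> /* at the process calling MPI_Abort */
>>>>>>> void abort_path(proc_name_t *group, int n, int exit_code)
>>>>>>> {
>>>>>>>     send_abort_request_to_hnp(group, n); /* 1. group of comm */
>>>>>>>     local_errmgr_abort(exit_code);       /* 2. then self     */
>>>>>>> }
>>>>>>>
>>>>>>> /* at the HNP, upon receiving the request */
>>>>>>> void hnp_abort_request_recv(proc_name_t *group, int n)
>>>>>>> {
>>>>>>>     plm_terminate_procs(group, n); /* existing plm mechanism */
>>>>>>> }
>>>>>>> --------------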
>>>>>>>
>>>>>>> This change has no effect on the current default behavior of
>>>>>>> MPI_Abort as experienced by the user.
>>>>>>>
>>>>>>> --
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> http://users.nccs.gov/~jjhursey
>>>>>
>>>>> --
>>>>> Joshua Hursey
>>>>> Postdoctoral Research Associate
>>>>> Oak Ridge National Laboratory
>>>>> http://users.nccs.gov/~jjhursey
>>>>>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey