Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-06-09 16:47:21


If this changes the behavior of MPI_Abort so that it aborts only the processes in the specified communicator, how does it not affect the default user experience (today it aborts everything)?

If we accept that MPI_Abort will abort only the processes in the given communicator, what happens to the other processes in the same MPI_COMM_WORLD (those that are not in the communicator passed to MPI_Abort)? What about all the other connected processes (with connectivity as defined in Section 10.5.4 of the MPI standard)? Do they see this as a fault?
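
To make the question concrete, here is a minimal sketch in plain C. It is a toy model under stated assumptions, not Open MPI or ORTE code: the type abort_split and the helper split_on_abort are hypothetical names invented for illustration, and membership in the communicator is modeled as a simple boolean array indexed by MPI_COMM_WORLD rank. It only separates the ranks that a group-scoped MPI_Abort would target from the ranks it would leave behind.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: it makes no MPI or ORTE calls. */

/* Hypothetical result type: how many world ranks fall on each side. */
typedef struct {
    size_t n_targets;   /* ranks in the group of the comm passed to MPI_Abort */
    size_t n_survivors; /* ranks in MPI_COMM_WORLD outside that group */
} abort_split;

/*
 * Hypothetical helper. in_comm[r] is true when world rank r belongs to
 * the communicator given to MPI_Abort. The targets and survivors arrays
 * must each have room for world_size entries.
 */
abort_split split_on_abort(const bool *in_comm, size_t world_size,
                           int *targets, int *survivors)
{
    abort_split s = {0, 0};
    for (size_t r = 0; r < world_size; r++) {
        if (in_comm[r]) {
            /* Would be asked to terminate under group-scoped abort. */
            targets[s.n_targets++] = (int)r;
        } else {
            /* Left running: the processes the questions above are about. */
            survivors[s.n_survivors++] = (int)r;
        }
    }
    return s;
}
```

With a world of four ranks and a communicator holding world ranks 0 and 2, the targets are ranks 0 and 2, and ranks 1 and 3 end up in the survivors list. Those survivors are exactly the processes whose state is left undefined by the proposal as I read it.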

  george.

On Jun 9, 2011, at 16:32 , Josh Hursey wrote:

> WHAT: Fix missing code in MPI_Abort
>
> WHY: MPI_Abort is missing logic to ask for termination of the process
> group defined by the communicator
>
> WHERE: Mostly orte/mca/errmgr
>
> WHEN: Open MPI trunk
>
> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>
> Details:
> -------------------------------------------
> A bitbucket branch is available here (last sync to r24757 of trunk)
> https://bitbucket.org/jjhursey/ompi-abort/
>
> In the MPI Standard (v2.2) Section 8.7 after the introduction of
> MPI_Abort, it states:
> "This routine makes a best attempt to abort all tasks in the group of comm."
>
> Open MPI currently only calls orte_errmgr.abort() to abort the calling
> process itself. The code to ask for the abort of the other processes
> in the group defined by the communicator is commented out. Since one
> process calling abort currently causes all processes in the job to
> abort, it has not been a big deal. However as the group starts
> exploring better resilience in the OMPI layer (with further support
> from the ORTE layer) this aspect of MPI_Abort will become more
> necessary to get right.
>
> This branch adds back the logic necessary for a single process calling
> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
> processes be aborted. Once the request is sent to the HNP, the local
> process then calls abort on itself. The HNP requests that the defined
> subgroup of processes be terminated using the existing plm mechanisms
> for doing so.
>
> This change has no effect on the current default user-visible
> behavior of MPI_Abort.
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey