Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-06-09 16:32:53

WHAT: Fix missing code in MPI_Abort

WHY: MPI_Abort is missing logic to ask for termination of the process
group defined by the communicator

WHERE: Mostly orte/mca/errmgr

WHEN: Open MPI trunk

TIMEOUT: Tuesday, June 14, 2011 (after teleconf)

A Bitbucket branch is available here (last sync to r24757 of trunk).

Section 8.7 of the MPI Standard (v2.2), after introducing MPI_Abort,
states:
 "This routine makes a best attempt to abort all tasks in the group of comm."

Open MPI currently only calls orte_errmgr.abort() to abort the calling
process itself. The code to ask for the abort of the other processes
in the group defined by the communicator is commented out. Since one
process calling abort currently causes all processes in the job to
abort, it has not been a big deal. However, as the group starts
exploring better resilience in the OMPI layer (with further support
from the ORTE layer) this aspect of MPI_Abort will become more
necessary to get right.

This branch adds back the logic necessary for a single process calling
MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
processes be aborted. Once the request is sent to the HNP, the local
process then calls abort on itself. The HNP requests that the defined
subgroup of processes be terminated using the existing plm mechanisms
for doing so.
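
To make the flow above concrete, here is a small, self-contained C
model of it. This is only a sketch under stated assumptions: it does
not use the ORTE errmgr or plm interfaces, and every name in it
(job_t, hnp_terminate_subgroup, model_abort, JOB_SIZE) is hypothetical
and invented for illustration. It also assumes the request sent to the
HNP excludes the caller, since the caller aborts itself, and it handles
the request synchronously, whereas the real path goes through
messaging.

```c
#include <stddef.h>

#define JOB_SIZE 8  /* hypothetical job size for this model */

/* Hypothetical process states; not ORTE's real state machine. */
typedef enum { PROC_RUNNING, PROC_ABORTED, PROC_TERMINATED } proc_state_t;

typedef struct {
    proc_state_t state[JOB_SIZE];
} job_t;

/* Mark every process in the job as running. */
void job_init(job_t *job)
{
    for (int i = 0; i < JOB_SIZE; i++) {
        job->state[i] = PROC_RUNNING;
    }
}

/* HNP side of the model: terminate each listed rank that is still
 * running. Returns the number of processes terminated. */
int hnp_terminate_subgroup(job_t *job, const int *ranks, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        int r = ranks[i];
        if (r < 0 || r >= JOB_SIZE) {
            continue;  /* ignore ranks outside the job */
        }
        if (job->state[r] == PROC_RUNNING) {
            job->state[r] = PROC_TERMINATED;
            count++;
        }
    }
    return count;
}

/* Caller side of the model: ask the HNP to terminate the other
 * members of the group, then abort the calling process itself.
 * Returns the number of peers the HNP terminated. */
int model_abort(job_t *job, int caller, const int *group, size_t n)
{
    int peers[JOB_SIZE];
    size_t np = 0;

    /* Build the request: every group member except the caller. */
    for (size_t i = 0; i < n && np < JOB_SIZE; i++) {
        if (group[i] != caller) {
            peers[np++] = group[i];
        }
    }

    /* Step 1: the request goes to the HNP (synchronous in this model). */
    int terminated = hnp_terminate_subgroup(job, peers, np);

    /* Step 2: the local process aborts itself. */
    if (caller >= 0 && caller < JOB_SIZE) {
        job->state[caller] = PROC_ABORTED;
    }
    return terminated;
}
```

For example, with a freshly initialized job and a group of ranks 1, 3
and 5, a call to model_abort() from rank 3 leaves ranks 1 and 5
terminated by the HNP, rank 3 aborted by itself, and every rank outside
the group still running.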

This change has no effect on the current default user-experienced
behavior of MPI_Abort.

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory