MPI_Cancel is a tricky beast and should be handled with extreme care. From my perspective, your problem is not related to a specific implementation, but to your usage of MPI_Cancel.
You state that MPI_Wait is not supposed to hang, but I don't see anything in the MPI standard that allows you to state this. If you are referring to the first paragraph of section 3.8 (regarding MPI_Cancel), then I have to disagree with you. You have to pay attention to the wording of the standard to see the trick.
> Either the cancellation succeeds, or the communication succeeds, but not both.
This is the definition of a successful cancellation, and it is the basis of every other action that happens on the request. As MPI_Cancel is only defined as a local operation, an MPI library that sends the matching information for the persistent request in MPI_Start will have a hard time canceling the request.
Now, imagine a run where the receiver manages to cancel its request because it has not been matched yet (and this can be done locally). As the sender sent the matching information in MPI_Start, when it reaches MPI_Cancel it cannot cancel the request locally, so the cancel will fail. The sender will therefore be blocked in MPI_Wait, while the receiver will happily wait in MPI_Finalize.
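To make the scenario above concrete, here is a small toy model in plain C. It is only a sketch of the reasoning, not MPI code and not how Open MPI is implemented: the names toy_request, toy_start, toy_cancel and toy_wait_would_return are hypothetical, and the model assumes a library that ships the matching information for a synchronous send at start time and treats cancel as a purely local operation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical and are
 * not part of MPI or of any MPI implementation. */
typedef enum { REQ_INACTIVE, REQ_ACTIVE, REQ_CANCELLED, REQ_COMPLETE } req_state;

typedef struct {
    req_state state;
    bool match_info_sent; /* matching info already left this process */
    bool matched;         /* the peer has matched this request */
} toy_request;

/* Models MPI_Start. Assumption: a synchronous send ships its matching
 * information immediately, while a receive stays purely local. */
void toy_start(toy_request *r, bool is_sync_send)
{
    r->state = REQ_ACTIVE;
    r->match_info_sent = is_sync_send;
    r->matched = false;
}

/* Models a purely local cancel: it can only succeed if nothing about
 * the request has become visible outside this process. */
bool toy_cancel(toy_request *r)
{
    assert(r->state == REQ_ACTIVE);
    if (r->match_info_sent || r->matched)
        return false;             /* cancel fails, request stays active */
    r->state = REQ_CANCELLED;
    return true;
}

/* Would a wait on this request return without any help from the peer? */
bool toy_wait_would_return(const toy_request *r)
{
    return r->state == REQ_CANCELLED || r->state == REQ_COMPLETE;
}
```

In this model the receive request is cancelled locally, while the synchronous send request stays active and can only complete through a match that will never come, which is the hang described above.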
On Feb 7, 2011, at 04:54 , Tobias Hilbrich wrote:
> Hi all,
> I am with the ZIH developers working on VampirTrace and just discovered a possibly erroneous behavior of OpenMPI (v1.4.3). I am canceling an active persistent request created with MPI_Ssend_init; in a subsequent MPI_Wait call the process hangs, even though according to the MPI standard this should never happen.
> The pseudo code is as follows:
> if (rank == 0)
>     MPI_Ssend_init (&buf, 1, MPI_INT, 1, 666, MPI_COMM_WORLD, &r);
> if (rank == 1)
>     MPI_Recv_init (&buf, 1, MPI_INT, 0, 666, MPI_COMM_WORLD, &r);
> MPI_Start (&r);
> MPI_Cancel (&r);
> MPI_Wait (&r, &status);
> MPI_Request_free (&r);
> The full (minimal reproducer) source code along with a dump of ompi_info is attached.
> Either I am missing some passage of the standard mentioning that it is forbidden to cancel a synchronous send, or there appears to be an error in OpenMPI's implementation. If it is already fixed, sorry for the spam.
> (Note: changing the Ssend to Send or Bsend removes the hang)
> Dipl.-Inf. Tobias Hilbrich
> Wissenschaftlicher Mitarbeiter
> Technische Universitaet Dresden
> Zentrum fuer Informationsdienste und Hochleistungsrechnen (ZIH)
> (Center for Information Services and High Performance Computing (ZIH))
> Interdisziplinäre Anwenderunterstützung und Koordination
> (Interdisciplinary Application Development and Coordination)
> 01062 Dresden
> Tel.: +49 (351) 463-32041
> Fax: +49 (351) 463-37773
> E-Mail: tobias.hilbrich_at_[hidden]