Hi Open MPI developers,
I found a small bug in Open MPI.
See the attached program cancelled.c.
In this program, rank 1 tries to cancel an MPI_Irecv and, if the
cancellation succeeds, calls MPI_Recv instead. The program should terminate
whether the cancellation succeeds or not, but it deadlocks in MPI_Recv after
printing "MPI_Test_cancelled: 1".
I confirmed it works fine with MPICH2.
The problem is in the mca_pml_ob1_recv_request_cancel function in
ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
the request has been completed. I think it should not accept the
cancellation if the request has already been matched. If it wants to accept
the cancellation, it must push the recv frag back onto the unexpected
message queue and redo matching. Furthermore, the receive buffer must be
restored if the received message has already been partially written to it
in a pipeline protocol.
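
To illustrate the reasoning above, here is a minimal sketch in plain C. It
is not the ob1 code and does not use MPI; every type and function name in it
(recv_request, msg_queue, deliver, cancel_unless_completed,
cancel_unless_matched, scenario_terminates) is hypothetical and exists only
for this illustration. It assumes the message arrives and is matched before
the cancel is attempted, and it models the unexpected message queue as a
simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not Open MPI code: lifecycle of one posted receive. */
typedef enum { REQ_POSTED, REQ_MATCHED, REQ_COMPLETED } req_state;

typedef struct {
    req_state state;
    bool cancelled;
} recv_request;

/* Hypothetical stand-in for the unexpected message queue: only a count of
 * messages that are still available for matching. */
typedef struct {
    int unexpected;
} msg_queue;

typedef bool (*cancel_fn)(recv_request *);

/* An incoming message is consumed by a posted, uncancelled request;
 * otherwise it is kept in the unexpected queue. */
static void deliver(msg_queue *q, recv_request *req)
{
    assert(q != NULL && req != NULL);
    if (req->state == REQ_POSTED && !req->cancelled) {
        req->state = REQ_MATCHED;
    } else {
        q->unexpected++;
    }
}

/* Rule described in the report: accept the cancel unless completed. */
static bool cancel_unless_completed(recv_request *req)
{
    assert(req != NULL);
    if (req->state == REQ_COMPLETED) {
        return false;
    }
    req->cancelled = true;
    return true;
}

/* Rule suggested in the report: refuse the cancel once matched. */
static bool cancel_unless_matched(recv_request *req)
{
    assert(req != NULL);
    if (req->state != REQ_POSTED) {
        return false;
    }
    req->cancelled = true;
    return true;
}

/* Models rank 1: the message is matched first, then a cancel is attempted.
 * If the cancel succeeds, a second blocking receive is issued, which can
 * only finish if a message is still available for matching. */
static bool scenario_terminates(cancel_fn cancel)
{
    msg_queue q = { 0 };
    recv_request req = { REQ_POSTED, false };

    deliver(&q, &req);
    if (cancel(&req)) {
        return q.unexpected > 0;   /* the follow-up receive needs a message */
    }
    req.state = REQ_COMPLETED;     /* cancel refused: original receive finishes */
    return true;
}
```

In this model, scenario_terminates(cancel_unless_completed) returns false,
because the matched message is neither delivered nor returned to the queue,
while scenario_terminates(cancel_unless_matched) returns true. The
alternative of pushing the frag back and redoing matching is not modeled
here.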
The attached patch cancel-recv.patch is a sample fix for this bug against
Open MPI trunk. Although the patch has 65 lines, the main modifications are adding one
if-statement and deleting one if-statement. Other lines are just for
I cannot confirm the MEMCHECKER part is correct. Could anyone review it
MPI development team,