Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-07-26 10:46:18


Rich,

There is no matching in this case. Canceling a receive operation is possible only up to the moment the request is matched. Until that point the sequence numbers of the peers are not used, so removing an unmatched request has no impact on the sequence numbers.

  george.
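
For readers who have not looked at the matching code, the following simplified sketch illustrates the point above. It is not the actual ob1 code: the names (incoming_frag_t, next_expected_seq, frag_is_in_order) and the single global counter are illustrative assumptions. The receiver advances its expected sequence number only when it processes an arriving fragment, so removing a posted-but-unmatched receive never creates a gap.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only -- not the Open MPI ob1 data structures. */
typedef struct incoming_frag {
    uint16_t seq;   /* per-peer sequence number carried by the fragment */
    int      tag;
    int      src;
} incoming_frag_t;

static uint16_t next_expected_seq = 0;   /* receiver side, kept per peer */

/* Called only when a fragment arrives from the peer. */
static bool frag_is_in_order(const incoming_frag_t *frag)
{
    if (frag->seq != next_expected_seq) {
        return false;        /* out of order: parked until the gap is filled */
    }
    next_expected_seq++;     /* the counter advances only on this arrival path */
    return true;
}

/* Posting or canceling a receive touches only the posted-receive lists;
 * next_expected_seq is neither read nor written on that path, which is why
 * canceling an unmatched receive cannot stall the matching logic. */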

On Jul 26, 2012, at 16:31 , Richard Graham wrote:

> I do not see any resetting of sequence numbers. It has been a long time since I have looked at the matching code, so I don't know if the out-of-order handling has been taken out. If not, the sequence number has to be dealt with in some manner, or else there will be a gap in the arriving sequence numbers and the matching logic will prevent any further progress.
>
> Rich
>
> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of George Bosilca
> Sent: Thursday, July 26, 2012 10:06 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag
>
> Takahiro,
>
> Indeed we were way too lax on canceling the requests. I modified your patch to correctly deal with the MEMCHECKER macro (removing the call from the branch that requires a completion function). The modified patch is attached below. I will commit ASAP.
>
> Thanks,
> george.
>
>
> Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
> ===================================================================
> --- ompi/mca/pml/ob1/pml_ob1_recvreq.c (revision 26870)
> +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c (working copy)
> @@ -3,7 +3,7 @@
> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
> * University Research and Technology
> * Corporation. All rights reserved.
> - * Copyright (c) 2004-2009 The University of Tennessee and The University
> + * Copyright (c) 2004-2012 The University of Tennessee and The University
> * of Tennessee Research Foundation. All rights
> * reserved.
> * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
> @@ -15,6 +15,7 @@
> * Copyright (c) 2012 NVIDIA Corporation. All rights reserved.
> * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
> * reserved.
> + * Copyright (c) 2012 FUJITSU LIMITED. All rights reserved.
> * $COPYRIGHT$
> *
> * Additional copyrights may follow
> @@ -97,36 +98,26 @@
> mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request;
> mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm;
>
> - if( true == ompi_request->req_complete ) { /* way to late to cancel this one */
> - /*
> - * Receive request completed, make user buffer accessable.
> - */
> - MEMCHECKER(
> - memchecker_call(&opal_memchecker_base_mem_defined,
> - request->req_recv.req_base.req_addr,
> - request->req_recv.req_base.req_count,
> - request->req_recv.req_base.req_datatype);
> - );
> + if( true == request->req_match_received ) { /* way to late to cancel this one */
> + assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not matched isn't it */
> return OMPI_SUCCESS;
> }
>
> /* The rest should be protected behind the match logic lock */
> OPAL_THREAD_LOCK(&comm->matching_lock);
> - if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */
> - if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> - opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
> - } else {
> - mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
> - opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
> - }
> - PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> - &(request->req_recv.req_base), PERUSE_RECV );
> - /**
> - * As now the PML is done with this request we have to force the pml_complete
> - * to true. Otherwise, the request will never be freed.
> - */
> - request->req_recv.req_base.req_pml_complete = true;
> + if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> + opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
> + } else {
> + mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
> + opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
> }
> + PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> + &(request->req_recv.req_base), PERUSE_RECV );
> + /**
> + * As now the PML is done with this request we have to force the pml_complete
> + * to true. Otherwise, the request will never be freed.
> + */
> + request->req_recv.req_base.req_pml_complete = true;
> OPAL_THREAD_UNLOCK(&comm->matching_lock);
>
> OPAL_THREAD_LOCK(&ompi_request_lock);
> @@ -138,7 +129,7 @@
> MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
> OPAL_THREAD_UNLOCK(&ompi_request_lock);
> /*
> - * Receive request cancelled, make user buffer accessable.
> + * Receive request cancelled, make user buffer accessible.
> */
> MEMCHECKER(
> memchecker_call(&opal_memchecker_base_mem_defined,
>
> On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote:
>
>> Hi Open MPI developers,
>>
>> I found a small bug in Open MPI.
>>
>> See attached program cancelled.c.
>> In this program, rank 1 tries to cancel an MPI_Irecv and calls
>> MPI_Recv instead if the cancellation succeeds. This program should
>> terminate whether the cancellation succeeds or not, but it leads to a
>> deadlock in MPI_Recv after printing "MPI_Test_cancelled: 1".
>> I confirmed it works fine with MPICH2.
>>
>> The problem is in the mca_pml_ob1_recv_request_cancel function in
>> ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
>> the request has been completed. I think it should not accept the
>> cancellation if the request has been matched. If it wants to accept the
>> cancellation, it must push the recv frag back onto the unexpected
>> message queue and redo the matching. Furthermore, the receive buffer
>> must be reverted if the received message has already been partially
>> written to it by the pipeline protocol.
>>
>> The attached patch cancel-recv.patch is a sample fix for this bug
>> against the Open MPI trunk. Though the patch has 65 lines, the main
>> modifications are adding one if-statement and deleting another; the
>> other lines are just indentation changes.
>> I cannot confirm that the MEMCHECKER part is correct. Could anyone
>> review it before committing?
>>
>> Regards,
>>
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu
>> <cancelled.c><cancel-recv.patch><License.txt>
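
The cancelled.c attachment itself is not reproduced in this archive. The following is a minimal sketch of a test program along the lines of Takahiro's description above; the buffer contents, the sleep used to give the send time to be matched, and the exact output are assumptions rather than the original code.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0, cancelled = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        sleep(1);                  /* let the incoming message arrive and match */
        MPI_Cancel(&req);
        MPI_Wait(&req, &status);   /* a cancelled request must still be completed */
        MPI_Test_cancelled(&status, &cancelled);
        printf("MPI_Test_cancelled: %d\n", cancelled);
        if (cancelled) {
            /* The receive was reported as cancelled, so the message should still
             * be matchable. With the bug described above, this MPI_Recv hangs
             * because the already-matched message is never delivered again. */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}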