Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] [PATCH 3/4] Trying to get the C/R code to compile again. (recv_*_nb)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-04 10:46:21


On Nov 25, 2013, at 9:59 AM, Adrian Reber <adrian_at_[hidden]> wrote:

> @@ -5616,16 +5597,8 @@ static int do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> /*
> * Recv the ACK msg
> */
> - if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
> - OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) {
> - opal_output(mca_crcp_bkmrk_component.super.output_handle,
> - "crcp:bkmrk: do_send_msg_detail: %s --> %s Failed to receive ACK buffer from peer. Return %d\n",
> - OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> - OMPI_NAME_PRINT(&(peer_ref->proc_name)),
> - ret);
> - exit_status = ret;
> - goto cleanup;
> - }
> + ompi_rte_recv_buffer_nb(&peer_ref->proc_name, OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> + orte_rml_recv_callback, NULL);
>
> UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
> "crcp:bkmrk: send_msg_details: Failed to unpack the ACK from peer buffer.");

I see a bunch of hunks like this that I do not think are correct.

They're replacing orte_rte_recv_buffer() with orte_rte_recv_buffer_nb(), which, by definition, may not actually complete the receive. Hence, the receive buffer may not be filled by the time orte_rte_recv_buffer_nb() returns, and therefore you can't proceed with the FT processing. E.g., the UNPACK_BUFFER() statement here may well be undefined because the buffer isn't full yet.

I'm not sure what this means in terms of overall FT processing -- the fact that UNPACK_BUFFER is erroneous may be fairly obvious, but the other FT processing that occurs below UNPACK_BUFFER, and, indeed, by the caller of this function, may not be able to proceed until this receive completes, either.

Ditto for all the other hunks like this.

If I'm not mistaken, the actual receive will occur in a progress thread, so the cbfunc supplied to orte_rte_recv_buffer_nb() will need to do something in a thread safe manner -- I'm not sure if it will need to transfer control back up to the main thread, or if the FT processing in question is safe to occur in the ORTE progress thread (it would be *much* better if it could occur on the ORTE progress thread so that we can get asynchronous progress of this stuff).

If I'm correct, then all of those hunks will need to be adapted. I.e., these aren't just compile errors to be fixed -- they'll also require architectural changes, too.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/