Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] [PATCH 3/4] Trying to get the C/R code to compile again. (recv_*_nb)
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2013-12-06 09:34:25


Per my other email, I would suggest #ifdef comments instead of nonblocking
replacements for the blocking calls. After that modification, I think this
patch is fine. As was mentioned previously, we will need to go back (after
things compile) and figure out a new model for this behavior.

For the existing calls to recv_buffer_nb, converting those to the versions
without return codes is completely fine and correct. So leave those changes
in there. It is just when you move from recv_buffer to recv_buffer_nb that
I would leave some kind of marker for yourself and preserve the prior code.

After that modification, I think it is ready to be committed.

Thanks!
Josh

On Wed, Dec 4, 2013 at 9:46 AM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]>wrote:

> On Nov 25, 2013, at 9:59 AM, Adrian Reber <adrian_at_[hidden]> wrote:
>
> > @@ -5616,16 +5597,8 @@ static int
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> > /*
> > * Recv the ACK msg
> > */
> > - if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
> > - OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0) ) ) {
> > - opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > - "crcp:bkmrk: do_send_msg_detail: %s --> %s Failed
> to receive ACK buffer from peer. Return %d\n",
> > - OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> > - OMPI_NAME_PRINT(&(peer_ref->proc_name)),
> > - ret);
> > - exit_status = ret;
> > - goto cleanup;
> > - }
> > + ompi_rte_recv_buffer_nb(&peer_ref->proc_name,
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> > + orte_rml_recv_callback, NULL);
> >
> > UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
> > "crcp:bkmrk: send_msg_details: Failed to unpack the
> ACK from peer buffer.");
>
> I see a bunch of hunks like this that I do not think are correct.
>
> They're replacing orte_rte_recv_buffer() with orte_rte_recv_buffer_nb(),
> which, by definition, may not actually complete the receive. Hence, the
> receive buffer may not be filled by the time orte_rte_recv_buffer_nb()
> returns, and therefore you can't proceed with the FT processing. E.g., the
> UNPACK_BUFFER() statement here may well be undefined because the buffer
> isn't full yet.
>
> I'm not sure what this means in terms of overall FT processing -- the fact
> that UNPACK_BUFFER is erroneous may be fairly obvious, but the other FT
> processing that occurs below UNPACK_BUFFER, and, indeed, by the caller of
> this function, may not be able to proceed until this receive completes,
> either.
>
> Ditto for all the other hunks like this.
>
> If I'm not mistaken, the actual receive will occur in a progress thread,
> so the cbfunc supplied to orte_rte_recv_buffer_nb() will need to do
> something in a thread safe manner -- I'm not sure if it will need to
> transfer control back up to the main thread, or if the FT processing in
> question is safe to occur in the ORTE progress thread (it would be *much*
> better if it could occur on the ORTE progress thread so that we can get
> asynchronous progress of this stuff).
>
> If I'm correct, then all of those hunks will need to be adapted. I.e.,
> these aren't just compile errors to be fixed -- they'll also require
> architectural changes, too.
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey