Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Retrying a MPI_SEND
From: Hugo Daniel Meyer (meyer.hugo_at_[hidden])
Date: 2012-01-26 09:10:12


Hello @ll.

I'm reviving this topic because I've done things as you proposed, and I
still can't catch the error mentioned before. Here are some pieces of code
for context.

I've set the error handler:

    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

Then do a send like this:

    ierr = MPI_Send(b, task.msize*task.msize, MPI_DOUBLE, 1, 152,
                    MPI_COMM_WORLD);

    if (ierr != MPI_SUCCESS) {
        printf("ERROR %d \n", ierr);
    }

This send, as I mentioned before, is made by the master at the beginning of
a master/worker application. When the send is made, process 1 resides on
node1, but then node1 fails and process 1 is restarted on node2. Process 1
posts the recv once it is on node2, but at that point the execution stops
without showing an error. I suspect that this kind of failure is not being
detected.

I've noticed that the execution stops in ompi/request/req_wait.c in
ompi_request_wait_completion(req).

    int ompi_request_default_wait(
        ompi_request_t ** req_ptr,
        ompi_status_public_t * status)
    {
        ompi_request_t *req = *req_ptr;
        ompi_request_wait_completion(req);
        .........

The code never returns from that call, and no handler catches the error.
Does anyone know where I can look for a variable or something similar that
is set when an endpoint gets broken?
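
While looking for that variable, one way to keep the sender from hanging
forever is to poll with an upper bound instead of blocking. The sketch below
is only a model of that idea in plain C: bounded_req_t, the progress
callback and the BOUNDED_* codes are hypothetical names, not Open MPI
types. In application code the polling step would roughly correspond to
calling MPI_Test on a nonblocking request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; NOT Open MPI's ompi_request_t. */
typedef struct {
    int complete;    /* set to 1 once the transfer has finished */
    int polls_seen;  /* how many times progress was driven */
} bounded_req_t;

enum { BOUNDED_DONE = 0, BOUNDED_TIMEOUT = 1 };

/* Hypothetical callback that advances communication by one step. */
typedef void (*bounded_progress_fn)(bounded_req_t *req);

/* Drive progress at most max_polls times instead of waiting forever. */
static int bounded_wait(bounded_req_t *req, bounded_progress_fn progress,
                        int max_polls)
{
    assert(req != NULL && progress != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (req->complete)
            return BOUNDED_DONE;
        progress(req);
        req->polls_seen++;
    }
    return req->complete ? BOUNDED_DONE : BOUNDED_TIMEOUT;
}

/* Models a peer that died: the request never completes. */
static void progress_dead_peer(bounded_req_t *req) { (void)req; }

/* Models a healthy peer: the request completes on the third poll. */
static void progress_third_poll(bounded_req_t *req)
{
    if (req->polls_seen >= 2)
        req->complete = 1;
}
```

With progress_dead_peer the call gives up after max_polls and returns
BOUNDED_TIMEOUT, which is the point where a retry or a diagnostic could be
placed.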

Thanks in advance.

Hugo Meyer

2011/12/20 Hugo Daniel Meyer <meyer.hugo_at_[hidden]>

> Sorry for the delay.
> I will try with the MPI_ERRORS_RETURN handler, maybe that is my problem.
> Thanks a lot for your help.
>
> I'll let you know how it goes.
>
> Best regards.
>
> Hugo
>
> 2011/12/16 George Bosilca <bosilca_at_[hidden]>
>
>> Setting the error handler to MPI_ERRORS_RETURN is the right solution for
>> mechanisms using the PMPI interface. Hugo is one software layer below the
>> MPI interface, so the error handler is not affecting his code. However,
>> once he reacts to an error, he should reset the error (in the status
>> attached to the request) to MPI_SUCCESS, in order to avoid triggering the
>> error handler on the way back to the MPI layer.
>>
>> george.
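
The advice above (react to the error, then reset it in the request status)
can be modelled roughly as follows. This is a sketch under assumptions:
fault_req_t only mirrors the req->req_status.MPI_ERROR field named in this
thread, and the repost callback and FAULT_* codes are hypothetical, not
Open MPI code.

```c
#include <assert.h>
#include <stddef.h>

#define FAULT_SUCCESS  0  /* stands in for MPI_SUCCESS */
#define FAULT_ERR_PEER 1  /* hypothetical "peer failed" error code */

/* Hypothetical mirror of the req->req_status.MPI_ERROR field. */
typedef struct {
    struct { int MPI_ERROR; } req_status;
    int retries;  /* how many times this request was reposted */
} fault_req_t;

/* Hypothetical recovery hook: returns 0 if the message was reposted. */
typedef int (*fault_repost_fn)(fault_req_t *req);

/* Runs in the interception layer, below the MPI interface. */
static int fault_intercept(fault_req_t *req, fault_repost_fn repost)
{
    assert(req != NULL && repost != NULL);
    if (req->req_status.MPI_ERROR == FAULT_SUCCESS)
        return FAULT_SUCCESS;  /* nothing to handle */
    if (repost(req) == 0) {
        req->retries++;
        /* Clear the error so the MPI-level handler is not triggered
         * on the way back up. */
        req->req_status.MPI_ERROR = FAULT_SUCCESS;
    }
    return req->req_status.MPI_ERROR;
}

/* Example hooks: one recovery that succeeds, one that fails. */
static int repost_ok(fault_req_t *req)   { (void)req; return 0; }
static int repost_fail(fault_req_t *req) { (void)req; return -1; }
```

If the repost fails, the error is left in place on purpose, so the layer
above still sees it.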
>>
>> On Dec 16, 2011, at 09:09 , Jeff Squyres wrote:
>>
>> > I'm jumping into the middle of this conversation and probably don't
>> have all the right context, so forgive me if this is a stupid question: did
>> you set MPI_ERRORS_RETURN on the communicator in question?
>> >
>> >
>> > On Dec 14, 2011, at 10:43 AM, Hugo Daniel Meyer wrote:
>> >
>> >> Hello George and @ll.
>> >>
>> >> Sorry for the late answer, but I was doing some tracing to see where
>> >> the MPI_ERROR is set. I took a look at ompi_request_default_wait and
>> >> tried to see what happens with the request.
>> >>
>> >> Well, I've noticed that all requests that are not immediately
>> >> completed go to ompi_request_wait_completion. But I don't know exactly
>> >> where the execution jumps when I inject a failure into the receiver of
>> >> the message. After the failure, the sender does not return from
>> >> ompi_request_wait_completion to ompi_request_default_wait, and I don't
>> >> know where to catch the point at which req->req_status.MPI_ERROR is
>> >> set. Do you know where the execution jumps, or at least into which
>> >> error handler?
>> >>
>> >> Thanks in advance.
>> >>
>> >> Hugo
>> >>
>> >> 2011/12/9 George Bosilca <bosilca_at_[hidden]>
>> >>
>> >> On Dec 9, 2011, at 06:59 , Hugo Daniel Meyer wrote:
>> >>
>> >>> Hello George and all.
>> >>>
>> >>> I've been adapting some of the code to copy the request, and now I
>> >>> think that it is working OK. I'm storing the request as you do in the
>> >>> pessimist, but I'm only logging received messages, as my approach is a
>> >>> pessimistic log based on the receiver.
>> >>>
>> >>> I do have a question: how do you detect when you have to resend a
>> >>> message, or at least repost it?
>> >>
>> >> The error in the status attached to the request will be set in case of
>> failure. As the MPI error handler is triggered right before returning above
>> the MPI layer, at the level where you placed your interception you have all
>> the freedom you need to handle the faults.
>> >>
>> >> george.
>> >>
>> >>>
>> >>> Thanks for the help.
>> >>>
>> >>> Hugo
>> >>>
>> >>> 2011/11/19 Hugo Daniel Meyer <meyer.hugo_at_[hidden]>
>> >>>
>> >>>
>> >>> 2011/11/18 George Bosilca <bosilca_at_[hidden]>
>> >>>
>> >>> On Nov 18, 2011, at 11:50 , Hugo Daniel Meyer wrote:
>> >>>
>> >>>>
>> >>>> 2011/11/18 George Bosilca <bosilca_at_[hidden]>
>> >>>>
>> >>>> On Nov 18, 2011, at 11:14 , Hugo Daniel Meyer wrote:
>> >>>>
>> >>>>> 2011/11/18 George Bosilca <bosilca_at_[hidden]>
>> >>>>>
>> >>>>> On Nov 18, 2011, at 07:29 , Hugo Daniel Meyer wrote:
>> >>>>>
>> >>>>>> Hello again.
>> >>>>>>
>> >>>>>> I was tracing through the PML_OB1 files. I started to follow an
>> >>>>>> MPI_Ssend(), trying to find where a message is stored (on the
>> >>>>>> sender) if it is not sent until the receiver posts the recv, but I
>> >>>>>> didn't find that place.
>> >>>>>
>> >>>>> Right, you can't find this as the message is not stored on the
>> sender. The pointer to the send request is sent encapsulated in the
>> matching header, and the receiver will provide it back once the message has
>> been matched (this means the data is now ready to flow).
>> >>>>>
>> >>>>> So, what you're saying is that the sender only sends the header,
>> >>>>> and when the receiver posts the recv it sends the header back so
>> >>>>> that the sender starts sending the data? Am I getting it right? If
>> >>>>> so, the data stays on the sender, but where is it stored?
>> >>>>
>> >>>> If we consider rendez-vous messages, the data remains in the sender
>> >>>> buffer (aka the buffer provided by the upper level to the MPI_Send
>> >>>> function).
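
The rendez-vous exchange described above can be pictured with a small model.
It is a sketch only: every rdv_* name is hypothetical, the "network" is a
plain function call, and the real protocol involves far more than this. What
it shows is that the header carries a pointer to the send request, the
receiver hands that pointer back once matched, and the payload is read
straight from the sender's user buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical send request: only points at the caller's buffer. */
typedef struct {
    const void *user_buf;  /* buffer passed by the upper level */
    size_t len;
    int complete;
} rdv_send_req_t;

/* Hypothetical matching header: carries the request pointer, no data. */
typedef struct {
    int tag;
    size_t len;
    rdv_send_req_t *sender_req;
} rdv_match_hdr_t;

/* Hypothetical acknowledgement returned once the receive is matched. */
typedef struct {
    rdv_send_req_t *sender_req;  /* echoed back to the sender */
    void *recv_buf;
    size_t capacity;
} rdv_ack_t;

/* Sender side: build the header; the payload is NOT copied anywhere. */
static rdv_match_hdr_t rdv_start_send(rdv_send_req_t *req, const void *buf,
                                      size_t len, int tag)
{
    assert(req != NULL && buf != NULL);
    req->user_buf = buf;
    req->len = len;
    req->complete = 0;
    rdv_match_hdr_t hdr = { tag, len, req };
    return hdr;
}

/* Receiver side: the receive is posted, give the request pointer back. */
static rdv_ack_t rdv_match(const rdv_match_hdr_t *hdr, void *recv_buf,
                           size_t capacity)
{
    rdv_ack_t ack = { hdr->sender_req, recv_buf, capacity };
    return ack;
}

/* Sender side again: the data now flows from the user buffer. */
static size_t rdv_transfer(const rdv_ack_t *ack)
{
    rdv_send_req_t *req = ack->sender_req;
    size_t n = req->len < ack->capacity ? req->len : ack->capacity;
    memcpy(ack->recv_buf, req->user_buf, n);
    req->complete = 1;
    return n;
}
```

In this model, a sender that loses its peer between rdv_start_send and
rdv_match still holds everything needed to start over: the user buffer and
the request itself.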
>> >>>>
>> >>>> Yes, so I will only need to save the headers of the messages (where
>> >>>> the status is incomplete), and then maybe just call the upper-level
>> >>>> MPI_Send again. A question here: the headers are not marked as pending
>> >>>> (at least I think so), so my only approach might be to create a list
>> >>>> of pending headers and store the pointer to the send there, then try
>> >>>> to identify its corresponding upper-level MPI_Send and retry it in
>> >>>> case of failure. Is this a correct approach?
>> >>>
>> >>> Look in mca/vprotocol/base to see how we deal with the send requests
>> >>> in our message logging protocol. We hijack the send request list, and
>> >>> replace them with our own, allowing us to chain all active requests.
>> >>> This makes the tracking of active requests very simple, and minimizes
>> >>> the impact on the overall code.
>> >>>
>> >>> george.
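
Chaining all active requests, as suggested above, can be sketched with an
intrusive doubly linked list. This is not the code from mca/vprotocol/base;
chain_req_t and the chain_* helpers are hypothetical names used only to
show why such a chain makes it cheap to find what is still pending after a
failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request node with the list links embedded in it. */
typedef struct chain_req {
    struct chain_req *prev;
    struct chain_req *next;
    int tag;
} chain_req_t;

/* Circular list with a sentinel head, so add/remove need no branches. */
typedef struct {
    chain_req_t head;
    size_t count;
} chain_list_t;

static void chain_init(chain_list_t *list)
{
    list->head.prev = &list->head;
    list->head.next = &list->head;
    list->head.tag = -1;
    list->count = 0;
}

/* Called when a send is started: append the request to the chain. */
static void chain_add(chain_list_t *list, chain_req_t *req)
{
    assert(list != NULL && req != NULL);
    req->prev = list->head.prev;
    req->next = &list->head;
    list->head.prev->next = req;
    list->head.prev = req;
    list->count++;
}

/* Called when a request completes: unlink it in O(1). */
static void chain_remove(chain_list_t *list, chain_req_t *req)
{
    assert(list != NULL && list->count > 0);
    req->prev->next = req->next;
    req->next->prev = req->prev;
    req->prev = req->next = NULL;
    list->count--;
}

/* After a failure: collect the tags of everything still in flight. */
static size_t chain_pending_tags(const chain_list_t *list, int *tags,
                                 size_t cap)
{
    size_t n = 0;
    const chain_req_t *r;
    for (r = list->head.next; r != &list->head && n < cap; r = r->next)
        tags[n++] = r->tag;
    return n;
}
```

Whatever is still on the chain when a fault is detected is the set of sends
that would need to be retried.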
>> >>>
>> >>>
>> >>> Ok George.
>> >>> I will take a look there and then let you know how it goes.
>> >>>
>> >>> Thanks.
>> >>>
>> >>> Hugo
>> >>>
>> >>> _______________________________________________
>> >>> devel mailing list
>> >>> devel_at_[hidden]
>> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >>>
>> >
>> >
>> > --
>> > Jeff Squyres
>> > jsquyres_at_[hidden]
>> > For corporate legal information go to:
>> > http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>>
>>
>
>