Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Retrying a MPI_SEND
From: Hugo Daniel Meyer (meyer.hugo_at_[hidden])
Date: 2011-11-18 10:29:46


Hello again.

I was tracing through the PML_OB1 files. I started following an
MPI_Ssend(), trying to find where a message is stored (in the sender) if it
is not sent until the receiver posts the recv, but I didn't find that place.

I've noticed that the message to be sent enters
mca_pml_ob1_rndv_completion_request (in pml_ob1_sendreq.c), and that rc =
send_request_pml_complete_check(sendreq) returns false when the request
hasn't been completed, but the execution never passes through
MCA_PML_OB1_PROGRESS_PENDING; at least, none of its possible options is
executed.

So, to refocus my question: where is this message stored until delivery?
And is there any way to know that the receiver has gone down? With this
information I will be able to detect the failure of the receiver and
try to resend the message to another place.

Thanks again.

Hugo Meyer

2011/11/17 Hugo Daniel Meyer <meyer.hugo_at_[hidden]>

> Hello @ll.
>
> I'm making some changes in the communication framework. Right now I'm
> working on a "secure" MPI_Send. This send needs to know when an endpoint
> goes down and then retry the communication, constructing a new endpoint or,
> at least, overwriting the data of the old endpoint with the new address of
> the receiver process. Overwriting the data of the endpoint is not a problem
> anymore, because I've done that before.
>
> For example, consider a Master/Worker application where the master
> sends data to the workers and the workers start the computation. The
> master posts a send to worker1, which fails and gets restarted on another
> node; in its new location, worker1 posts the recv matching the master's
> send. The problem here is that the master posted the send when the process
> was residing on one node, but the process expects the message on another
> node. I need the sender to realize that the process is now on another node
> and retry the communication with a modified endpoint. Could anyone
> please tell me where in the send code I can obtain the status of a message
> that hasn't been sent and resend it to a new location? I also want to know
> where I can obtain information about an endpoint failure.
>
> Thanks in advance.
>
> Hugo
>
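
To make the retry idea from the quoted message concrete (detect that the
endpoint is stale, overwrite its address, resend), here is a minimal sketch in
plain C. It does not use any Open MPI code: endpoint_t, world_t, try_send,
refresh_endpoint and send_with_retry are all hypothetical names invented for
this illustration, and the "network" is simulated by a string comparison. It
only shows the control flow a sender would need, not where ob1 keeps the
pending message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a transport endpoint; NOT an Open MPI type. */
typedef struct {
    char address[32];       /* node where the sender believes the peer lives */
} endpoint_t;

/* Hypothetical stand-in for the cluster: the node the receiver really runs on. */
typedef struct {
    char live_address[32];
} world_t;

/* Simulated transport: delivery succeeds only if the endpoint is up to date. */
static bool try_send(const world_t *w, const endpoint_t *ep,
                     const char *msg, size_t len)
{
    (void)msg;              /* payload is irrelevant to the simulation */
    (void)len;
    return strcmp(w->live_address, ep->address) == 0;
}

/* Assumed lookup of the receiver's new location (e.g. from the runtime). */
static void refresh_endpoint(const world_t *w, endpoint_t *ep)
{
    strncpy(ep->address, w->live_address, sizeof ep->address - 1);
    ep->address[sizeof ep->address - 1] = '\0';
}

/*
 * Try to deliver msg, refreshing the endpoint after each failed attempt.
 * Returns the number of attempts used, or -1 if every attempt failed.
 * The caller keeps ownership of msg, so it stays available for resending.
 */
static int send_with_retry(const world_t *w, endpoint_t *ep,
                           const char *msg, size_t len, int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_send(w, ep, msg, len))
            return attempt;
        refresh_endpoint(w, ep);    /* overwrite the stale address, then retry */
    }
    return -1;
}
```

In this sketch the sender keeps the user buffer until the send is confirmed, so
a resend needs no extra copy; whether that matches what ob1 does for a
synchronous send is exactly the open question in this thread.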