
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Retrying a MPI_SEND
From: Hugo Daniel Meyer (meyer.hugo_at_[hidden])
Date: 2011-11-18 10:29:46

Hello again.

I have been tracing through the PML_OB1 files. I followed an
MPI_Ssend() call to find where a message is stored (on the sender) when it
has not been sent because the receiver has not yet posted the recv, but I
could not find that place.

I noticed that the message to be sent enters
mca_pml_ob1_rndv_completion_request() (pml_ob1_sendreq.c), and that rc =
send_request_pml_complete_check(sendreq) returns false when the request
has not been completed, but the execution never passes through
MCA_PML_OB1_PROGRESS_PENDING; at least, none of the possible options is

So, to rephrase my question: where is this message stored until delivery,
and is there any way to know that the receiver has gone down? With this
information I will be able to detect the failure of the receiver and
try to resend the message to another location.

Thanks again.

Hugo Meyer

2011/11/17 Hugo Daniel Meyer <meyer.hugo_at_[hidden]>

> Hello @ll.
> I'm making some changes to the communication framework. Right now I'm
> working on a "secure" MPI_Send. This send needs to know when an endpoint
> goes down, and then retry the communication by constructing a new endpoint,
> or at least by overwriting the data of the old endpoint with the new address
> of the receiver process. Overwriting the data of the endpoint is not a
> problem anymore, because I've done that before.
> For example, consider a Master/Worker application, where the master
> sends data to the workers and the workers start the computation. The
> master posts a send to worker1, which fails and gets restarted on another
> node, and at its new location worker1 posts the recv for the master's
> send. The problem here is that the master posted the send when the process
> was residing on one node, but the process expects the message on another
> node. I need the sender to realize that the process is now on another node
> and retry the communication with a modified endpoint. Could anyone
> please tell me where in the send code I can obtain the status of a message
> that hasn't been sent and resend it to a new location? I would also like to
> know where I can obtain information about an endpoint failure.
> Thanks in advance.
> Hugo
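
To make the Master/Worker scenario in the quoted message concrete, here is a
minimal, self-contained sketch of the retry logic in plain C. It is only a
toy model and does not use Open MPI internals: node_t, endpoint_t,
send_request_t, try_send(), lookup_new_location() and send_with_retry() are
all hypothetical names invented for illustration. The model assumes that the
payload stays in the sender's own buffer until the send completes, and that
some outside mechanism can report where the restarted worker now lives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these types or functions exist in Open MPI. */
#define MAX_NODES 4
#define MSG_CAP   64

typedef struct {
    bool alive;           /* is this node reachable?                    */
    bool recv_posted;     /* has the worker posted its recv here?       */
    char inbox[MSG_CAP];  /* where a delivered payload ends up          */
} node_t;

typedef struct {
    int node;             /* current address of the peer process        */
} endpoint_t;

typedef struct {
    const char *buf;      /* payload is kept in the sender's buffer     */
    size_t      len;      /* payload length in bytes                    */
    bool        completed;
} send_request_t;

typedef enum { SEND_DONE, SEND_PENDING, SEND_PEER_DOWN } send_status_t;

/* One delivery attempt to the node the endpoint currently points at. */
static send_status_t try_send(node_t *nodes, const endpoint_t *ep,
                              send_request_t *req)
{
    assert(ep->node >= 0 && ep->node < MAX_NODES);
    assert(req->len < MSG_CAP);

    node_t *n = &nodes[ep->node];
    if (!n->alive)
        return SEND_PEER_DOWN;   /* endpoint failure detected            */
    if (!n->recv_posted)
        return SEND_PENDING;     /* synchronous send: no matching recv   */

    memcpy(n->inbox, req->buf, req->len);
    n->inbox[req->len] = '\0';
    req->completed = true;
    return SEND_DONE;
}

/* Assumption: something (e.g. a fault-tolerance layer) knows where the
 * restarted process lives. Here we simply pick the first live node. */
static int lookup_new_location(const node_t *nodes)
{
    for (int i = 0; i < MAX_NODES; ++i)
        if (nodes[i].alive)
            return i;
    return -1;
}

/* Retry the send, overwriting the endpoint address when the peer is down.
 * Returns the number of attempts used, or -1 if the send never completed. */
static int send_with_retry(node_t *nodes, endpoint_t *ep,
                           send_request_t *req, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        send_status_t st = try_send(nodes, ep, req);
        if (st == SEND_DONE)
            return attempt;
        if (st == SEND_PEER_DOWN) {
            int where = lookup_new_location(nodes);
            if (where < 0)
                return -1;       /* nowhere left to resend to            */
            ep->node = where;    /* overwrite the old endpoint address   */
        }
        /* SEND_PENDING: a real implementation would wait and progress;
         * this sketch just tries again on the next loop iteration. */
    }
    return -1;
}
```

With worker1 originally on node 0 (now down) and restarted on node 1 with its
recv posted, send_with_retry() fails once on the old endpoint, rewrites the
endpoint to node 1, and completes on the second attempt. Because the request
only holds a pointer to the sender's buffer, nothing has to be copied or
re-created in order to resend.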