On Nov 18, 2011, at 07:29 , Hugo Daniel Meyer wrote:
I was doing some trace into de PML_OB1 files. I start to follow a MPI_Ssend() trying to find where a message is stored (in the sender) if it is not send until the receiver post the recv, but i didn't find that place.
Right, you can't find this as the message is not stored on the sender. The pointer to the send request is sent encapsulated in the matching header, and the receiver will provide it back once the message has been matched (this means the data is now ready to flow).
I've noticed that the message to be sent enters in mca_pml_ob1_rndv_completion_request(pml_ob1_sendreq.c) and the rc = send_request_pml_complete_check(sendreq) returns false when the request hasn't been completed, but the execution never passes through MCA_PML_OB1_PROGRESS_PENDING, at least, none of the possible options is executed.
So, re-orienting my question: where is stored this message until delivery? and if there any way to know that the receiver goes down? With this information i will be able to detect the failure of the receiver and will try to resend the message to another place.
If you want to track the send requests, you will have to implement your own way of tracking them, as we do not expose this in our PML. Eventually, writing your own PML, might be necessary.
However, as a user I would find very disturbing that the MPI runtime decide to send the message to another peer on my behalf. I would rather prefer that the MPI_Send returns some kind of error, that allows the upper level algorithm to repost the send to another peer. Look at the proposals in the MPI Forum to get more information about what it is discussed regarding the MPI resilience.
2011/11/17 Hugo Daniel Meyer <firstname.lastname@example.org>
I'm doing some changes in the communication framework. Right now i'm working on a "secure" MPI_Send, this send needs to know when an endpoint goes down, and then retry the communication constructing a new endpoint, or at least, overwriting the data of the old endpoint with the new address of the receiver process. Overwriting the data of the endpoint is not a problem anymore, because i've done that before.
For example, if we consider a Master/Worker application, where the master sends data to the workers, and workers start the computation, then, the master posts a send to the worker1 that fails and get restarted in another node and in his new location the worker1 posts the recv to the master's send. The problem here is that the master post the send when the process was residing in one node, but the process expects the message in another node. I need the sender to realize that the process is now in another node, and retries the communication with a modificated endpoint. Anyone could please tell me where in the send code i can obtain the status of a message that hasn't been send and resend it to a new location. Also i want to know, where can i obtain information about an endpoint fail?.
Thanks in advance.
devel mailing list