I'm doing some changes in the communication framework. Right now i'm working on a "secure" MPI_Send, this send needs to know when an endpoint goes down, and then retry the communication constructing a new endpoint, or at least, overwriting the data of the old endpoint with the new address of the receiver process. Overwriting the data of the endpoint is not a problem anymore, because i've done that before.
For example, if we consider a Master/Worker application, where the master sends data to the workers, and workers start the computation, then, the master posts a send to the worker1 that fails and get restarted in another node and in his new location the worker1 posts the recv to the master's send. The problem here is that the master post the send when the process was residing in one node, but the process expects the message in another node. I need the sender to realize that the process is now in another node, and retries the communication with a modificated endpoint. Anyone could please tell me where in the send code i can obtain the status of a message that hasn't been send and resend it to a new location. Also i want to know, where can i obtain information about an endpoint fail?.
Thanks in advance.