I'm making some changes to the communication framework. Right now I'm
working on a "secure" MPI_Send. This send needs to know when an endpoint
goes down and then retry the communication, either by constructing a new
endpoint or, at least, by overwriting the data of the old endpoint with the
new address of the receiver process. Overwriting the endpoint data is not a
problem, because I have already done that.
For example, consider a Master/Worker application in which the master
sends data to the workers and the workers then start the computation. The
master posts a send to worker1, which fails and gets restarted on another
node; in its new location, worker1 posts the recv that matches the master's
send. The problem is that the master posted the send while the process was
residing on one node, but the process now expects the message on another
node. I need the sender to realize that the process is now on another node
and retry the communication with a modified endpoint. Could anyone please
tell me where in the send code I can obtain the status of a message that has
not been sent yet, so that I can resend it to a new location? I would also
like to know where I can obtain information about an endpoint failure.
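
To make the behaviour I am after more concrete, here is a minimal sketch in
plain C. It only simulates the idea and does not touch the real send path:
endpoint_t, current_node, migrate_rank, try_send and send_with_retry are all
hypothetical names I made up for illustration, not functions or types from
the MPI library. The "directory" that maps a rank to its current node is
also an assumption; in the real code this information would have to come
from wherever the restart is recorded.

```c
#include <assert.h>

#define MAX_RANKS 4

/* Hypothetical stand-in for a transport endpoint (not a real library type). */
typedef struct {
    int rank;  /* peer rank this endpoint talks to */
    int node;  /* node the sender believes the peer is running on */
} endpoint_t;

/* Hypothetical directory: rank -> node where that rank currently runs. */
static int current_node[MAX_RANKS] = {0, 1, 2, 3};

/* Simulate a process failing and being restarted on another node. */
static void migrate_rank(int rank, int new_node)
{
    assert(rank >= 0 && rank < MAX_RANKS);
    current_node[rank] = new_node;
}

/* Simulated low-level send: it succeeds (returns 0) only when the endpoint
 * points at the node where the peer really is, and returns -1 otherwise. */
static int try_send(const endpoint_t *ep, const void *buf, int len)
{
    (void)buf;
    (void)len;
    return (ep->node == current_node[ep->rank]) ? 0 : -1;
}

/* Try to send; on failure overwrite the endpoint with the peer's current
 * node and try again. Returns the number of retries that were needed, or
 * -1 if the message could not be delivered within max_retries retries. */
static int send_with_retry(endpoint_t *ep, const void *buf, int len,
                           int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        if (try_send(ep, buf, len) == 0) {
            return attempt;
        }
        /* Endpoint looks stale: refresh it in place before the next try. */
        ep->node = current_node[ep->rank];
    }
    return -1;
}
```

In this sketch the master would create an endpoint for worker1, call
send_with_retry, and after a migrate_rank(1, 3) the next call would fail
once, rewrite the endpoint and then succeed. What I am missing is the real
equivalent of the failure return in try_send and of the lookup in
current_node.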
Thanks in advance.