I need some help re-establishing communication with a process that I restore on a different node. My situation is as follows:
The process fails and is successfully restored on another node from a previous checkpoint that I sent there. Now, when another process tries to send a message to the restored process, the send fails, or at least blocks in ompi_request_wait_completion.
So, when this happens, the sender has to send a message to its own daemon, which will have the URI of the place where the process has been restored. The daemon answers the sending process with this URI, and the sender updates its contact information for the restored process.
So, I need to know where in the code I can intercept this send attempt and then send the message to the sender's daemon, and where and how I can update this information so that the message goes to the right place (same rank but new URI).
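
To make the idea concrete, here is a rough sketch in plain C of the flow I have in mind. It is not Open MPI code: contact_table_t, daemon_t, try_send, daemon_lookup and send_with_refresh are all hypothetical names made up for illustration, and the URIs are invented. It also assumes the failed send can be detected at all (for example by a timeout), whereas today the sender just blocks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 4
#define URI_LEN   64

/* Hypothetical per-process table: rank -> URI the sender currently believes. */
typedef struct {
    char uri[MAX_PROCS][URI_LEN];
} contact_table_t;

/* Hypothetical stand-in for the local daemon, which knows where each
 * rank really lives after a restart. */
typedef struct {
    char uri[MAX_PROCS][URI_LEN];
} daemon_t;

/* Copy a URI into a fixed-size slot, always NUL-terminated. */
static void set_uri(char *dst, const char *src)
{
    snprintf(dst, URI_LEN, "%s", src);
}

/* Stand-in for the transport: delivery succeeds (0) only when the URI
 * used by the sender matches the real location of the rank. */
static int try_send(const daemon_t *truth, int rank, const char *uri)
{
    return strcmp(truth->uri[rank], uri) == 0 ? 0 : -1;
}

/* Point-to-point query to the daemon for the current URI of one rank. */
static const char *daemon_lookup(const daemon_t *d, int rank)
{
    return d->uri[rank];
}

/* Send with a single retry: if the first attempt fails, refresh the
 * stale entry from the daemon and try again. Returns the number of
 * attempts used (1 or 2), or -1 if the retry also fails. */
static int send_with_refresh(contact_table_t *t, const daemon_t *d, int rank)
{
    assert(rank >= 0 && rank < MAX_PROCS);

    if (try_send(d, rank, t->uri[rank]) == 0) {
        return 1;
    }

    /* Same rank, new URI: only this one table entry changes. */
    set_uri(t->uri[rank], daemon_lookup(d, rank));

    if (try_send(d, rank, t->uri[rank]) == 0) {
        return 2;
    }
    return -1;
}
```

Only the sender that actually hits the stale entry talks to its daemon, so no other process has to take part in the update.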
I have to do it this way to avoid a collective communication.
If you could give me a hand with this, that would be great.