Are you talking about an MPI communication? If so, then you need to update every proc's modex info for the proc that moved - this is something stored in each MPI proc's memory, so it isn't something that you can just get from the daemon on-demand. You'll have to provide the update to every single proc directly so that it has the info if/when it should decide to send an MPI message to the proc that moved.
This is why we do a modex upon restart - sending the change to every MPI proc individually is hardly scalable without a collective operation.
See the modex database interface in orte/mca/grpcomm/base/grpcomm_base_modex.c. You'll have to create new code to send/recv an update message, but the code to update the database entry exists.
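To make the idea concrete, here is a rough sketch of the pattern: every proc holds its own rank-to-contact table, and an update message overwrites a single entry. This is not ORTE code. The names peer_table_t, peer_update_msg_t, and the peer_table_* functions are hypothetical, as is the URI format, and the real modex database stores more than one string per proc. The send/recv path for the message is the part you would still have to write.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PEERS 64
#define URI_LEN   128

/* Hypothetical stand-in for per-process modex data: one contact URI per rank. */
typedef struct {
    char uri[MAX_PEERS][URI_LEN];
    int  num_peers;
} peer_table_t;

/* Hypothetical update message: "rank R now lives at new_uri". */
typedef struct {
    int  rank;
    char new_uri[URI_LEN];
} peer_update_msg_t;

/* Start with an empty URI for every known rank. */
void peer_table_init(peer_table_t *t, int num_peers)
{
    assert(t != NULL && num_peers >= 0 && num_peers <= MAX_PEERS);
    t->num_peers = num_peers;
    for (int i = 0; i < num_peers; i++) {
        t->uri[i][0] = '\0';
    }
}

/* Store the URI for one rank; returns 0 on success, -1 on a bad rank or an oversized URI. */
int peer_table_set(peer_table_t *t, int rank, const char *uri)
{
    if (rank < 0 || rank >= t->num_peers) {
        return -1;
    }
    if (strlen(uri) >= URI_LEN) {
        return -1;
    }
    snprintf(t->uri[rank], URI_LEN, "%s", uri);
    return 0;
}

/* Apply a received update: overwrite the entry of the proc that moved. */
int peer_table_apply_update(peer_table_t *t, const peer_update_msg_t *msg)
{
    return peer_table_set(t, msg->rank, msg->new_uri);
}

/* Return the stored URI for a rank, or NULL if the rank is out of range. */
const char *peer_table_lookup(const peer_table_t *t, int rank)
{
    if (rank < 0 || rank >= t->num_peers) {
        return NULL;
    }
    return t->uri[rank];
}
```

Each proc would call something like peer_table_apply_update() from whatever receive callback you register for the update message, so the new location is already in place before the next send to that rank.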
On Jun 2, 2011, at 7:52 AM, Hugo Meyer wrote:
My actual problem is that I don't know where the struct is that holds the information used to send messages to the procs.
What I need is to update it when I move a process from its original node. Is there something like this?
Thanks a lot.
2011/5/31 Hugo Meyer <email@example.com>
I need some help restarting communication with a process that I restore on a different node. My situation is as follows:
The process fails and is successfully restored on another node from a previous checkpoint that I sent there. Now, when a process tries to send a message to this restored process, it will fail, or at least it will hang in ompi_request_wait_completion.
So, when this happens, the sender has to send a message to its own daemon, which will have the URI of where the process has been restored. The daemon answers the proc with this URI, and the proc updates its info.
So I need to know where in the code I can capture this attempt to send and then send the message to the sender's daemon, and where and how I can update this info so that the message goes to the right place (same rank, but new URI).
I have to do it in this way to avoid a collective communication.
If you could give me a hand with this, it would be great.