On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
2011/3/24 Ralph Castain <firstname.lastname@example.org>
You really don't want to do it that way - you'll create major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
I have not looked at that yet, but I will do so right now.
The ability to relocate a failed child process is already in the trunk - it only needs to be turned on with the --enable-recovery flag at runtime if you don't need the checkpoint/restart support. If you do need C/R, you can use that too (it just requires some configure flags).
About this: I do need C/R support, because what I'm trying to do is restart a process on another node (as a child of the orted residing there) from a previous checkpoint. I will take a look at the relocation feature that you mention and try to use it.
From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the required info to all local child processes so they can take appropriate action.
Failure detection, re-launch, etc. have all been taken care of for you.
At the least, the cited code should provide guidance on how to correctly restart procs if you need your own errmgr module for other reasons.
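For illustration only, steps (a) and (b) described above could be sketched roughly as follows. This is not ORTE code: proc_entry_t, is_relocated, notify_fn and notify_local_children are hypothetical names made up for this sketch, and the relocation test (the proc ran before and its node changed) is an assumption. The real check lives in the construct_child_list code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one process entry in a launch message.
 * None of these names are ORTE types. */
typedef struct {
    int rank;      /* rank of the process within the job */
    int old_node;  /* node it ran on before; -1 if never launched */
    int new_node;  /* node it is mapped to in this launch message */
} proc_entry_t;

/* (a) Assumed test: a proc is being relocated if it ran before
 * and its node has changed. */
static bool is_relocated(const proc_entry_t *p)
{
    return p->old_node >= 0 && p->old_node != p->new_node;
}

/* Hypothetical callback that delivers one update to one local child. */
typedef void (*notify_fn)(int local_child_rank,
                          const proc_entry_t *moved, void *ctx);

/* (b) For every relocated proc in the message, notify each local child
 * other than the relocated proc itself.
 * Returns the number of notifications sent. */
size_t notify_local_children(const proc_entry_t *procs, size_t nprocs,
                             const int *local_children, size_t nlocal,
                             notify_fn notify, void *ctx)
{
    size_t sent = 0;

    assert(notify != NULL);
    for (size_t i = 0; i < nprocs; i++) {
        if (!is_relocated(&procs[i])) {
            continue;  /* new or unmoved proc: nothing to announce */
        }
        for (size_t j = 0; j < nlocal; j++) {
            if (local_children[j] == procs[i].rank) {
                continue;  /* do not tell a proc about its own move */
            }
            notify(local_children[j], &procs[i], ctx);
            sent++;
        }
    }
    return sent;
}
```

In the real daemon the callback would presumably send a message to the child rather than call a function pointer; the sketch only shows the filtering and fan-out.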
Thanks again, Ralph; you have been very helpful.
devel mailing list