On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain <rhc_at_[hidden]>
> You really don't want to do it that way - you'll create major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
> I did not look at that, but I will do it right now.
> The ability to relocate a failed child process is already in the trunk - it only needs to be turned on with the --enable-recovery flag at runtime if you don't need the checkpoint/restart support. If you do need C/R, you can use that too (just requires some configure flags).
> About this, I do need C/R support, because what I'm trying to do is restart a process on another node (as a child of the orted residing there) from a previous checkpoint. I will take a look at the relocation feature you mention and try to use it.
From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the required info to all local child processes so they can take appropriate action.
Failure detection, re-launch, etc. have all been taken care of for you.
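
As a rough illustration of steps (a) and (b), here is a minimal, self-contained sketch. It does not use the real ORTE data structures or messaging calls: launch_entry_t, relocation_notice_t, collect_relocations, notify_local_children and log_delivery are all hypothetical names made up for this example, and the "relocated" flag merely stands in for whatever check construct_child_list actually performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one process entry in a launch message.
 * These are NOT the real ORTE types. */
typedef struct {
    int  rank;       /* rank of the process named in the launch message */
    int  node;       /* node the process is (now) assigned to */
    bool relocated;  /* true if this entry restarts a failed process elsewhere */
} launch_entry_t;

/* Hypothetical payload sent to each local child. */
typedef struct {
    int rank;        /* rank that moved */
    int node;        /* node it moved to */
} relocation_notice_t;

/* Hypothetical delivery callback; a real daemon would use its own messaging layer. */
typedef void (*deliver_fn)(int child_rank, const relocation_notice_t *notice, void *ctx);

/* Step (a): scan the launch message and collect the entries being relocated.
 * Returns the number of notices written to out (at most out_cap). */
size_t collect_relocations(const launch_entry_t *entries, size_t n,
                           relocation_notice_t *out, size_t out_cap)
{
    assert(entries != NULL || n == 0);
    assert(out != NULL || out_cap == 0);

    size_t count = 0;
    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (entries[i].relocated) {
            out[count].rank = entries[i].rank;
            out[count].node = entries[i].node;
            count++;
        }
    }
    return count;
}

/* Step (b): hand every notice to every local child.
 * Returns the number of deliveries attempted. */
size_t notify_local_children(const int *local_children, size_t n_children,
                             const relocation_notice_t *notices, size_t n_notices,
                             deliver_fn deliver, void *ctx)
{
    assert(deliver != NULL);

    size_t sent = 0;
    for (size_t c = 0; c < n_children; c++) {
        for (size_t k = 0; k < n_notices; k++) {
            deliver(local_children[c], &notices[k], ctx);
            sent++;
        }
    }
    return sent;
}

/* Example callback: print the notice and bump a counter passed through ctx. */
void log_delivery(int child_rank, const relocation_notice_t *notice, void *ctx)
{
    printf("child %d: rank %d is now on node %d\n",
           child_rank, notice->rank, notice->node);
    if (ctx != NULL) {
        (*(size_t *)ctx)++;
    }
}
```

To try it, fill an array of launch_entry_t, call collect_relocations on it, and pass the resulting notices to notify_local_children together with the ranks of the local children and log_delivery as the callback. The callback indirection is only there so the sketch stays independent of any particular transport.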
> At the least, the cited code should provide guidance on how to correctly restart procs if you need your own errmgr module for other reasons.
> Again thanks Ralph, you have been very helpful.
> Best regards.
> Hugo Meyer