On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain <rhc_at_[hidden]>
> You really don't want to do it that way - you'll create major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
> I did not look at that, but I will do it right now.
> The ability to relocate a failed child process is already in the trunk - it only requires turning "on" with an --enable-recovery flag at runtime if you don't need the checkpoint/restart support. If you do need C/R, you can use that too (just requires some configure flags).
> About this: I do need C/R support, because what I'm trying to do is restart a process on another node (as a child of the orted residing there) from a previous checkpoint. I will take a look at the relocation feature you mention and try to use it.
From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the required info to all local child processes so they can take appropriate action.
Failure detection, re-launch, etc. have all been taken care of for you.
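As a rough illustration of steps (a) and (b), here is a minimal, self-contained C sketch of the control flow only. It does not use the real ORTE data structures or messaging calls: launch_entry_t, local_child_t, notify_child and announce_relocations are all hypothetical names made up for this example, and the "send" is simulated by writing into the child's record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one process entry in a launch message.
 * This is NOT an ORTE type. */
typedef struct {
    int  rank;       /* rank of the process within the job */
    int  node;       /* node the launch message assigns it to */
    bool relocated;  /* true if this entry restarts a proc on a new node */
} launch_entry_t;

/* Hypothetical stand-in for a child process managed by the local daemon. */
typedef struct {
    int rank;             /* rank of this local child */
    int notices;          /* number of relocation notices received */
    int last_moved_rank;  /* rank named in the most recent notice */
    int last_new_node;    /* new node named in the most recent notice */
} local_child_t;

/* Step (b): deliver the relocation info to one local child.
 * A real daemon would send a message here; the sketch just records it. */
static void notify_child(local_child_t *child, const launch_entry_t *moved)
{
    child->notices++;
    child->last_moved_rank = moved->rank;
    child->last_new_node   = moved->node;
}

/* Step (a): scan the launch message for relocated procs, then
 * step (b): tell every other local child about each one.
 * Returns the number of notices sent. */
size_t announce_relocations(const launch_entry_t *entries, size_t n_entries,
                            local_child_t *children, size_t n_children)
{
    size_t sent = 0;

    assert(entries != NULL || n_entries == 0);
    assert(children != NULL || n_children == 0);

    for (size_t i = 0; i < n_entries; i++) {
        if (!entries[i].relocated) {
            continue;  /* ordinary launch entry, nothing to announce */
        }
        for (size_t j = 0; j < n_children; j++) {
            if (children[j].rank == entries[i].rank) {
                continue;  /* the restarted proc does not notify itself */
            }
            notify_child(&children[j], &entries[i]);
            sent++;
        }
    }
    return sent;
}
```

In the actual code base the relocation check would come from what construct_child_list already does, and the notification would go over whatever channel the local children listen on; the sketch only shows where the two steps sit relative to each other.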
> At the least, the cited code should provide guidance on how to correctly restart procs if you need your own errmgr module for other reasons.
> Again thanks Ralph, you have been very helpful.
> Best regards.
> Hugo Meyer