Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Add child to another parent.
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-03-24 15:20:16

You really don't want to do it that way: you'll create major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?

The ability to relocate a failed child process is already in the trunk. If you don't need the checkpoint/restart support, you only have to turn it on with the --enable-recovery flag at runtime. If you do need C/R, you can use that too (it just requires some configure flags).

At the least, the cited code should provide guidance on how to correctly restart procs if you need your own errmgr module for other reasons.
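
For illustration only, and not the code in errmgr_hnp.c: the lookup in the quoted message below uses jobdat after the loop whether or not a matching job was found. A minimal, self-contained sketch of the same lookup with an explicit not-found guard might look like the following. The demo_* types and functions are hypothetical stand-ins invented for this example; they are not the real ORTE structures or list API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; NOT the real ORTE types. */
typedef struct {
    const char *name;            /* placeholder for an app context */
} demo_app_t;

typedef struct demo_job {
    uint32_t jobid;              /* job identifier to match against */
    demo_app_t **apps;           /* array of app contexts for this job */
    size_t num_apps;             /* number of entries in apps */
    struct demo_job *next;       /* next job record, or NULL at the end */
} demo_job_t;

/* Return the job record with the given jobid, or NULL if none matches. */
demo_job_t *demo_find_job(demo_job_t *head, uint32_t jobid)
{
    for (demo_job_t *job = head; job != NULL; job = job->next) {
        if (job->jobid == jobid) {
            return job;
        }
    }
    return NULL;
}

/* Return the app context for (jobid, app_idx), or NULL when the job is
 * unknown or the index is out of range. */
demo_app_t *demo_find_app(demo_job_t *head, uint32_t jobid, size_t app_idx)
{
    demo_job_t *job = demo_find_job(head, jobid);
    if (job == NULL || app_idx >= job->num_apps) {
        return NULL;
    }
    return job->apps[app_idx];
}
```

The point of the sketch is only that the caller gets NULL back, rather than a stale or unrelated record, when the child's job is not known locally.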

On Mar 24, 2011, at 7:56 AM, Hugo Meyer wrote:

> Hello all.
> I'm trying to restart a child process that has failed. Right now I catch the failed child in the errmgr, then pack the child and send it to another node, which has to "adopt" it. Is there any way to do this with the current implementation, something like add_child? Because then I will have to do something like:
> opal_list_item_t *item;
> orte_odls_job_t *jobdat;
> orte_app_context_t *app;
> for (item = opal_list_get_first(&orte_local_jobdata);
>      item != opal_list_get_end(&orte_local_jobdata);
>      item = opal_list_get_next(item)) {
>     jobdat = (orte_odls_job_t*)item;
>     if (jobdat->jobid == child->name->jobid) {
>         break;
>     }
> }
> app = jobdat->apps[child->app_idx];
> In order to do this, I need to have the child in the jobdat. If nothing like this is implemented, could someone advise me on how to do it?
> Best Regards.
> Hugo Meyer
> _______________________________________________
> devel mailing list
> devel_at_[hidden]