> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> required info to all local child processes so they can take appropriate
> action. Failure detection, re-launch, etc. have all been taken care of for you.
I looked at the code you mentioned, and I realize that I have two possible
options, which I would like to share with you to get your opinion.
First of all, let me describe the current state of my implementation. I am
working on a fault-tolerant system that uses uncoordinated checkpointing, so
I take checkpoints of all my processes at different times and store them on
the machine where each process resides. I also send these checkpoints to
another node (let's call it the protector), so if a node fails, its processes
should be restarted on the protector, which holds their checkpoints.
Right now I can detect the failure of a process, I know where this process
should be restarted, and I have its checkpoint on the protector. I also have
the child information, of course.
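
To make the setup concrete, here is a minimal sketch of the bookkeeping
described above. It is only an illustration and does not use any ORTE code:
the names (ft_proc_t, protector_of, take_checkpoint, relocate_on_failure) are
hypothetical, and the ring-style rule for choosing a protector is an
assumption made for the example.

```c
#include <assert.h>

#define NUM_NODES 4   /* assumed cluster size, for illustration only */

/* Hypothetical per-process record; this is not an ORTE structure. */
typedef struct {
    int rank;      /* rank of the process */
    int node;      /* node currently hosting the process */
    int ckpt_seq;  /* sequence number of its latest checkpoint, -1 if none */
} ft_proc_t;

/* Assumed policy: each node is protected by its ring neighbour. */
static int protector_of(int node)
{
    assert(node >= 0 && node < NUM_NODES);
    return (node + 1) % NUM_NODES;
}

/* Uncoordinated checkpoint: each process advances its own sequence number
 * independently. The checkpoint is assumed to be stored locally and copied
 * to the protector of the hosting node. */
static void take_checkpoint(ft_proc_t *p)
{
    p->ckpt_seq++;
}

/* Reassign every process hosted on the failed node to that node's
 * protector, where it would be restarted from checkpoint ckpt_seq.
 * Returns the number of processes that were relocated. */
static int relocate_on_failure(ft_proc_t *procs, int nprocs, int failed_node)
{
    int target = protector_of(failed_node);
    int moved = 0;

    for (int i = 0; i < nprocs; i++) {
        if (procs[i].node == failed_node) {
            procs[i].node = target;
            moved++;
        }
    }
    return moved;
}
```

The part this sketch leaves out is exactly what my question is about: how
the new location of a relocated process is made known to everyone else.
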
So, my options are:
1. I detect the failure and then use
orte_errmgr_hnp_base_global_update_state(), with some modifications, together
with hnp_relocate, but change the spawning so that it restarts the process
from a checkpoint. I suppose that this way the migration of the process to
another node will be recorded and everyone will know about it, because it is
the HNP that performs it (is this correct?).
2. I modify one of the spawn variations (probably the remote_spawn from rsh)
in the PLM framework and then use orted_comm to command a remote_spawn on
the protector. In this case, however, I don't know how to update the
information so that everyone knows about the change, or how this is managed.
I might be very wrong in what I said; my apologies if so.
Thanks a lot for all the help.