The resilient mapper -only- works on procs being restarted - it cannot map a job for its initial launch. Don't set any rmaps flag (i.e., drop "-mca rmaps resilient" from your command line) and things will work correctly - the default round-robin mapper will map the initial launch, and then the resilient mapper will handle restarts.


On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:

Ralph.

I'm having a problem when I try to select the resilient rmaps component:

/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt 

I get this error:
[clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for nodes
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.

--------------------------------------------------------------------------
[clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. Process state updated for process NULL
[clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
[clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED
[clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with status 1

Is there a flag that I'm not turning on, or a component that I should have selected?

Thanks again.

Hugo Meyer


2011/3/26 Hugo Meyer <meyer.hugo@gmail.com>
Ok Ralph.

Thanks a lot for your help. I will do as you said and then let you know how it goes.

Best Regards.

Hugo Meyer


2011/3/25 Ralph Castain <rhc@open-mpi.org>

On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:

From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the required info to all local child processes so they can take appropriate action.

Failure detection, re-launch, etc. have all been taken care of for you.


I looked at the code you mentioned and realized that I have two possible options, which I'd like to share with you to get your opinion.

First of all, let me describe the current state of my implementation. I'm working on a fault-tolerant system that uses uncoordinated checkpointing: I take checkpoints of all my processes at different times and store each one on the machine where its process resides, but I also send each checkpoint to another node (let's call it the protector). If a node fails, its processes should be restarted on the protector that holds their checkpoints.

Right now I can detect the failure of a process, I know where that process should be restarted, and its checkpoint is already on the protector. I also have the child information, of course.
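
To make the protector scheme described above concrete, here is a minimal standalone sketch in C. It is only an illustration and does not use any ORTE data structure or API: the type ckpt_entry_t and the functions restart_node_for() and count_ranks_on_node() are hypothetical names invented for this example, and the table layout is an assumption about how the rank-to-protector association might be kept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record (not an ORTE type): one entry per process rank. */
typedef struct {
    int rank;                    /* rank of the process */
    const char *home_node;       /* node where the process currently runs */
    const char *protector_node;  /* node that holds a copy of its checkpoint */
} ckpt_entry_t;

/* Return the node on which `rank` should be restarted after a failure,
 * i.e. its protector, or NULL if the rank is not in the table. */
const char *restart_node_for(const ckpt_entry_t *table, size_t n, int rank)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].rank == rank) {
            return table[i].protector_node;
        }
    }
    return NULL;
}

/* Count how many ranks live on `node`, i.e. how many processes would
 * have to be restarted on their protectors if that node failed. */
size_t count_ranks_on_node(const ckpt_entry_t *table, size_t n, const char *node)
{
    assert(node != NULL);
    assert(table != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].home_node, node) == 0) {
            count++;
        }
    }
    return count;
}
```

In a real implementation this lookup would presumably live wherever the restart location is decided, and the node names would come from the job's node list rather than from string literals.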

So, my options are:
First Option

I detect the failure, and then I use orte_errmgr_hnp_base_global_update_state() with some modifications, together with hnp_relocate, but with the spawn changed to restart from a checkpoint. I suppose that with this approach the migration of the process to another node will be recorded and everyone will know about it, because it is the HNP that does this (is this OK?).

This is the option I would use. The other one is much, much more work. In this option, you only have to:

(a) modify the mapper so you can specify the location of the proc being restarted. The resilient mapper module will be handling the restart - if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code doing the "replacement" and modify accordingly.

(b) add any required info about your checkpoint to the launch message. This gets created in orte/mca/odls/base/odls_base_default_fns.c, the "get_add_procs_data" function (at the top of the file).

(c) modify the launch code to handle your checkpoint, if required - see the file in (b), the "construct_child" and "launch" functions.
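
Steps (b) and (c) above can be pictured with a small standalone C sketch. This is a toy model under stated assumptions, not the code in odls_base_default_fns.c: launch_entry_t, fill_launch_entry() and choose_action() are hypothetical names, and the idea of carrying a checkpoint path per process is only one possible form of the "required info".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CKPT_PATH_MAX 256

/* Hypothetical per-process entry in a launch message (not an ORTE type). */
typedef struct {
    int  rank;                      /* rank of the process to start */
    bool is_restart;                /* true if the process is being restarted */
    char ckpt_path[CKPT_PATH_MAX];  /* checkpoint location; empty for a fresh launch */
} launch_entry_t;

typedef enum {
    ACTION_FRESH_START,
    ACTION_RESTORE_CHECKPOINT
} launch_action_t;

/* Step (b): record the checkpoint info for one process.
 * Pass ckpt_path == NULL for an initial launch.
 * Returns 0 on success, -1 if the path does not fit in the entry. */
int fill_launch_entry(launch_entry_t *e, int rank, const char *ckpt_path)
{
    assert(e != NULL);
    e->rank = rank;
    e->is_restart = false;
    e->ckpt_path[0] = '\0';
    if (ckpt_path != NULL) {
        if (strlen(ckpt_path) >= CKPT_PATH_MAX) {
            return -1;
        }
        strcpy(e->ckpt_path, ckpt_path);
        e->is_restart = true;
    }
    return 0;
}

/* Step (c): on the receiving side, decide how to start the process. */
launch_action_t choose_action(const launch_entry_t *e)
{
    assert(e != NULL);
    if (e->is_restart && e->ckpt_path[0] != '\0') {
        return ACTION_RESTORE_CHECKPOINT;
    }
    return ACTION_FRESH_START;
}
```

The point of the sketch is only the division of labor: the side that builds the launch message attaches the checkpoint info, and the side that launches the child inspects it to pick between a normal start and a restore.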

HTH
Ralph



Second Option

Modify one of the spawn variations (probably the remote_spawn from rsh) in the PLM framework and then use orted_comm to command a remote_spawn on the protector. Here, though, I don't know how to update the info so that everyone knows about the change, or how this is managed.

I might be very wrong in what I said, my apologies if so.

Thanks a lot for all the help.

Best regards.

Hugo Meyer

_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

