Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-30 14:57:26


Thanks Ralph.
I have finished point (a), and it is now working. Next I have to work on
relaunching from my checkpoint, as you said.

Best regards.

Hugo Meyer

2011/3/29 Ralph Castain <rhc_at_[hidden]>

> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launch. You shouldn't set any rmaps flag and things
> will work correctly - the default round-robin mapper will map the initial
> launch, and then the resilient mapper will handle restarts.
>
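As a rough sketch of what this advice means for the command quoted further down: the same launch with `-mca rmaps resilient` dropped, so that the default mapper handles the initial launch. Every other option is copied unchanged from the original report. The `MPIRUN` and `RUN` variables and the dry-run wrapper are illustrative assumptions, not part of Open MPI.

```shell
#!/bin/sh
# Launch command from the report with "-mca rmaps resilient" removed, so the
# default mapper handles the initial launch. All other options are copied
# unchanged from the original command.
MPIRUN="${MPIRUN:-mpirun}"   # assumption: mpirun is on PATH, or MPIRUN is set
CMD="$MPIRUN -v -np 4 --hostfile ../hostfile --bynode -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10"

# Dry run by default: print the command. Set RUN=1 to actually launch it.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    $CMD 2>out.txt
fi
```

To use a specific build, point `MPIRUN` at it, for example the `binarios/bin/mpirun` path from the original command.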
>
> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>
> Ralph.
>
> I'm having a problem when I try to select the resilient rmaps component:
>
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
> rsh -mca routed cm ./coll 6 10 2>out.txt
>
>
> I get this error:
>
> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
> nodes
> --------------------------------------------------------------------------
> Your job failed to map. Either no mapper was available, or none
> of the available mappers was able to perform the requested
> mapping operation. This can happen if you request a map type
> (e.g., loadbalance) and the corresponding mapper was not built.
>
> --------------------------------------------------------------------------
> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. Process
> state updated for process NULL
> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
> LAUNCHED
> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with
> status 1
>
>
> Is there a flag that I'm not turning on, or a component that I should have
> selected?
>
> Thanks again.
>
> Hugo Meyer
>
>
> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>
>> Ok Ralph.
>>
>> Thanks a lot for your help. I will do as you said and then let you know
>> how it goes.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>
>>>
>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>
>>>> From what you've described before, I suspect all you'll need to do is add
>>>> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
>>>> see if a process in the launch message is being relocated (the
>>>> construct_child_list code does that already), and then (b) sends the
>>>> required info to all local child processes so they can take appropriate
>>>> action.
>>>>
>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>
>>>
>>>
>>> I looked at the code you mentioned and realized that I have two possible
>>> options, which I'd like to share with you to get your opinion.
>>>
>>> First, let me describe the current state of my implementation. I'm working
>>> on a fault-tolerant system that uses uncoordinated checkpointing: I take
>>> checkpoints of all my processes at different times and store them on the
>>> machine where each process resides, but I also send these checkpoints to
>>> another node (let's call it the protector). If a node fails, its processes
>>> should be restarted on the protector that holds their checkpoints.
>>>
>>> Right now I'm detecting the failure of a process, I know where this
>>> process should be restarted, and I have the checkpoint on the protector.
>>> I also have the child information, of course.
>>>
>>> So, my options are:
>>>
>>> *First Option*
>>>
>>> I detect the failure, and then I use
>>> orte_errmgr_hnp_base_global_update_state() with some modifications, together
>>> with hnp_relocate, but changing the spawn so that it restarts from a
>>> checkpoint. I suppose that this way the migration of the process to another
>>> node will be recorded and everyone will know about it, because it is the HNP
>>> that does this (is this OK?).
>>>
>>>
>>> This is the option I would use. The other one is much, much more work. In
>>> this option, you only have to:
>>>
>>> (a) modify the mapper so you can specify the location of the proc being
>>> restarted. The resilient mapper module will be handling the restart - if you
>>> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
>>> doing the "replacement" and modify accordingly.
>>>
>>> (b) add any required info about your checkpoint to the launch message.
>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>> "get_add_procs_data" function (at the top of the file).
>>>
>>> (c) modify the launch code to handle your checkpoint, if required - see
>>> the file in (b), the "construct_child" and "launch" functions.
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>>
>>> *Second Option*
>>>
>>> Modify one of the spawn variations (probably the remote_spawn from rsh) in
>>> the PLM framework and then use orted_comm to command a remote_spawn on
>>> the protector. Here, though, I don't know how to update the info so that
>>> everyone knows about the change, or how this is managed.
>>>
>>> I might be very wrong in what I said, my apologies if so.
>>>
>>> Thanks a lot for all the help.
>>>
>>> Best regards.
>>>
>>> Hugo Meyer
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel