Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-30 17:09:21


Hello again.

I'm working on the launch code to handle my checkpoints, but I'm a little
stuck on how to set the path to my checkpoint and the executable
(ompi_blcr_context.PID). I took a look at the code in
odls_base_default_fns.c, and this piece of code caught my attention:

#if OPAL_ENABLE_FT_CR == 1
            /*
             * OPAL CRS components need the opportunity to take action
             * before a process is forked.
             * Needs access to:
             * - Environment
             * - Rank/ORTE Name
             * - Binary to exec
             */
            if( NULL != opal_crs.crs_prelaunch ) {
                if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
                                         orte_sstore_base_prelaunch_location,
                                         &(app->app),
                                         &(app->cwd),
                                         &(app->argv),
                                         &(app->env) ) ) ) {
                    ORTE_ERROR_LOG(rc);
                    goto CLEANUP;
                }
            }
#endif

But I couldn't find out how to set orte_sstore_base_prelaunch_location; I
know that it is initially set in sstore_base_open. For example, since I'm
transferring my checkpoint from one node to another, I store the checkpoint
that has to be restored in /tmp/1/, and it has a name like
ompi_blcr_context.PID.
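
To make the naming scheme above concrete, here is a minimal sketch of how
the full checkpoint path could be assembled from the directory and the PID.
The helper name build_ckpt_path is hypothetical and is not part of Open MPI;
the only things taken from the message are the /tmp/1/ directory and the
ompi_blcr_context.PID file name.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not an Open MPI function): writes
 * "<dir>/ompi_blcr_context.<pid>" into buf.
 * Trailing slashes on dir are ignored so "/tmp/1" and "/tmp/1/" give the
 * same result. Returns 0 on success, -1 on bad arguments or if buf is
 * too small to hold the whole path.
 */
int build_ckpt_path(char *buf, size_t len, const char *dir, pid_t pid)
{
    size_t dlen;
    int n;

    if (NULL == buf || NULL == dir || 0 == len || '\0' == dir[0]) {
        return -1;
    }

    /* Drop trailing '/' characters; "/" alone collapses to an empty prefix. */
    dlen = strlen(dir);
    while (dlen > 0 && '/' == dir[dlen - 1]) {
        dlen--;
    }

    n = snprintf(buf, len, "%.*s/ompi_blcr_context.%ld",
                 (int)dlen, dir, (long)pid);
    if (n < 0 || (size_t)n >= len) {
        return -1;  /* output error or truncation */
    }
    return 0;
}
```

Called with a 64-byte buffer, the directory "/tmp/1/" and PID 1234, this
fills the buffer with "/tmp/1/ompi_blcr_context.1234".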

Is there any function that I missed that allows me to do this? I'm
asking because I do not want to change the signatures of the functions
to pass the details of the checkpoint and the PID.
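
One generic way to avoid changing signatures is to keep the value in a
file-scope variable and update it through a small setter before the launch
code runs. The sketch below shows only that pattern. The names
restart_location, set_restart_location and get_restart_location are
hypothetical stand-ins, and nothing here is claimed to be how the sstore
framework actually manages orte_sstore_base_prelaunch_location.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a framework-level global location string. */
static char *restart_location = NULL;

/*
 * Replace the stored location with a private copy of path.
 * Passing NULL clears the stored value.
 * Returns 0 on success, -1 if memory could not be allocated
 * (the previous value is left untouched in that case).
 */
int set_restart_location(const char *path)
{
    char *copy = NULL;

    if (NULL != path) {
        size_t n = strlen(path) + 1;
        copy = malloc(n);
        if (NULL == copy) {
            return -1;
        }
        memcpy(copy, path, n);
    }

    free(restart_location);  /* free(NULL) is a no-op */
    restart_location = copy;
    return 0;
}

/* Read accessor; returns NULL if no location has been set. */
const char *get_restart_location(void)
{
    return restart_location;
}
```

With this pattern, the code that decides where the checkpoint lives calls
the setter once, and the launch path reads the value without any change to
the functions in between. Whether this is safe depends on the value being
set before each launch and on only one thread touching it, which is an
assumption here.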

Best Regards.

Hugo Meyer

2011/3/30 Hugo Meyer <meyer.hugo_at_[hidden]>

> Thanks Ralph.
> I have finished point (a), and it's working now. Next I have to work on
> relaunching from my checkpoint, as you said.
>
> Best regards.
>
> Hugo Meyer
>
>
> 2011/3/29 Ralph Castain <rhc_at_[hidden]>
>
>> The resilient mapper -only- works on procs being restarted - it cannot map
>> a job for its initial launch. You shouldn't set any rmaps flag and things
>> will work correctly - the default round-robin mapper will map the initial
>> launch, and then the resilient mapper will handle restarts.
>>
>>
>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>
>> Ralph.
>>
>> I'm having a problem when I try to select the resilient rmaps component:
>>
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>
>>
>> I get this as error:
>>
>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>> nodes
>> --------------------------------------------------------------------------
>> Your job failed to map. Either no mapper was available, or none
>> of the available mappers was able to perform the requested
>> mapping operation. This can happen if you request a map type
>> (e.g., loadbalance) and the corresponding mapper was not built.
>>
>> --------------------------------------------------------------------------
>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>> Process state updated for process NULL
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED
>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with
>> status 1
>>
>>
>> Is there a flag that I'm not turning on, or a component that I should
>> have selected?
>>
>> Thanks again.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>>
>>> Ok Ralph.
>>>
>>> Thanks a lot for your help. I will do as you said and then let you know
>>> how it goes.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>>
>>>>
>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>
>>>>> From what you've described before, I suspect all you'll need to do is
>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>>>>> to see if a process in the launch message is being relocated (the
>>>>> construct_child_list code does that already), and then (b) sends the
>>>>> required info to all local child processes so they can take appropriate
>>>>> action.
>>>>>
>>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>>
>>>>
>>>>
>>>> I looked at the code that you mentioned, and I realized that I have
>>>> two possible options, which I'd like to share with you to get your
>>>> opinion.
>>>>
>>>> First of all, let me describe the current state of my implementation.
>>>> I'm working on a fault-tolerant system that uses uncoordinated
>>>> checkpointing, so I take checkpoints of all my processes at different
>>>> times and store them on the machine where they reside. I also send
>>>> these checkpoints to another node (let's call it the protector), so if
>>>> a node fails, its processes should be restarted on the protector that
>>>> holds their checkpoints.
>>>>
>>>> Right now I'm detecting the failure of a process, I know where this
>>>> process should be restarted, and I have the checkpoint on the
>>>> protector. I also have the child information, of course.
>>>>
>>>> So, my options are:
>>>> *First Option*
>>>> I detect the failure, and then I use
>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>> hnp_relocate, but changing the spawn so that it restarts from a
>>>> checkpoint. I suppose that with this approach the migration of the
>>>> process to another node will be recorded and everyone will know about
>>>> it, because the HNP is the one doing it (is this OK?).
>>>>
>>>>
>>>> This is the option I would use. The other one is much, much more work.
>>>> In this option, you only have to:
>>>>
>>>> (a) modify the mapper so you can specify the location of the proc being
>>>> restarted. The resilient mapper module will be handling the restart - if you
>>>> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
>>>> doing the "replacement" and modify accordingly.
>>>>
>>>> (b) add any required info about your checkpoint to the launch message.
>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>>> "get_add_procs_data" function (at the top of the file).
>>>>
>>>> (c) modify the launch code to handle your checkpoint, if required - see
>>>> the file in (b), the "construct_child" and "launch" functions.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>>
>>>>
>>>> *Second Option*
>>>> Modify one of the spawn variations (probably the remote_spawn from rsh)
>>>> in the PLM framework and then use orted_comm to command a remote_spawn
>>>> on the protector, but here I don't know how to update the info so
>>>> everyone knows about the change, or how this is managed.
>>>>
>>>> I might be very wrong in what I said, my apologies if so.
>>>>
>>>> Thanks a lot for all the help.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>
>>>
>>
>
>