Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-04-05 06:33:28


Hello Ralph and all.

Ralph, following your recommendations I have managed to restart the process
on another node from its checkpoint. But now I'm having a small problem with
oob_tcp. Here is the output:

odls_base_default_fns:SETEANDO BLCR CONTEXT
CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
[1,1]<stdout>:INICIEI O BROADCAST (2)
[1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
*[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
*[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen
socket: Unable to open a TCP socket for out-of-band communications*
[1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final
handshake.
[1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status
(13, /tmp/radic/1) for [[34224,1],1]
[1,0]<stdout>:INICIEI O BROADCAST (6)
[1,0]<stdout>:FINALIZEI O BROADCAST (6)
[1,0]<stdout>:INICIEI O BROADCAST
[1,3]<stdout>:INICIEI O BROADCAST (6)
[1,3]<stdout>:FINALIZEI O BROADCAST (6)
[1,3]<stdout>:INICIEI O BROADCAST
*[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0]
reported state COMMUNICATION FAILURE for proc [[34224,0],1] state
COMMUNICATION FAILURE exit_code 1*
*[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline
[[34224,0],1] lost*

I think this error occurs because the process wants to create the
socket using the port that was previously assigned to it. So, if I
restart it on a different port (or work around this some other way), how
will the other daemons and processes find out about the change? Is this a
good approach?
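
To make the "another port" idea concrete, here is a minimal sketch in plain
POSIX C. It is not the oob_tcp code and does not use any Open MPI API; the
function name listen_on_ephemeral_port is hypothetical. It only shows the
general technique: bind to port 0 so the kernel picks a free port, then read
the assigned port back with getsockname(), since that is the value the other
daemons and processes would somehow have to be told about.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not part of Open MPI):
 * open an IPv4 TCP listen socket on a kernel-chosen port and report which
 * port was assigned. Returns the socket descriptor, or -1 on failure.
 */
int listen_on_ephemeral_port(uint16_t *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);   /* 0 = let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    /* This is the port that peers would need to learn after a restart. */
    *port_out = ntohs(addr.sin_port);
    return fd;
}
```

A caller would declare a uint16_t, pass its address to
listen_on_ephemeral_port(), and then publish the returned port through
whatever mechanism the runtime uses to update contact information. That
publishing step is the open question here and is not covered by the sketch.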

Best regards.

Hugo Meyer

2011/3/31 Hugo Meyer <meyer.hugo_at_[hidden]>

> Ok Ralph.
> Thanks a lot, I will resend this message with a new subject.
>
> Best Regards.
>
> Hugo
>
>
> 2011/3/31 Ralph Castain <rhc_at_[hidden]>
>
>> Sorry - should have included the devel list when I sent this.
>>
>>
>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>
>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>> take a quick glance at the sstore framework, though, and it looks like there
>> are some params you could set that might help.
>>
>> "ompi_info --param sstore all"
>>
>> should tell you what's available. Also, note that Josh created a man page
>> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
>> should get it.
>>
>>
>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>
>> Hello again.
>>
>> I'm working on the launch code to handle my checkpoints, but I'm a little
>> stuck on how to set the path to my checkpoint and the executable
>> (ompi_blcr_context.PID). I took a look at the code in
>> odls_base_default_fns.c and this piece of code caught my attention:
>>
>> #if OPAL_ENABLE_FT_CR == 1
>>     /*
>>      * OPAL CRS components need the opportunity to take action before a
>>      * process is forked.
>>      * Needs access to:
>>      *   - Environment
>>      *   - Rank/ORTE Name
>>      *   - Binary to exec
>>      */
>>     if( NULL != opal_crs.crs_prelaunch ) {
>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>                                                          orte_sstore_base_prelaunch_location,
>>                                                          &(app->app),
>>                                                          &(app->cwd),
>>                                                          &(app->argv),
>>                                                          &(app->env) ) ) ) {
>>             ORTE_ERROR_LOG(rc);
>>             goto CLEANUP;
>>         }
>>     }
>> #endif
>>
>>
>> But I couldn't find out how to set orte_sstore_base_prelaunch_location. I
>> know that it is initially set in sstore_base_open. For example, as I'm
>> transferring my checkpoint from one node to another, I store the checkpoint
>> that has to be restored in /tmp/1/ and it has a name
>> like ompi_blcr_context.PID.
>>
>> Is there any function I have missed that allows me to do this? I'm
>> asking because I do not want to change the signature of the
>> functions to pass the details of the checkpoint and the PID.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>> 2011/3/30 Hugo Meyer <meyer.hugo_at_[hidden]>
>>
>>> Thanks Ralph.
>>> I have finished point (a) and it's working now. Next I have to work on
>>> relaunching from my checkpoint, as you said.
>>>
>>> Best regards.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/29 Ralph Castain <rhc_at_[hidden]>
>>>
>>>> The resilient mapper -only- works on procs being restarted - it cannot
>>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>>> things will work correctly - the default round-robin mapper will map the
>>>> initial launch, and then the resilient mapper will handle restarts.
>>>>
>>>>
>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>
>>>> Ralph.
>>>>
>>>> I'm having a problem when I try to select the resilient rmaps component:
>>>>
>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4
>>>> --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver
>>>> -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>
>>>>
>>>> I get this error:
>>>>
>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>>>> nodes
>>>>
>>>> --------------------------------------------------------------------------
>>>> Your job failed to map. Either no mapper was available, or none
>>>> of the available mappers was able to perform the requested
>>>> mapping operation. This can happen if you request a map type
>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>> Process state updated for process NULL
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>> NEVER LAUNCHED
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>>> with status 1
>>>>
>>>>
>>>> Is there a flag that I'm not turning on, or a component that I should
>>>> have selected?
>>>>
>>>> Thanks again.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>
>>>>> Ok Ralph.
>>>>>
>>>>> Thanks a lot for your help, I will do as you said and then let you know
>>>>> how it goes.
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>>>>
>>>>>>
>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>
>>>>>>> From what you've described before, I suspect all you'll need to do is
>>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>>>>>>> to see if a process in the launch message is being relocated (the
>>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>>> required info to all local child processes so they can take appropriate
>>>>>>> action.
>>>>>>>
>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>> you.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I looked at the code you mentioned and realized that I have two
>>>>>> possible options, which I will share with you to get your opinion.
>>>>>>
>>>>>> First of all, let me describe the current state of the
>>>>>> implementation. I'm working on a fault-tolerant system that uses
>>>>>> uncoordinated checkpointing, so I take checkpoints of all my processes at
>>>>>> different times and store them on the machine where they reside. I also
>>>>>> send these checkpoints to another node (let's call it the protector), so
>>>>>> if a node fails, its processes should be restarted on the protector that
>>>>>> holds their checkpoints.
>>>>>>
>>>>>> Right now I'm detecting the failure of a process and I know where this
>>>>>> process should be restarted, and I also have the checkpoint on the
>>>>>> protector. And I also have the child information, of course.
>>>>>>
>>>>>> So, my options are:
>>>>>> *First Option*
>>>>>>
>>>>>> I detect the failure, and then I use
>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>>>> hnp_relocate, but changing the spawn to restart from a checkpoint.
>>>>>> I suppose that with this approach the migration of the process to another
>>>>>> node will be recorded and everyone will know about it, because it is the
>>>>>> HNP that is going to do this (is this ok?).
>>>>>>
>>>>>>
>>>>>> This is the option I would use. The other one is much, much more work.
>>>>>> In this option, you only have to:
>>>>>>
>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>> being restarted. The resilient mapper module will be handling the restart -
>>>>>> if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the
>>>>>> code doing the "replacement" and modify accordingly.
>>>>>>
>>>>>> (b) add any required info about your checkpoint to the launch message.
>>>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>>>>> "get_add_procs_data" function (at the top of the file).
>>>>>>
>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>
>>>>>> HTH
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Second Option*
>>>>>>
>>>>>> Modify one of the spawn variations (probably the remote_spawn from
>>>>>> rsh) in the PLM framework and then use orted_comm to command a
>>>>>> remote_spawn on the protector, but here I don't know how to update the info
>>>>>> so that everyone knows about the change, or how this is managed.
>>>>>>
>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>>
>>>>>> Thanks a lot for all the help.
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>