Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-04-08 11:02:34


Thanks Ralph.

I found set_lifeline, and with it I think I solved that error, but now I'm
dealing with another one:

[clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number
of attempts to create TCP connection has been exceeded. Can not communicate
with peer
Open MPI Error Report:[32001]: While communicating to proc [[44269,1],1] on
node node3, proc [[44269,0],2] on node clus3 encountered an error
'Communication failure':OOB Connection retries exceeded. Can not
communicate with peer

I think this occurs because the daemon [[44269,0],2] doesn't know on which
port and address the proc has been restored. I will look for a way to
update this information.
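
A minimal sketch of the kind of update I have in mind, assuming each daemon
keeps a small table of peer contact records. Everything here (peer_name_t,
contact_table_t, contact_update, contact_lookup) is a hypothetical
illustration and not an Open MPI API; the hosts and ports used with it are
made-up example values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PEERS 16
#define HOST_LEN  64

/* Hypothetical, simplified process name such as [[44269,1],1]. */
typedef struct {
    uint32_t job_family;
    uint32_t local_job;
    uint32_t vpid;
} peer_name_t;

/* Hypothetical contact record: where a peer currently listens. */
typedef struct {
    peer_name_t name;
    char        host[HOST_LEN];
    uint16_t    port;
    int         in_use;
} peer_contact_t;

typedef struct {
    peer_contact_t entries[MAX_PEERS];
} contact_table_t;

void contact_table_init(contact_table_t *t)
{
    memset(t, 0, sizeof(*t));
}

static int same_name(peer_name_t a, peer_name_t b)
{
    return a.job_family == b.job_family &&
           a.local_job == b.local_job &&
           a.vpid == b.vpid;
}

/* Return the record for `name`, or NULL if the peer is unknown. */
peer_contact_t *contact_lookup(contact_table_t *t, peer_name_t name)
{
    for (size_t i = 0; i < MAX_PEERS; i++) {
        if (t->entries[i].in_use && same_name(t->entries[i].name, name)) {
            return &t->entries[i];
        }
    }
    return NULL;
}

/* Insert a peer or overwrite its stale address after a restart.
 * Returns 0 on success, -1 if the table is full. */
int contact_update(contact_table_t *t, peer_name_t name,
                   const char *host, uint16_t port)
{
    assert(t != NULL && host != NULL);
    peer_contact_t *c = contact_lookup(t, name);
    if (c == NULL) {
        for (size_t i = 0; i < MAX_PEERS; i++) {
            if (!t->entries[i].in_use) {
                c = &t->entries[i];
                break;
            }
        }
        if (c == NULL) {
            return -1;
        }
        c->name = name;
        c->in_use = 1;
    }
    snprintf(c->host, sizeof(c->host), "%s", host);
    c->port = port;
    return 0;
}
```

The idea is that a daemon would call contact_update() when it learns that a
proc was restored elsewhere, and call contact_lookup() before retrying the
connection. How that notification reaches the other daemons is left out of
this sketch.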

Best regards.

Hugo

2011/4/6 Ralph Castain <rhc_at_[hidden]>

> Looks like the lifeline is still pointing to its old daemon instead of
> being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c -
> should be something in there that updates the lifeline during restart of a
> checkpoint.
>
>
> On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
>
> Hi all.
>
>
> I corrected the error with the port. The mistake happened because the
> process was being restarted and the ports are static, so the process was
> trying to take a port where an app was already running.
>
> Initially, the process was running on [[65478,0],1] and then it moved
> to [[65478,0],2].
>
> So now the socket gets bound, but I'm getting a communication failure
> in [[65478,0],1]. I'm attaching my debug output (some of it is in Spanish,
> but the default Open MPI debug output is still there), where you can see
> everything from the moment I kill the process running on *clus5* to the
> moment it is restored on *clus3*. Then I get a TERMINATED WITHOUT SYNC in
> the restarted proc:
>
> *[clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC
> for proc [[65478,1],1] pid 21705*
>
> Here is my stdout output after the socket is bound again when the process
> restarts.
>
>
> [1,1]<stdout>:SOCKET BINDED
> [1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final
> handshake.
> [1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status
> (13, /tmp/radic/1) for [[65478,1],1]
> [1,0]<stdout>:INICIEI O BROADCAST (6)
> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
> [1,0]<stdout>:INICIEI O BROADCAST
> [1,3]<stdout>:INICIEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST (6)
> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
> [1,3]<stdout>:INICIEI O BROADCAST
> [1,2]<stdout>:FINALIZEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0]
> reported state COMMUNICATION FAILURE for proc [[65478,0],1] state
> COMMUNICATION FAILURE exit_code 1
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline
> [[65478,0],1] lost
> [1,1]<stdout>:[[65478,1],1] assigned port 31256
>
> Any help on how to solve this error, or how to interpret it will be greatly
> appreciated.
>
> Best regards.
>
> Hugo
>
> 2011/4/5 Hugo Meyer <meyer.hugo_at_[hidden]>
>
>> Hello Ralph and all.
>>
>> Ralph, by following your recommendations I have already restarted the
>> process on another node from its checkpoint. But now I'm having a small
>> problem with oob_tcp. Here is the output:
>>
>> odls_base_default_fns:SETEANDO BLCR CONTEXT
>> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
>> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
>> [1,1]<stdout>:INICIEI O BROADCAST (2)
>> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
>> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
>> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4
>> listen socket: Unable to open a TCP socket for out-of-band communications*
>> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final
>> handshake.
>> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status
>> (13, /tmp/radic/1) for [[34224,1],1]
>> [1,0]<stdout>:INICIEI O BROADCAST (6)
>> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,0]<stdout>:INICIEI O BROADCAST
>> [1,3]<stdout>:INICIEI O BROADCAST (6)
>> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,3]<stdout>:INICIEI O BROADCAST
>> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0]
>> reported state COMMUNICATION FAILURE for proc [[34224,0],1] state
>> COMMUNICATION FAILURE exit_code 1*
>> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to
>> lifeline [[34224,0],1] lost*
>>
>>
>> I think this error occurs because the process wants to create the socket
>> using the port that was previously assigned to it. So, if I restart it
>> using another port, how will the other daemons and processes find out
>> about this? Is this a good choice?
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>> 2011/3/31 Hugo Meyer <meyer.hugo_at_[hidden]>
>>
>>> Ok Ralph.
>>> Thanks a lot, I will resend this message with a new subject.
>>>
>>> Best Regards.
>>>
>>> Hugo
>>>
>>>
>>> 2011/3/31 Ralph Castain <rhc_at_[hidden]>
>>>
>>>> Sorry - should have included the devel list when I sent this.
>>>>
>>>>
>>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>>
>>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>>> take a quick glance at the sstore framework, though, and it looks like there
>>>> are some params you could set that might help.
>>>>
>>>> "ompi_info --param sstore all"
>>>>
>>>> should tell you what's available. Also, note that Josh created a man
>>>> page to explain how sstore works. It's in section 7, looks like "man
>>>> orte_sstore" should get it.
>>>>
>>>>
>>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>>
>>>> Hello again.
>>>>
>>>> I'm working on the launch code to handle my checkpoints, but I'm a
>>>> little stuck on how to set the path to my checkpoint and the executable
>>>> (ompi_blcr_context.PID). I took a look at the code in
>>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>>
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>     /*
>>>>      * OPAL CRS components need the opportunity to take action before a
>>>>      * process is forked.
>>>>      * Needs access to:
>>>>      *   - Environment
>>>>      *   - Rank/ORTE Name
>>>>      *   - Binary to exec
>>>>      */
>>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>>                                       orte_sstore_base_prelaunch_location,
>>>>                                       &(app->app),
>>>>                                       &(app->cwd),
>>>>                                       &(app->argv),
>>>>                                       &(app->env) ) ) ) {
>>>>             ORTE_ERROR_LOG(rc);
>>>>             goto CLEANUP;
>>>>         }
>>>>     }
>>>> #endif
>>>>
>>>>
>>>> But I couldn't find out how to set orte_sstore_base_prelaunch_location;
>>>> I know that initially it is set in sstore_base_open. For example, since
>>>> I'm transferring my checkpoint from one node to another, I store the
>>>> checkpoint that has to be restored in /tmp/1/, and it has a name
>>>> like ompi_blcr_context.PID.
>>>>
>>>> Is there any function that I didn't see that allows me to do this? I'm
>>>> asking because I do not want to change the signature of the
>>>> functions to pass the details of the checkpoint and the PID.
>>>>
>>>> Best Regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> 2011/3/30 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>
>>>>> Thanks Ralph.
>>>>> I have finished point (a), and it's working now. Next I have to work
>>>>> on relaunching from my checkpoint, as you said.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/29 Ralph Castain <rhc_at_[hidden]>
>>>>>
>>>>>> The resilient mapper -only- works on procs being restarted - it
>>>>>> cannot map a job for its initial launch. You shouldn't set any rmaps flag
>>>>>> and things will work correctly - the default round-robin mapper will map the
>>>>>> initial launch, and then the resilient mapper will handle restarts.
>>>>>>
>>>>>>
>>>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>>>
>>>>>> Ralph.
>>>>>>
>>>>>> I'm having a problem when I try to select the resilient rmaps
>>>>>> component:
>>>>>>
>>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4
>>>>>> --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver
>>>>>> -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>>>
>>>>>>
>>>>>> I get this error:
>>>>>>
>>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile
>>>>>> for nodes
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Your job failed to map. Either no mapper was available, or none
>>>>>> of the available mappers was able to perform the requested
>>>>>> mapping operation. This can happen if you request a map type
>>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>>>> Process state updated for process NULL
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>>> NEVER LAUNCHED
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>>>>> with status 1
>>>>>>
>>>>>>
>>>>>> Is there a flag that I'm not turning on, or a component that I should
>>>>>> have selected?
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>>
>>>>>> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>>>
>>>>>>> Ok Ralph.
>>>>>>>
>>>>>>> Thanks a lot for your help. I will do as you said and then let you
>>>>>>> know how it goes.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>> Hugo Meyer
>>>>>>>
>>>>>>>
>>>>>>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>>>
>>>>>>>>> From what you've described before, I suspect all you'll need to do
>>>>>>>>> is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>>>>>>>>> checks to see if a process in the launch message is being relocated (the
>>>>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>>>>> required info to all local child processes so they can take appropriate
>>>>>>>>> action.
>>>>>>>>>
>>>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I looked at the code that you mentioned and realized that I have two
>>>>>>>> possible options, which I'll share with you to get your opinion.
>>>>>>>>
>>>>>>>> First of all, let me describe the current state of my implementation.
>>>>>>>> I'm working on a fault-tolerant system that uses uncoordinated
>>>>>>>> checkpointing, so I take checkpoints of all my processes at different
>>>>>>>> times and store them on the machine where they reside. I also send
>>>>>>>> these checkpoints to another node (let's call it the protector), so if
>>>>>>>> a node fails, its processes should be restarted on the protector that
>>>>>>>> holds their checkpoints.
>>>>>>>>
>>>>>>>> Right now I'm detecting the failure of a process, I know where this
>>>>>>>> process should be restarted, and I have the checkpoint on the
>>>>>>>> protector. I also have the child information, of course.
>>>>>>>>
>>>>>>>> So, my options are:
>>>>>>>> *First Option*
>>>>>>>>
>>>>>>>> I detect the failure, and then I use
>>>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications, and
>>>>>>>> hnp_relocate with the spawning changed to restart from a checkpoint.
>>>>>>>> I suppose that this way the migration of the process to another node
>>>>>>>> will be recorded and everyone will know about it, because it is the HNP
>>>>>>>> that performs it (is this OK?).
>>>>>>>>
>>>>>>>>
>>>>>>>> This is the option I would use. The other one is much, much more
>>>>>>>> work. In this option, you only have to:
>>>>>>>>
>>>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>>>> being restarted. The resilient mapper module will be handling the restart -
>>>>>>>> if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the
>>>>>>>> code doing the "replacement" and modify accordingly.
>>>>>>>>
>>>>>>>> (b) add any required info about your checkpoint to the launch
>>>>>>>> message. This gets created in orte/mca/odls/base/odls_base_default_fns.c,
>>>>>>>> the "get_add_procs_data" function (at the top of the file).
>>>>>>>>
>>>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Second Option*
>>>>>>>>
>>>>>>>> Modify one of the spawn variations (probably the remote_spawn from
>>>>>>>> rsh) in the PLM framework and then use orted_comm to command a
>>>>>>>> remote_spawn on the protector. But here I don't know how to update the
>>>>>>>> info so that everyone knows about the change, or how this is managed.
>>>>>>>>
>>>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>>>>
>>>>>>>> Thanks a lot for all the help.
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> Hugo Meyer
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> <out>
>
>
>