Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-04-08 11:02:34


Thanks Ralph.

I found a set_lifeline with which I think I can solve that error, but now
I'm dealing with another one.

[clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number
of attempts to create TCP connection has been exceeded. Can not communicate
with peer
Open MPI Error Report:[32001]: While communicating to proc [[44269,1],1] on
node node3, proc [[44269,0],2] on node clus3 encountered an error
'Communication failure':OOB Connection retries exceeded. Can not
communicate with peer

I think this occurs because the daemon [[44269,0],2] doesn't know the
address and port at which the proc has been restored. I will look for a way
to update this information.
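
What I have in mind is roughly the sketch below (just an idea, not existing
code: how the URI actually reaches the daemons is an assumption on my part):

#include <stdlib.h>

#include "orte/mca/rml/rml.h"

/* Rough sketch: after the restarted proc binds its new OOB socket,
 * republish its contact info so peers stop using the stale address/port.
 * The transport of the URI back to the daemons is left as a comment
 * because that part does not exist yet. */
static void republish_contact_info(void)
{
    char *uri = orte_rml.get_contact_info();   /* "name;tcp://addr:port" */

    if (NULL == uri) {
        return;
    }
    /* ...send uri to the daemon(s) that must reach this proc... */
    /* a daemon that receives it would then call
     *     orte_rml.set_contact_info(uri);
     * so its OOB learns the new endpoint. */
    free(uri);
}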

Best regards.

Hugo

2011/4/6 Ralph Castain <rhc_at_[hidden]>

> Looks like the lifeline is still pointing to its old daemon instead of
> being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c -
> should be something in there that updates the lifeline during restart of a
> checkpoint.
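>
> Roughly, something along these lines (a sketch only; the cm component's
> actual variable names may differ):
>
> /* Sketch: re-point the lifeline at the proc's new daemon after a
>  * restart, so route_lost() no longer fires against the old one.
>  * The names below are assumptions, not the committed code. */
> static orte_process_name_t local_lifeline;
> static orte_process_name_t *lifeline = NULL;
>
> static void update_lifeline_after_restart(void)
> {
>     /* ORTE_PROC_MY_DAEMON reflects the daemon the proc reattaches to,
>      * so copy it into the lifeline the component checks. */
>     local_lifeline.jobid = ORTE_PROC_MY_DAEMON->jobid;
>     local_lifeline.vpid  = ORTE_PROC_MY_DAEMON->vpid;
>     lifeline = &local_lifeline;
> }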
>
>
> On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
>
> Hi all.
>
>
> I corrected the error with the port. The mistake was that when it tried to
> restart the process, the ports being static, the process was taking a port
> where an app was already running.
>
> Initially, the process was running on [[65478,0],1] and then it moved
> to [[65478,0],2].
>
> So now the socket gets bound, but I'm getting a communication failure
> in [[65478,0],1]. I'm sending my debug output as an attachment (some things
> are in Spanish, but the default Open MPI debug output is still there), where
> you can see the moment where I kill the process running on clus5 up to the
> moment where it is restored on clus3. And then I get a TERMINATED WITHOUT
> SYNC for the restarted proc:
>
> [clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC
> for proc [[65478,1],1] pid 21705
>
> Here is my stdout output after the socket is bound again when the process
> restarts:
>
>
> [1,1]<stdout>:SOCKET BINDED
> [1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final
> handshake.
> [1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status
> (13, /tmp/radic/1) for [[65478,1],1]
> [1,0]<stdout>:INICIEI O BROADCAST (6)
> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
> [1,0]<stdout>:INICIEI O BROADCAST
> [1,3]<stdout>:INICIEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST (6)
> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
> [1,3]<stdout>:INICIEI O BROADCAST
> [1,2]<stdout>:FINALIZEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0]
> reported state COMMUNICATION FAILURE for proc [[65478,0],1] state
> COMMUNICATION FAILURE exit_code 1
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline
> [[65478,0],1] lost
> [1,1]<stdout>:[[65478,1],1] assigned port 31256
>
> Any help on how to solve this error, or how to interpret it will be greatly
> appreciated.
>
> Best regards.
>
> Hugo
>
> 2011/4/5 Hugo Meyer <meyer.hugo_at_[hidden]>
>
>> Hello Ralph and @ll.
>>
>> Ralph, by following your recommendations I've already restarted the process
>> on another node from its checkpoint. But now I'm having a small problem with
>> the oob_tcp. Here is the output:
>>
>> odls_base_default_fns:SETEANDO BLCR CONTEXT
>> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
>> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
>> [1,1]<stdout>:INICIEI O BROADCAST (2)
>> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
>> [1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket
>> [1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4
>> listen socket: Unable to open a TCP socket for out-of-band communications
>> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final
>> handshake.
>> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status
>> (13, /tmp/radic/1) for [[34224,1],1]
>> [1,0]<stdout>:INICIEI O BROADCAST (6)
>> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,0]<stdout>:INICIEI O BROADCAST
>> [1,3]<stdout>:INICIEI O BROADCAST (6)
>> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,3]<stdout>:INICIEI O BROADCAST
>> [1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0]
>> reported state COMMUNICATION FAILURE for proc [[34224,0],1] state
>> COMMUNICATION FAILURE exit_code 1
>> [1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to
>> lifeline [[34224,0],1] lost
>>
>>
>> I'm thinking that this error occurs because the process wants to create
>> the socket using the port that was previously assigned to it. So, if I
>> restart it using another port or something, how will the other daemons and
>> processes find out about it? Is this a good approach?
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>> 2011/3/31 Hugo Meyer <meyer.hugo_at_[hidden]>
>>
>>> Ok Ralph.
>>> Thanks a lot, I will resend this message with a new subject.
>>>
>>> Best Regards.
>>>
>>> Hugo
>>>
>>>
>>> 2011/3/31 Ralph Castain <rhc_at_[hidden]>
>>>
>>>> Sorry - should have included the devel list when I sent this.
>>>>
>>>>
>>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>>
>>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>>> take a quick glance at the sstore framework, though, and it looks like there
>>>> are some params you could set that might help.
>>>>
>>>> "ompi_info --param sstore all"
>>>>
>>>> should tell you what's available. Also, note that Josh created a man
>>>> page to explain how sstore works. It's in section 7, looks like "man
>>>> orte_sstore" should get it.
>>>>
>>>>
>>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>>
>>>> Hello again.
>>>>
>>>> I'm working on the launch code to handle my checkpoints, but I'm a
>>>> little stuck on how to set the path to my checkpoint and the executable
>>>> (ompi_blcr_context.PID). I took a look at the code in
>>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>>
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>     /*
>>>>      * OPAL CRS components need the opportunity to take action
>>>>      * before a process is forked.
>>>>      * Needs access to:
>>>>      *  - Environment
>>>>      *  - Rank/ORTE Name
>>>>      *  - Binary to exec
>>>>      */
>>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>>                                       orte_sstore_base_prelaunch_location,
>>>>                                       &(app->app),
>>>>                                       &(app->cwd),
>>>>                                       &(app->argv),
>>>>                                       &(app->env) ) ) ) {
>>>>             ORTE_ERROR_LOG(rc);
>>>>             goto CLEANUP;
>>>>         }
>>>>     }
>>>> #endif
>>>>
>>>>
>>>> But I didn't find out how to set orte_sstore_base_prelaunch_location; I
>>>> know that initially this is set in sstore_base_open. For example, as I'm
>>>> transferring my checkpoint from one node to another, I store the checkpoint
>>>> that has to be restored in /tmp/1/ and it has a name
>>>> like ompi_blcr_context.PID.
>>>>
>>>> Is there any function that I missed that allows me to do this? I'm
>>>> asking this because I do not want to change the signatures of the
>>>> functions just to pass the details of the checkpoint and the PID.
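>>>>
>>>> What I would like to end up with is roughly this (a sketch only,
>>>> assuming orte_sstore_base_prelaunch_location is a plain heap-allocated
>>>> string that the launch path reads):
>>>>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>>
>>>> #include "orte/mca/sstore/base/base.h"
>>>>
>>>> /* Sketch: point the sstore prelaunch location at the directory where
>>>>  * the transferred checkpoint lives, before crs_prelaunch() runs.
>>>>  * Whether replacing the global like this is safe is exactly what I am
>>>>  * asking about. */
>>>> static void set_restart_location(const char *ckpt_dir)
>>>> {
>>>>     if (NULL != orte_sstore_base_prelaunch_location) {
>>>>         free(orte_sstore_base_prelaunch_location);
>>>>     }
>>>>     orte_sstore_base_prelaunch_location = strdup(ckpt_dir); /* e.g. "/tmp/1" */
>>>> }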
>>>>
>>>> Best Regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> 2011/3/30 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>
>>>>> Thanks Ralph.
>>>>> I have finished point (a) and it's working now; next I have to work on
>>>>> relaunching from my checkpoint as you said.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/29 Ralph Castain <rhc_at_[hidden]>
>>>>>
>>>>>> The resilient mapper -only- works on procs being restarted - it
>>>>>> cannot map a job for its initial launch. You shouldn't set any rmaps flag
>>>>>> and things will work correctly - the default round-robin mapper will map the
>>>>>> initial launch, and then the resilient mapper will handle restarts.
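>>>>>>
>>>>>> In other words, for the command you show below, just dropping the
>>>>>> explicit rmaps selection should be enough, e.g.:
>>>>>>
>>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 \
>>>>>>     --hostfile ../hostfile --bynode -mca vprotocol receiver \
>>>>>>     -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt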
>>>>>>
>>>>>>
>>>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>>>
>>>>>> Ralph.
>>>>>>
>>>>>> I'm having a problem when I try to select the resilient rmaps component
>>>>>> to be used:
>>>>>>
>>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4
>>>>>> --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver
>>>>>> -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>>>
>>>>>>
>>>>>> I get this as error:
>>>>>>
>>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile
>>>>>> for nodes
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Your job failed to map. Either no mapper was available, or none
>>>>>> of the available mappers was able to perform the requested
>>>>>> mapping operation. This can happen if you request a map type
>>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>>>> Process state updated for process NULL
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>>> NEVER LAUNCHED
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>>>>> with status 1
>>>>>>
>>>>>>
>>>>>> Is there a flag that I'm not turning on? Or a component that I should
>>>>>> have selected?
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>>
>>>>>> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>>>
>>>>>>> Ok Ralph.
>>>>>>>
>>>>>>> Thanks a lot for your help, I will do as you said and then let you
>>>>>>> know how it goes.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>> Hugo Meyer
>>>>>>>
>>>>>>>
>>>>>>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>>>
>>>>>>>> From what you've described before, I suspect all you'll need to do
>>>>>>>>> is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>>>>>>>>> checks to see if a process in the launch message is being relocated (the
>>>>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>>>>> required info to all local child processes so they can take appropriate
>>>>>>>>> action.
>>>>>>>>>
>>>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I looked at the code you mentioned and I realize that I have two
>>>>>>>> possible options, which I'm going to share with you to get your
>>>>>>>> opinion.
>>>>>>>>
>>>>>>>> First of all, I will let you know my current situation with the
>>>>>>>> implementation. I'm working on a fault-tolerant system using
>>>>>>>> uncoordinated checkpointing: I take checkpoints of all my processes at
>>>>>>>> different times and store them on the machine where they are residing,
>>>>>>>> but I also send these checkpoints to another node (let's call it the
>>>>>>>> protector), so if a node fails its processes should be restarted on the
>>>>>>>> protector that holds their checkpoints.
>>>>>>>>
>>>>>>>> Right now I'm detecting the failure of a process and I know where
>>>>>>>> this process should be restarted, and I also have the checkpoint on the
>>>>>>>> protector. And of course I also have the child information.
>>>>>>>>
>>>>>>>> So, my options are:
>>>>>>>>
>>>>>>>> *First Option*
>>>>>>>>
>>>>>>>> I detect the failure and then use
>>>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>>>>>> hnp_relocate, but changing the spawning so it restarts from a checkpoint.
>>>>>>>> I suppose that this way the migration of the process to another node will
>>>>>>>> be updated and everyone will know about it, because it is the HNP that is
>>>>>>>> going to do this (is this ok?).
>>>>>>>>
>>>>>>>>
>>>>>>>> This is the option I would use. The other one is much, much more
>>>>>>>> work. In this option, you only have to:
>>>>>>>>
>>>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>>>> being restarted. The resilient mapper module will be handling the restart -
>>>>>>>> if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the
>>>>>>>> code doing the "replacement" and modify accordingly.
>>>>>>>>
>>>>>>>> (b) add any required info about your checkpoint to the launch
>>>>>>>> message. This gets created in orte/mca/odls/base/odls_base_default_fns.c,
>>>>>>>> the "get_add_procs_data" function (at the top of the file).
>>>>>>>>
>>>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>>>> see the file in (b), the "construct_child" and "launch" functions.
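>>>>>>>>
>>>>>>>> For (b), a minimal sketch of the packing (the helper name and where
>>>>>>>> exactly it gets called inside get_add_procs_data are up to you):
>>>>>>>>
>>>>>>>> /* Sketch: append the checkpoint reference for a restarted proc to
>>>>>>>>  * the launch buffer so the remote orted can unpack it and hand it
>>>>>>>>  * to the child.  Not existing code - just the shape of it. */
>>>>>>>> static int pack_ckpt_info(opal_buffer_t *data, char *ckpt_ref)
>>>>>>>> {
>>>>>>>>     int rc;
>>>>>>>>
>>>>>>>>     if (OPAL_SUCCESS != (rc = opal_dss.pack(data, &ckpt_ref, 1, OPAL_STRING))) {
>>>>>>>>         ORTE_ERROR_LOG(rc);
>>>>>>>>     }
>>>>>>>>     return rc;
>>>>>>>> }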
>>>>>>>>
>>>>>>>> HTH
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Second Option*
>>>>>>>>
>>>>>>>> Modify one of the spawn variations (probably the remote_spawn from
>>>>>>>> rsh) in the PLM framework and then use the orted_comm to command a
>>>>>>>> remote_spawn on the protector, but here I don't know how to update the
>>>>>>>> info so everyone knows about the change, or how this is managed.
>>>>>>>>
>>>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>>>>
>>>>>>>> Thanks a lot for all the help.
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> Hugo Meyer
>>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
> <out>
>
>
>