
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Add child to another parent.
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-04-06 09:50:52


Hi all.

I corrected the error with the port. The mistake was that, because the ports
are static, the restarted process tried to take a port where another
application was already running.
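As an aside, the fallback idea that came up earlier in this thread (retry on an ephemeral port when the static one is busy) can be sketched with plain sockets. This is a minimal, self-contained sketch; `try_bind` is a hypothetical helper and not part of the actual mca_oob_tcp code:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to bind a TCP socket on the loopback interface to preferred_port.
 * If that port is already in use, fall back to an ephemeral port chosen
 * by the kernel (port 0). Returns the port actually bound, or -1 on
 * error. Hypothetical helper -- not Open MPI code. */
int try_bind(int preferred_port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons((unsigned short)preferred_port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (errno != EADDRINUSE) {
            close(fd);
            return -1;
        }
        /* Static port already taken: let the kernel pick a free one. */
        addr.sin_port = 0;
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }
    }

    /* Report the port we actually got, so it could be published to peers. */
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    int bound = ntohs(addr.sin_port);
    close(fd);
    return bound;
}
```

The important part for the restart case is the getsockname() call: whatever port the restarted process ends up on, it still has to be published somehow so the daemons and peers can update their contact information.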

Initially, the process was running on [[65478,0],1], and then it moved to
[[65478,0],2].

So now I get the socket bound, but I'm getting a communication failure in
[[65478,0],1]. I'm sending my debug output as an attachment (some messages are
in Spanish, but the default Open MPI debug output is still there). There you
can see the moment where I kill the process running on *clus5* and the moment
where it is restored on *clus3*. Then I get a TERMINATED WITHOUT SYNC for the
restarted proc:

*clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC
for proc [[65478,1],1] pid 21705*

Here is my stdout output after the socket is bound again when the process
restarts.

[1,1]<stdout>:SOCKET BINDED
[1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final
handshake.
[1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status
(13, /tmp/radic/1) for [[65478,1],1]
[1,0]<stdout>:INICIEI O BROADCAST (6)
[1,0]<stdout>:FINALIZEI O BROADCAST (6)
[1,0]<stdout>:INICIEI O BROADCAST
[1,3]<stdout>:INICIEI O BROADCAST (6)
[1,2]<stdout>:INICIEI O BROADCAST (6)
[1,3]<stdout>:FINALIZEI O BROADCAST (6)
[1,3]<stdout>:INICIEI O BROADCAST
[1,2]<stdout>:FINALIZEI O BROADCAST (6)
[1,2]<stdout>:INICIEI O BROADCAST
[1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported
state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION
FAILURE exit_code 1
[1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline
[[65478,0],1] lost
[1,1]<stdout>:[[65478,1],1] assigned port 31256

Any help on how to solve this error, or how to interpret it, will be greatly
appreciated.

Best regards.

Hugo

2011/4/5 Hugo Meyer <meyer.hugo_at_[hidden]>

> Hello Ralph and @ll.
>
> Ralph, by following your recommendations I've already restarted the process
> on another node from its checkpoint. But now I'm having a small problem with
> the oob_tcp. Here is the output:
>
> odls_base_default_fns:SETEANDO BLCR CONTEXT
> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
> [1,1]<stdout>:INICIEI O BROADCAST (2)
> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4
> listen socket: Unable to open a TCP socket for out-of-band communications*
> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final
> handshake.
> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status
> (13, /tmp/radic/1) for [[34224,1],1]
> [1,0]<stdout>:INICIEI O BROADCAST (6)
> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
> [1,0]<stdout>:INICIEI O BROADCAST
> [1,3]<stdout>:INICIEI O BROADCAST (6)
> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
> [1,3]<stdout>:INICIEI O BROADCAST
> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0]
> reported state COMMUNICATION FAILURE for proc [[34224,0],1] state
> COMMUNICATION FAILURE exit_code 1*
> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to
> lifeline [[34224,0],1] lost*
>
>
> I think this error occurs because the process wants to create the
> socket using the port that was previously assigned to it. So, if I want to
> restart it using another port or something, how will the other daemons and
> processes find out about this? Is this a good choice?
>
> Best regards.
>
> Hugo Meyer
>
> 2011/3/31 Hugo Meyer <meyer.hugo_at_[hidden]>
>
>> Ok Ralph.
>> Thanks a lot, I will resend this message with a new subject.
>>
>> Best Regards.
>>
>> Hugo
>>
>>
>> 2011/3/31 Ralph Castain <rhc_at_[hidden]>
>>
>>> Sorry - should have included the devel list when I sent this.
>>>
>>>
>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>
>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>> take a quick glance at the sstore framework, though, and it looks like there
>>> are some params you could set that might help.
>>>
>>> "ompi_info --param sstore all"
>>>
>>> should tell you what's available. Also, note that Josh created a man page
>>> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
>>> should get it.
>>>
>>>
>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>
>>> Hello again.
>>>
>>> I'm working on the launch code to handle my checkpoints, but I'm a little
>>> stuck on how to set the path to my checkpoint and the executable
>>> (ompi_blcr_context.PID). I took a look at the code in
>>> odls_base_default_fns.c, and this piece of code caught my attention:
>>>
>>> #if OPAL_ENABLE_FT_CR == 1
>>>     /*
>>>      * OPAL CRS components need the opportunity to take action
>>>      * before a process is forked.
>>>      * Needs access to:
>>>      *  - Environment
>>>      *  - Rank/ORTE Name
>>>      *  - Binary to exec
>>>      */
>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>                                       orte_sstore_base_prelaunch_location,
>>>                                       &(app->app),
>>>                                       &(app->cwd),
>>>                                       &(app->argv),
>>>                                       &(app->env) ) ) ) {
>>>             ORTE_ERROR_LOG(rc);
>>>             goto CLEANUP;
>>>         }
>>>     }
>>> #endif
>>>
>>>
>>> But I didn't find out how to set orte_sstore_base_prelaunch_location; I
>>> know that initially it is set in sstore_base_open. For example, as I'm
>>> transferring my checkpoint from one node to another, I store the checkpoint
>>> that has to be restored in /tmp/1/, and it has a name
>>> like ompi_blcr_context.PID.
>>>
>>> Is there any function that I missed that would allow me to do this? I'm
>>> asking because I do not want to change the signature of the
>>> functions to pass the details of the checkpoint and the PID.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>> 2011/3/30 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>
>>>> Thanks Ralph.
>>>> I have finished point (a) and it's working now; next I have to work
>>>> on relaunching from my checkpoint as you said.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/29 Ralph Castain <rhc_at_[hidden]>
>>>>
>>>>> The resilient mapper -only- works on procs being restarted - it cannot
>>>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>>>> things will work correctly - the default round-robin mapper will map the
>>>>> initial launch, and then the resilient mapper will handle restarts.
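>>>>> In other words, it would be your same command line minus the rmaps
>>>>> selection (a sketch, untested):
>>>>>
>>>>> mpirun -np 4 --hostfile ../hostfile --bynode -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10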
>>>>>
>>>>>
>>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>>
>>>>> Ralph.
>>>>>
>>>>> I'm having a problem when I try to select the resilient rmaps component:
>>>>>
>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4
>>>>> --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver
>>>>> -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>>
>>>>>
>>>>> I get this error:
>>>>>
>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>>>>> nodes
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Your job failed to map. Either no mapper was available, or none
>>>>> of the available mappers was able to perform the requested
>>>>> mapping operation. This can happen if you request a map type
>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>>> Process state updated for process NULL
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>>>> NEVER LAUNCHED
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>>>> with status 1
>>>>>
>>>>>
>>>>> Is there a flag that I'm not turning on, or a component that I should
>>>>> have selected?
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>>>>>
>>>>>> Ok Ralph.
>>>>>>
>>>>>> Thanks a lot for your help, I will do as you said and then let you
>>>>>> know how it goes.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>>
>>>>>> 2011/3/25 Ralph Castain <rhc_at_[hidden]>
>>>>>>
>>>>>>>
>>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>>
>>>>>>> From what you've described before, I suspect all you'll need to do is
>>>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>>>>>>>> to see if a process in the launch message is being relocated (the
>>>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>>>> required info to all local child processes so they can take appropriate
>>>>>>>> action.
>>>>>>>>
>>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>>> you.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I looked at the code you mentioned and realized that I have two
>>>>>>> possible options, which I'm going to share with you to get your opinion.
>>>>>>>
>>>>>>> First of all, I'll describe my current situation with the
>>>>>>> implementation. I'm working on a fault-tolerant system using
>>>>>>> uncoordinated checkpointing: I take checkpoints of all my processes at
>>>>>>> different times and store them on the machines where they reside, but I
>>>>>>> also send these checkpoints to another node (let's call it the
>>>>>>> protector), so if a node fails, its processes can be restarted on the
>>>>>>> protector that holds their checkpoints.
>>>>>>>
>>>>>>> Right now I am detecting the failure of a process, I know where the
>>>>>>> process should be restarted, and I also have the checkpoint on the
>>>>>>> protector. And of course I have the child information.
>>>>>>>
>>>>>>> So, my options are:
>>>>>>> *First Option*
>>>>>>>
>>>>>>> I detect the failure, and then I use
>>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications,
>>>>>>> plus hnp_relocate, but changing the spawning to restart from a
>>>>>>> checkpoint. I suppose that by using this, the migration of the process
>>>>>>> to another node will be propagated and everyone will know about it,
>>>>>>> because the HNP is the one doing it (is this OK?).
>>>>>>>
>>>>>>>
>>>>>>> This is the option I would use. The other one is much, much more
>>>>>>> work. In this option, you only have to:
>>>>>>>
>>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>>> being restarted. The resilient mapper module will be handling the restart -
>>>>>>> if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the
>>>>>>> code doing the "replacement" and modify accordingly.
>>>>>>>
>>>>>>> (b) add any required info about your checkpoint to the launch
>>>>>>> message. This gets created in orte/mca/odls/base/odls_base_default_fns.c,
>>>>>>> the "get_add_procs_data" function (at the top of the file).
>>>>>>>
>>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Second Option*
>>>>>>>
>>>>>>> Modify one of the spawn variations (probably remote_spawn from rsh) in
>>>>>>> the PLM framework, and then use orted_comm to command a remote_spawn on
>>>>>>> the protector; but here I don't know how to update the info so everyone
>>>>>>> knows about the change, or how this is managed.
>>>>>>>
>>>>>>> I might be very wrong in what I said; my apologies if so.
>>>>>>>
>>>>>>> Thanks a lot for all the help.
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> Hugo Meyer
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>




  • application/octet-stream attachment: out