Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Restarting processes on different node
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-23 08:08:17


Thanks Paul,

It's working fine with PRELINK=NO.
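
For the archives: "PRELINK=NO" here means disabling prelink as the
BLCR FAQ describes. On Fedora/RHEL-style systems that is roughly the
following (the config path may differ on other distributions):

  # in /etc/sysconfig/prelink, set:
  PRELINKING=no

  # then undo the prelinking already applied to installed binaries:
  /usr/sbin/prelink --undo --all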

Leonardo

Paul H. Hargrove wrote:
> Leonardo,
>
> As you say, there is the possibility that moving from one node to
> another has caused problems due to different shared libraries. The
> result could be a segmentation fault, an illegal instruction, or
> even a bus error. In all three cases, however, the failure
> generates a signal (SIGSEGV, SIGILL or SIGBUS). So it is possible
> that you are seeing exactly the failure mode you were expecting.
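>
> For what it's worth, the parent that forks cr_restart can tell
> exactly which of these signals killed the child via standard
> waitpid() status decoding; a minimal sketch (plain POSIX, not OPAL
> code):
>
> #include <stdio.h>
> #include <sys/wait.h>
>
> static void report_child_exit(pid_t child)
> {
>     int status;
>     if (waitpid(child, &status, 0) > 0 && WIFSIGNALED(status)) {
>         /* on Linux/x86: 11 = SIGSEGV, 4 = SIGILL, 7 = SIGBUS */
>         fprintf(stderr, "child killed by signal %d\n",
>                 WTERMSIG(status));
>     }
> }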
> There are at least two ways you can deal with heterogeneous
> libraries. The first: if the libs differ only because of
> prelinking, you can undo the prelinking as described in the BLCR
> FAQ (http://mantis.lbl.gov/blcr/doc/html/FAQ.html#prelink).
> The second is to include the shared libraries in the checkpoint
> itself. While this is very costly in terms of storage, you may find
> it lets you restart in cases where you otherwise could not. The
> trick is to add --save-private or --save-all to the checkpoint
> command that Open MPI uses to checkpoint the application
> processes.
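>
> With BLCR's command-line utility that would look like the following
> (a sketch; inside Open MPI the flag has to be added wherever the
> CRS component issues the checkpoint request):
>
>   cr_checkpoint --save-all <PID>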
>
> -Paul
>
> Leonardo Fialho wrote:
>> Hi All,
>>
>> I'm trying to implement my FT architecture in Open MPI. Right now
>> I need to restart a faulty process from a checkpoint. I saw that
>> Josh uses orte-restart, which calls opal-restart through an
>> ordinary mpirun call. That is not good for me, because in that
>> case the restarted process ends up in a new job. I need to restart
>> the process checkpoint in the same job and on another node, under
>> an existing orted. The checkpoints are taken without the "--term"
>> option.
>>
>> My modified orted receives a "restart request" from my modified
>> heartbeat mechanism. I first tried to restart using the BLCR
>> cr_restart command. That does not work, I think because
>> stderr/stdin/stdout were not handled by the OPAL environment. So I
>> tried to restart the checkpoint by forking the orted and doing an
>> execvp to opal-restart. It recovers the checkpoint, but after
>> "opal_cr_init" it dies (*** Process received signal ***).
>>
>> Here is the job structure (from ompi-ps) after a fault:
>>
>> Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.
>> -------------+--------------+------------+-------+--------+---------+--------------
>> orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |
>> orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3]
>> orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3]
>> orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4]
>> orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1]
>>
>>
>> Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector
>> -------------+--------------+------------+------+-------+-----------+------------+--------------+--------------
>> ./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2]
>> ./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3]
>> ./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4]
>> ./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1]
>>
>>
>> The orted running on "nodo2" dies. The failure is detected by the
>> orted [[8002,0],1] running on "nodo1", which informs the HNP. The
>> HNP updates the procs structure, looks for processes that were
>> running on the faulty node, and sends a restart request to the
>> orted which holds the checkpoints of the faulty processes.
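>>
>> In other words, the "Protector" column of the second table drives
>> recovery. A tiny self-contained sketch of that decision logic (the
>> struct and function names are hypothetical, not real ORTE code):
>>
>> #include <stdio.h>
>> #include <string.h>
>>
>> struct app_proc {
>>     const char *name;      /* e.g. "[[8002,1],1]" */
>>     const char *node;      /* node it was running on */
>>     const char *ckpt_loc;  /* e.g. "/tmp/radic/1" */
>>     const char *protector; /* orted holding its checkpoint */
>> };
>>
>> /* HNP side: on a node fault, ask each victim's protector orted
>>  * to restore the victim from the checkpoint it already holds. */
>> static void recover_node(const struct app_proc *procs, int n,
>>                          const char *faulty_node)
>> {
>>     for (int i = 0; i < n; i++) {
>>         if (0 == strcmp(procs[i].node, faulty_node)) {
>>             printf("restore %s from %s via %s\n", procs[i].name,
>>                    procs[i].ckpt_loc, procs[i].protector);
>>         }
>>     }
>> }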
>>
>> Below is the log generated:
>>
>> [aoclsb:30434] [[8002,0],0] orted_recv: update state request from
>> [[8002,0],3]
>> [aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17)
>> for orted process (vpid=2)
>> [aoclsb:30434] [[8002,0],0] orted_update_state: found process
>> [[8002,1],1] on node nodo2, requesting recovery task for that
>> [aoclsb:30434] [[8002,0],0] orted_update_state: sending restore
>> ([[8002,1],1] process) request to [[8002,0],3]
>> [nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request
>> from [[8002,0],0]
>> [nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting
>> process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
>> [nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing
>> restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
>> [nodo3:05924] opal_cr: init: Verbose Level: 1024
>> [nodo3:05924] opal_cr: init: FT Enabled: 1
>> [nodo3:05924] opal_cr: init: Is a tool program: 1
>> [nodo3:05924] opal_cr: init: Checkpoint Signal: 10
>> [nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
>> [nodo3:05924] opal_cr: init: Temp Directory: /tmp
>> [nodo2:05965] *** Process received signal ***
>>
>> The orted which receives the restart request forks and then calls
>> execvp for opal-restart, and then, unfortunately, the restarted
>> process dies. I know the restarted process should generate errors,
>> because the URI of its daemon is incorrect, as are all the other
>> environment variables, but that should produce a communication
>> error, or some error other than the process being killed. My
>> question is:
>>
>> 1) Why does this process die? I suspect that the checkpoint holds
>> pointers into libraries which are not loaded, or are loaded at a
>> different memory position (because the checkpoint comes from
>> another node). In that case the error should be a "segmentation
>> fault" or something like that, no?
>>
>>
>> If somebody has some information or can give me some help with
>> this error, I'll be grateful.
>>
>> Thanks,
>>
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edificio Q, QC/3088
>> http://www.caos.uab.es
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478