
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Restarting processes on different node
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2008-10-22 20:06:20


Leonardo,

  As you say, there is the possibility that moving from one node to
another has caused problems due to different shared libraries. The
result could be a segmentation fault, an illegal instruction, or even a
bus error. In all three cases, however, the failure generates a signal
(SIGSEGV, SIGILL or SIGBUS). So, it is possible that you are seeing
exactly the failure mode that you were expecting.
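  If you want to confirm which signal it is, one quick diagnostic
(plain C, not Open MPI code) is to install a handler for those three
signals in the application before the checkpoint is taken, so that it
is still in place after the restart. A minimal sketch:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Report which fatal signal arrived, then exit with the
     * conventional 128+signal status. */
    static void fatal_sig(int sig)
    {
        /* fprintf() is not async-signal-safe; acceptable only for
         * one-shot debugging like this. */
        fprintf(stderr, "*** died with signal %d ***\n", sig);
        _exit(128 + sig);
    }

    static void install_handlers(void)
    {
        signal(SIGSEGV, fatal_sig);
        signal(SIGILL,  fatal_sig);
        signal(SIGBUS,  fatal_sig);
    }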
  There are at least two ways you can deal with heterogeneous
libraries. The first is that if the libs differ only because of
prelinking, you can undo the prelinking as described in the BLCR FAQ
(http://mantis.lbl.gov/blcr/doc/html/FAQ.html#prelink).
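  For example, on a Linux distribution that runs prelink, something
like the following (as root, on each node, followed by taking a fresh
checkpoint) undoes it system-wide; see the FAQ entry for
distribution-specific details:

    prelink --undo --all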
  The second would be to include the shared libraries in the checkpoint
itself. While this is very costly in terms of storage, you may find it
lets you restart in cases where you might not otherwise be able to. The
trick is to add --save-private or --save-all to the checkpoint command
that Open MPI uses to checkpoint the application processes.
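  Done by hand, using the PID from your ompi-ps output, that would look
something like:

    cr_checkpoint --save-all 6086

As I recall, --save-private captures the private file-backed mappings,
while --save-all also captures the shared ones, which is what pulls in
the libraries.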

-Paul

Leonardo Fialho wrote:
> Hi All,
>
> I'm trying to implement my FT architecture in Open MPI. Right now I
> need to restart a faulty process from a checkpoint. I saw that Josh
> uses orte-restart, which calls opal-restart through an ordinary mpirun
> call. That is not good for me because in that case the restarted
> process ends up in a new job. I need to restart the process checkpoint
> in the same job, on another node, under an existing orted. The
> checkpoints are taken without the "--term" option.
>
> My modified orted receives a "restart request" from my modified
> heartbeat mechanism. I have tried to restart using the BLCR cr_restart
> command. It does not work, I think because stderr/stdin/stdout were
> not handled by the OPAL environment. So, I tried to restart the
> checkpoint by forking the orted and doing an execvp of opal-restart.
> It recovers the checkpoint, but after "opal_cr_init" it dies (***
> Process received signal ***).
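>
> The fork+execvp step looks roughly like this (a simplified sketch of
> my code; the real version also prepares stdio and the environment for
> the restarted process, and the snapshot directory is the one shown in
> the log below):
>
>     #include <stdio.h>
>     #include <sys/types.h>
>     #include <unistd.h>
>
>     static pid_t launch_restart(const char *snapshot_dir)
>     {
>         pid_t pid = fork();
>         if (0 == pid) {
>             /* Child: become opal-restart on the stored snapshot. */
>             char *argv[] = { "opal-restart",
>                              "-mca", "crs_base_snapshot_dir",
>                              (char *) snapshot_dir, ".", NULL };
>             execvp("opal-restart", argv);
>             perror("execvp");  /* reached only if the exec fails */
>             _exit(1);
>         }
>         return pid;  /* parent (the orted) continues */
>     }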
>
> As follows is the job structure (from ompi-ps) after a fault:
>
> Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.
> -------------------------------------------------------------------------------
> orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |
> orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3]
> orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3]
> orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4]
> orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1]
>
>
> Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector
> ------------------------------------------------------------------------------------------------------------
> ./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2]
> ./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3]
> ./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4]
> ./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1]
>
>
> The orted running on "nodo2" dies. This was detected by the orted
> [[8002,0],1] running on "nodo1" and reported to the HNP. The HNP
> updates the procs structure and looks for processes running on the
> faulty node, then sends a restart request to the orted which holds
> the checkpoints of the faulty processes.
>
> Below is the log generated:
>
> [aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
> [aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
> [aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
> [aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
> [nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
> [nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
> [nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
> [nodo3:05924] opal_cr: init: Verbose Level: 1024
> [nodo3:05924] opal_cr: init: FT Enabled: 1
> [nodo3:05924] opal_cr: init: Is a tool program: 1
> [nodo3:05924] opal_cr: init: Checkpoint Signal: 10
> [nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
> [nodo2:05965] *** Process received signal ***
>
> The orted which receives the restart request forks and then calls
> execvp on opal-restart, and then, unfortunately, the restarted process
> dies. I know that the restarted process should generate errors because
> the URI of its daemon is incorrect, as are all the other environment
> variables, but that should produce a communication error, or some kind
> of error other than the process being killed. My question is:
>
> 1) Why does this process die? I suspect that the checkpoint contains
> pointers into libraries which are not loaded, or are loaded at a
> different memory position (because this checkpoint comes from another
> node). In that case the error should be a "segmentation fault" or
> something like that, no?
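>
> (I suppose I could test this by comparing the libraries on the two
> nodes, e.g. running "ldd ./ping/wait" on nodo2 and nodo3 and diffing
> the output.)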
>
>
> If somebody has some information or can give me some help with this
> error, I'll be grateful.
>
> Thanks,
> --
>
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edificio Q, QC/3088
> http://www.caos.uab.es
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900