Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Andreea m. (Costea) (doodlie_snew_at_[hidden])
Date: 2009-11-02 19:56:15


I am having the same problem when I try to checkpoint manually: "HNP with PID xxxx Not found!", even though I am sure I am passing the right PID.
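
One way to double-check which PID belongs to mpirun is to filter the process list by command name rather than reading it off by eye. This is only a sketch: the function name mpirun_pids is made up for illustration, and it assumes the launcher shows up in ps with the command name "mpirun". It does not explain the "Not found!" error, it only rules out a mistyped PID.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Open MPI): print the PID of every
# process whose command name is exactly "mpirun".
# Input: "PID COMMAND" pairs on stdin, one process per line.
mpirun_pids() {
    awk '$2 == "mpirun" { print $1 }'
}
```

Typical use on the node where the job was started would be: ps -u "$USER" -o pid=,comm= | mpirun_pids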

--- On Mon, 11/2/09, Sergio Díaz <sdiaz_at_[hidden]> wrote:

From: Sergio Díaz <sdiaz_at_[hidden]>
Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
To: "Open MPI Users" <users_at_[hidden]>
Date: Monday, November 2, 2009, 6:43 PM

Hi again,

I found a C program to test ompi-checkpoint/restart and it works fine. The program was written by Alan Woodland and shared on the following mailing list: debian-bugs-dist_at_[hidden]
This program starts a countdown from 10 to 0 and, when the countdown reaches 6, it takes a checkpoint, kills the process and restarts it.
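
The checkpoint, kill and restart sequence described above can also be sketched from the shell. This is not Alan Woodland's program, only an illustration of the same steps. The helper names run and checkpoint_and_restart are hypothetical, and the snapshot name ompi_global_snapshot_<PID>.ckpt is an assumption about the default handle; the real handle should be taken from what ompi-checkpoint reports. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to really run the tools.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Checkpoint the job started by the given mpirun PID, terminate it
# (--term), then restart it from the snapshot.
# ASSUMPTION: the snapshot handle is ompi_global_snapshot_<PID>.ckpt.
checkpoint_and_restart() {
    local pid=$1
    run ompi-checkpoint --term "$pid"
    run ompi-restart "ompi_global_snapshot_${pid}.ckpt"
}
```

Calling checkpoint_and_restart 20112 in dry-run mode just lists the two commands, which makes it easy to check the PID before running anything.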

However, I still have the problem when I try to checkpoint by hand directly on a node.

Any ideas? :-(

Best regards
Sergio

Sergio Díaz wrote:
> Hello,
>
> I have managed to checkpoint a simple program without SGE. Now I'm trying to integrate openmpi+sge, but I have some problems... When I try to checkpoint the mpirun PID, I get an error similar to the one I get when the PID doesn't exist. See the example below.
>
> Any ideas?
> Does somebody have a script to do it automatically with SGE? For example, I have one that takes a checkpoint every X seconds with BLCR for non-MPI jobs. It is launched by SGE if you have configured the queue and the ckpt environment.
>
> Is it possible to choose the name of the ckpt folder when you run ompi-checkpoint? I can't find the option to do it.
>
>
> Regards,
> Sergio
>
>
> --------------------------------
>
> [sdiaz_at_compute-3-17 ~]$ ps auxf
> ....
> root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ sge_shepherd-2645150 -bg
> sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
> sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              \_ pi3
>
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
>   Open MPI Checkpoint Tool
>
>    -am <arg0>            Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                          Pass global MCA parameters that are applicable to
>                          all contexts (arg0 is the parameter name; arg1 is
>                          the parameter value)
> -h|--help                This help message
>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose
>                          applications you wish to checkpoint.
>    --hnp-pid <arg0>      This should be the pid of the mpirun whose
>                          applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                          Pass context-specific MCA parameters; they are
>                          considered global if --gmca is not used and only
>                          one context is specified (arg0 is the parameter
>                          name; arg1 is the parameter value)
> -s|--status              Display status messages describing the progression
>                          of the checkpoint
>    --term                Terminate the application after checkpoint
> -v|--verbose             Be Verbose
> -w|--nowait              Do not wait for the application to finish
>                          checkpointing before returning
>
> --------------------------------------------------------------------------
> [sdiaz_at_compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz_at_svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf |grep sdiaz
>
> sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              \_ pi3
>
>
>
>
>
> -- Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>
> ------------------------------------------------
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
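
The quoted message asks for a script that takes a checkpoint every X seconds. A minimal sketch of such a loop is below. It is not an SGE integration and does not address the "Not found!" error; it only shows the timing logic around the ompi-checkpoint PID_OF_MPIRUN usage shown in the help output. The function name periodic_checkpoint, the CKPT_CMD override and the optional maximum count are all hypothetical and exist only to make the loop easy to try out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checkpoint the given mpirun PID every INTERVAL
# seconds for as long as that process is alive.
#   $1 = PID of mpirun
#   $2 = seconds between checkpoints
#   $3 = optional maximum number of checkpoints (0 or unset = no limit)
# CKPT_CMD can replace ompi-checkpoint, e.g. CKPT_CMD=echo for a dry run.
periodic_checkpoint() {
    local pid=$1 interval=$2 max=${3:-0} count=0
    local ckpt=${CKPT_CMD:-ompi-checkpoint}
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # The job may have finished while we were sleeping.
        kill -0 "$pid" 2>/dev/null || break
        "$ckpt" "$pid" || echo "checkpoint of PID $pid failed" >&2
        count=$((count + 1))
        if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
            break
        fi
    done
}
```

For example, periodic_checkpoint 20112 600 would try a checkpoint every ten minutes until mpirun exits.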
