Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Sergio Díaz (sdiaz_at_[hidden])
Date: 2009-11-02 11:43:45


Hi again,

I found a C program to test ompi-checkpoint/restart and it works fine.
The program was written by Alan Woodland and shared on the following
mailing list: debian-bugs-dist_at_[hidden]
It starts a countdown from 10 to 0 and, when the countdown reaches
6, takes a checkpoint, kills the process and restarts it.
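
The checkpoint, kill and restart sequence that the test program goes through could also be driven from the shell, roughly as sketched here. This is not Alan Woodland's program: `run` and `ckpt_cycle` are hypothetical helper names, and the snapshot name `ompi_global_snapshot_<PID>.ckpt` is an assumption about the Open MPI 1.3 default. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: checkpoint a running mpirun, terminate it, then restart it
# from the snapshot. DRY_RUN=1 (default) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ckpt_cycle PID: hypothetical helper; PID must be the PID of mpirun.
ckpt_cycle() {
    pid="$1"
    # --term terminates the application after the checkpoint.
    run ompi-checkpoint --term "$pid" || return 1
    # Assumption: the global snapshot reference is named after the mpirun PID.
    run ompi-restart "ompi_global_snapshot_${pid}.ckpt"
}
```

For example, `ckpt_cycle 20112` prints the two commands it would run for an mpirun with PID 20112; setting DRY_RUN=0 before the call would execute them.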

However, I still have the problem when I try to checkpoint by hand,
directly on a node.
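
When checkpointing by hand, the PID given to ompi-checkpoint has to be the PID of mpirun itself. A small sketch for picking it out of the process list instead of reading it by eye; `filter_mpirun` and `my_mpirun_pids` are hypothetical helper names, and the sketch assumes the launcher appears under the command name "mpirun" (it may show up under a different name on some installations).

```shell
#!/bin/sh
# Print the PID of every process whose command name is exactly "mpirun".
# Input: lines of "PID COMMAND", as produced by `ps -o pid=,comm=`.
filter_mpirun() {
    awk '$2 == "mpirun" { print $1 }'
}

# List the mpirun processes owned by the current user.
my_mpirun_pids() {
    ps -u "$(id -un)" -o pid=,comm= | filter_mpirun
}
```

The result can then be passed on, for example `ompi-checkpoint "$(my_mpirun_pids | head -n 1)"`, if only one mpirun is running on the node.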

Any ideas? :-(

Best regards
Sergio

Sergio Díaz wrote:
> Hello,
>
> I have managed to checkpoint a simple program without SGE. Now
> I'm trying to integrate openmpi+sge, but I have some
> problems... When I try to checkpoint the mpirun PID, I get an
> error similar to the one I get when the PID doesn't exist. See the
> example below.
>
> Any ideas?
> Does somebody have a script to do it automatically with SGE? For example, I
> have one that takes a checkpoint every X seconds with BLCR for non-MPI jobs.
> SGE launches it if you have configured the queue and the ckpt
> environment.
>
> Is it possible to choose the name of the ckpt folder when you run
> ompi-checkpoint? I can't find an option for it.
>
>
> Regards,
> Sergio
>
>
> --------------------------------
>
> [sdiaz_at_compute-3-17 ~]$ ps auxf
> ....
> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_ sge_shepherd-2645150 -bg
> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00 \_ pi3
>
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
> Open MPI Checkpoint Tool
>
>    -am <arg0>            Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                          Pass global MCA parameters that are applicable to
>                          all contexts (arg0 is the parameter name; arg1 is
>                          the parameter value)
>    -h|--help             This help message
>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose
>                          applications you wish to checkpoint.
>    --hnp-pid <arg0>      This should be the pid of the mpirun whose
>                          applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                          Pass context-specific MCA parameters; they are
>                          considered global if --gmca is not used and only
>                          one context is specified (arg0 is the parameter
>                          name; arg1 is the parameter value)
>    -s|--status           Display status messages describing the progression
>                          of the checkpoint
>    --term                Terminate the application after checkpoint
>    -v|--verbose          Be Verbose
>    -w|--nowait           Do not wait for the application to finish
>                          checkpointing before returning
>
> --------------------------------------------------------------------------
> [sdiaz_at_compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz_at_svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf |grep sdiaz
>
> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00 \_ pi3
>
>
>
>
>
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>

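For the periodic checkpointing asked about in the quoted message (a checkpoint every X seconds), a loop along these lines might be a starting point. It is only a sketch and not an SGE integration: `periodic_ckpt` is a hypothetical helper name, and CKPT_CMD is an illustrative variable that can be set to "echo" to try the loop without touching a real job.

```shell
#!/bin/sh
# Sketch: checkpoint an mpirun process every INTERVAL seconds while it is alive.
# CKPT_CMD can be overridden (for example with "echo") to try the loop safely.
CKPT_CMD="${CKPT_CMD:-ompi-checkpoint}"

# periodic_ckpt PID INTERVAL [MAX]: hypothetical helper.
# MAX limits the number of checkpoints; 0 (default) means no limit.
periodic_ckpt() {
    pid="$1"; interval="$2"; max="${3:-0}"
    count=0
    # kill -0 sends no signal; it only tests that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        $CKPT_CMD "$pid" || return 1
        count=$((count + 1))
        if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For example, `periodic_ckpt 20112 300` would request a checkpoint of the mpirun with PID 20112 every 300 seconds until that process exits.
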
-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sdiaz_at_[hidden] ; http://www.cesga.es/
------------------------------------------------