Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-11-06 09:08:43


On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:

> Hello,
>
> I have managed to checkpoint a simple program without SGE. Now I'm
> trying to integrate Open MPI with SGE, but I have run into some
> problems. When I try to checkpoint the mpirun PID, I get an error
> similar to the one I get when the PID doesn't exist. See the example
> below.

I do not have any experience with the SGE environment, so I suspect
that there may be something 'special' about the environment that is
tripping up the ompi-checkpoint tool.

First of all, what version of Open MPI are you using?

Some things to check:
  - Does 'ompi-ps' work when your application is running?
  - Is there a /tmp/openmpi-sessions-* directory on the node where
mpirun is currently running? This directory contains information on
how to connect to the mpirun process from an external tool. If it is
missing, that could be the cause of the problem.
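
The second check can be scripted. This is only a minimal sketch: the
/tmp/openmpi-sessions-* pattern is the one mentioned above, while the
function name and its optional root argument are hypothetical and
exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every Open MPI session directory found under
# a temporary root (default /tmp). Returns 0 if at least one exists and
# 1 otherwise.
find_ompi_sessions() {
    tmp_root="${1:-/tmp}"
    found=1
    for dir in "$tmp_root"/openmpi-sessions-*; do
        # An unmatched glob is left as literal text, so confirm that the
        # entry really is a directory before reporting it.
        if [ -d "$dir" ]; then
            echo "$dir"
            found=0
        fi
    done
    return "$found"
}
```

Run it on the node where mpirun is running, for example
`find_ompi_sessions /tmp || echo "no session directory found"`. If the
job uses a different temporary directory, pass that directory as the
argument instead.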

>
> Any ideas?
> Does anybody have a script to do this automatically with SGE? For
> example, I have one that checkpoints non-MPI jobs every X seconds
> with BLCR. SGE launches it if you have configured the queue and the
> ckpt environment.

I do not know of any integration of the Open MPI checkpointing work
with SGE at the moment.

As for time-triggered checkpointing, I have a feature ticket open
for it:
   https://svn.open-mpi.org/trac/ompi/ticket/1961

It is not available yet, but it is in the works.
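
Until then, a periodic checkpoint can be approximated from outside the
job. The following is a rough sketch, not an Open MPI feature: it calls
`ompi-checkpoint PID_OF_MPIRUN` in a loop while the mpirun process is
still alive. The function name, the optional cap on the number of
checkpoints, and the CKPT_CMD override (useful for a dry run) are all
assumptions made for this example, and the loop only helps once
ompi-checkpoint can actually find the mpirun process.

```shell
#!/bin/sh
# Sketch: checkpoint an mpirun job every INTERVAL seconds.
#   $1 = PID of mpirun
#   $2 = seconds to wait between checkpoints
#   $3 = optional maximum number of checkpoints (0 or unset = no limit)
# CKPT_CMD is a hypothetical override for the checkpoint command.
periodic_checkpoint() {
    mpirun_pid="$1"
    interval="$2"
    max="${3:-0}"
    ckpt_cmd="${CKPT_CMD:-ompi-checkpoint}"
    taken=0
    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$mpirun_pid" 2>/dev/null; do
        sleep "$interval"
        # Stop quietly if the job finished during the sleep.
        kill -0 "$mpirun_pid" 2>/dev/null || break
        "$ckpt_cmd" "$mpirun_pid" ||
            echo "checkpoint of $mpirun_pid failed" >&2
        taken=$((taken + 1))
        if [ "$max" -gt 0 ] && [ "$taken" -ge "$max" ]; then
            break
        fi
    done
}
```

For example, `periodic_checkpoint 20112 600` would attempt a checkpoint
of the mpirun with PID 20112 every ten minutes until that process
exits.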

>
> Is it possible to choose the name of the ckpt folder when you run
> ompi-checkpoint? I can't find an option for it.

Not at this time, though I can see it being a useful feature, and it
shouldn't be too hard to implement. I filed a ticket if you want to
follow the progress:
   https://svn.open-mpi.org/trac/ompi/ticket/2098

-- Josh

>
>
> Regards,
> Sergio
>
>
> --------------------------------
>
> [sdiaz_at_compute-3-17 ~]$ ps auxf
> ....
> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_ sge_shepherd-2645150 -bg
> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00 \_ pi3
>
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>
> [sdiaz_at_compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
> Open MPI Checkpoint Tool
>
>    -am <arg0>            Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                          Pass global MCA parameters that are applicable
>                          to all contexts (arg0 is the parameter name;
>                          arg1 is the parameter value)
>    -h|--help             This help message
>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose
>                          applications you wish to checkpoint.
>    --hnp-pid <arg0>      This should be the pid of the mpirun whose
>                          applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                          Pass context-specific MCA parameters; they are
>                          considered global if --gmca is not used and
>                          only one context is specified (arg0 is the
>                          parameter name; arg1 is the parameter value)
>    -s|--status           Display status messages describing the
>                          progression of the checkpoint
>    --term                Terminate the application after checkpoint
>    -v|--verbose          Be Verbose
>    -w|--nowait           Do not wait for the application to finish
>                          checkpointing before returning
>
> --------------------------------------------------------------------------
> [sdiaz_at_compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz_at_svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf |grep sdiaz
>
> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00 \_ pi3
>
>
>
>
>
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sdiaz_at_[hidden] ; http://www.cesga.es/